[Open] gohilankit opened this issue 8 years ago
Hi @gohilankit
There are several ways to specify filters; the easiest is to use the filter function.
So you should do something like data.filter(data.age > 3).collect(). The filter is pushed down to Mongo when possible.
DataFrames are lazy, so the query is only executed when a Spark action (collect, first, take, ...) is performed.
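For example, a minimal sketch in PySpark (the host, database, collection, and the age column are placeholder names, not from your setup):

reader = sqlContext.read.format("com.stratio.datasource.mongodb")
df = reader.options(host='localhost:27017', database='mydb', collection='mycoll').load()
# Nothing is read yet: load() only defines the DataFrame.
filtered = df.filter(df.age > 3)  # this predicate can be pushed down to MongoDB as a find() filter
result = filtered.collect()       # the query actually runs here, when the action is called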
Do you mean that, with this, Spark will load only the filtered data instead of the whole data set of the target collection? And is that also how Spark SQL queries are executed?
I am using this API to query a large MongoDB collection. Is there any way I can specify query filters to load only selected documents as a DataFrame, and not the whole collection? Probably some equivalent of find({'key':'value'}) (or more complex queries) in MongoDB. I'm currently using version spark-mongodb_2.10:0.11.0. I'm querying in PySpark with the command below, and load() just loads the whole collection, which takes a lot of time.
reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='10.219.51.10:27017', database='ProductionEvents', collection='srDates').load()
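For reference, the filter approach suggested above, applied to this reader, would look roughly like the following sketch ('key' and 'value' stand in for the real field name and filter value):

reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='10.219.51.10:27017', database='ProductionEvents', collection='srDates').load()
# Rough equivalent of find({'key': 'value'}): the predicate is pushed down
# to MongoDB when possible, so only matching documents are loaded.
docs = data.filter(data['key'] == 'value').collect()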