Stratio / Spark-MongoDB

Spark library for easy MongoDB access
http://www.stratio.com
Apache License 2.0

Question - Applying query filters while loading collections #144

Open gohilankit opened 8 years ago

gohilankit commented 8 years ago

I am using this API to query a large MongoDB collection. Is there any way to specify query filters so that only selected documents are loaded as a DataFrame, rather than the whole collection? I am looking for an equivalent of find({'key':'value'}), or of more complex MongoDB queries. I'm currently using spark-mongodb_2.10:0.11.0 and querying from PySpark with the command below; load() loads the whole collection, which takes a lot of time.

reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='10.219.51.10:27017', database='ProductionEvents', collection='srDates').load()

darroyocazorla commented 8 years ago

Hi @gohilankit

There are several ways to specify filters; the easiest is to use the DataFrame filter function.

So you should do something like data.filter(data.age > 3).collect(). The filter is pushed down to Mongo when possible.

DataFrames are lazy, so the query is executed only when a Spark action (collect, first, take, ...) is performed.
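
For reference, a minimal PySpark sketch of the suggestion above, reusing the host, database, and collection values from the question; the column name key and the literal 'value' are placeholders, and pushdown only happens for predicates the connector supports:

# Assumes a SQLContext named sqlContext, as in the pyspark shell used in the question.
reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='10.219.51.10:27017',
                      database='ProductionEvents',
                      collection='srDates').load()

# Roughly the equivalent of MongoDB's find({'key': 'value'}); 'key' and 'value' are placeholders.
filtered = data.filter(data['key'] == 'value')

# Nothing is fetched from MongoDB until an action runs; supported predicates
# are pushed down so only matching documents are read.
filtered.show()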

DeeeFOX commented 8 years ago

Do you mean that, with this approach, Spark will load only the filtered data instead of the whole target collection? And does Spark SQL behave the same way when the query is written as SQL?
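
To make the Spark SQL part of the question concrete, this is a sketch of the SQL route under the same setup; the temporary table name and the column are placeholders, and whether the WHERE clause is pushed down follows the same rules the previous comment describes for the DataFrame filter:

# Sketch of the Spark SQL variant of the same filter; table and column names are placeholders.
data.registerTempTable("srDates")
result = sqlContext.sql("SELECT * FROM srDates WHERE key = 'value'")

# As with the DataFrame API, nothing is read from MongoDB until an action is triggered.
result.show()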