Stratio / Spark-MongoDB

Spark library for easy MongoDB access
http://www.stratio.com
Apache License 2.0

Question - Applying query filters while loading collections #144

Open gohilankit opened 8 years ago

gohilankit commented 8 years ago

I am using this API to query a large MongoDB collection. Is there any way to specify query filters so that only selected documents are loaded as a DataFrame, rather than the whole collection? I am looking for an equivalent of find({'key':'value'}), or of more complex MongoDB queries. I'm currently using spark-mongodb_2.10:0.11.0 and querying from PySpark with the command below; load() loads the whole collection, which takes a lot of time.

reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='10.219.51.10:27017', database='ProductionEvents', collection='srDates').load()

darroyocazorla commented 8 years ago

Hi @gohilankit

There are several ways to specify filters; the easiest is to use the DataFrame filter function.

So you should do something like data.filter(data.age > 3).collect(). The filter is pushed down to Mongo when possible.

DataFrames are lazy, so the query is executed only when a Spark action (collect, first, take, ...) is performed.
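
For reference, a minimal PySpark sketch of the suggestion above, reusing the host, database, and collection values from the question; the column name key and the literal 'value' are placeholders, and pushdown only happens for predicates the connector supports:

# Assumes a SQLContext named sqlContext, as in the pyspark shell used in the question.
reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='10.219.51.10:27017',
                      database='ProductionEvents',
                      collection='srDates').load()

# Roughly the equivalent of MongoDB's find({'key': 'value'}); 'key' and 'value' are placeholders.
filtered = data.filter(data['key'] == 'value')

# Nothing is fetched from MongoDB until an action runs; supported predicates
# are pushed down so only matching documents are read.
filtered.show()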

DeeeFOX commented 8 years ago

Do you mean that, with this approach, Spark will load only the filtered data instead of the whole target collection? And does Spark SQL behave the same way when the query is written as SQL?
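
To make the Spark SQL part of the question concrete, this is a sketch of the SQL route under the same setup; the temporary table name and the column are placeholders, and whether the WHERE clause is pushed down follows the same rules the previous comment describes for the DataFrame filter:

# Sketch of the Spark SQL variant of the same filter; table and column names are placeholders.
data.registerTempTable("srDates")
result = sqlContext.sql("SELECT * FROM srDates WHERE key = 'value'")

# As with the DataFrame API, nothing is read from MongoDB until an action is triggered.
result.show()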