coiled / dask-mongo

BSD 3-Clause "New" or "Revised" License

Add demo example #10

Closed ncclementi closed 3 years ago

ncclementi commented 3 years ago

This notebook contains a demo example of a write/read process to Mongo Atlas, using pymongo and dask-mongo. Unfortunately, we can't run bigger examples due to the M0 free tier's limitations.

mrocklin commented 3 years ago

Thanks for putting this together @ncclementi. Here are some small comments.

So, if I were to do this I might try the following:

Dask and MongoDB

Example dataset, AirBnB listings

https://docs.atlas.mongodb.com/sample-data/sample-airbnb/#std-label-sample-airbn

We've set up MongoDB with Atlas, and loaded the AirBnB listings dataset. If you don't know how to do this, check out these links

* link 1

* link 2

Let's connect to our database and look at a sample element

# a tiny bit of Python code
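For example, a tiny sketch along these lines (the connection string is a placeholder; the database and collection names come from the Atlas sample data):

import pymongo

# placeholder connection string for the Atlas cluster
client = pymongo.MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")
collection = client["sample_airbnb"]["listingsAndReviews"]

# look at a single sample document
collection.find_one()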

Size

Unfortunately, this dataset is large

# tiny query to check count
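For example, assuming the collection handle from the cell above:

# how many documents are we dealing with?
collection.count_documents({})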

It's OK, Mongo includes the great [aggregation pipeline query language]() to help perform queries natively in Mongo. However, some folks would prefer using Python, either because they want to use native Python functionality or because they like tools like Pandas. They could pull out the dataset with functions like ..., but if it becomes too large then they have a problem.
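As a reference point, a native aggregation run through pymongo might look like the sketch below (the property_type and price field names are guesses at the AirBnB sample schema, so double-check them):

# average price per property type, computed entirely inside Mongo
pipeline = [
    {"$group": {"_id": "$property_type", "avg_price": {"$avg": "$price"}}},
    {"$sort": {"avg_price": -1}},
]
list(collection.aggregate(pipeline))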

Use Dask to read from Mongo

Dask-mongo solves this problem

# tiny dask-mongo code to get out a bag
# tiny code to do some simple operation
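A sketch of those two cells, reusing the read_mongo arguments that appear later in this thread (host is assumed to hold the Atlas connection string):

from dask_mongo import read_mongo

bag = read_mongo(
    connection_kwargs={"host": host},
    database="sample_airbnb",
    collection="listingsAndReviews",
    chunksize=500,
)

# a simple operation on the bag: count listings per property type
bag.pluck("property_type").frequencies().compute()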

Dask dataframes

Text around why people like Pandas,

# code to switch to a dask dataframe, and then some sample queries
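A possible sketch, assuming we first flatten each document down to a few scalar fields (field names again come from the AirBnB sample, and price is assumed to be a Decimal128 that needs converting):

ddf = bag.map(
    lambda doc: {
        "property_type": doc.get("property_type"),
        "bedrooms": doc.get("bedrooms"),
        "price": float(str(doc.get("price", 0))),
    }
).to_dataframe()

# sample query: average listing price per property type
ddf.groupby("property_type").price.mean().compute()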

Deployment

You may have noticed that we used Coiled above. Atlas and Coiled make it easy to run Mongo and Dask respectively in the cloud. We did take some care to ensure that the regions were the same. We're also grateful that concerns like cloud provisioning and security are handled for us automatically. This makes cloud storage and cloud computing resources significantly more accessible.

Summary

Some final thoughts ...

mrocklin commented 3 years ago

Some meta thoughts on my approach above. I apologize if you know all of this stuff already or disagree. I get pretty pedantic when it comes to reader attention span (see https://matthewrocklin.com/blog/work/2020/07/13/brevity)

I think that we need to give them something flashy quickly to hook them. In this case that's the AirBnB sample data (it looks like this). I think that it is important that this is above the fold.

We show value, but then present a problem, size.

We then break out the solution, Dask. We want to get to this state quickly. This is why the reader is here, so we want to give them what they want soon. This is like Bond movies always starting with a chase scene.

After this point I think that we're good. We've built up value and can educate them on stuff. I think that we don't necessarily want to get too deep just yet though (I would probably drop writing and read-with-match (although I just learned that we could do this, cool feature)). Those seem useful, but probably something for docs or a follow-on notebook.

mrocklin commented 3 years ago

We can also leave this for the evangelism team if you prefer. I think that what you have here is more than enough to get someone up and running. My guess though is that you'd like to engage on this, in which case grand, I look forward to iterating.

ncclementi commented 3 years ago

Thanks for putting this together @ncclementi. Here are some small comments.

Thanks for the comments @mrocklin, this is very helpful. A couple of things that might have gotten lost that I think are relevant:

It looks like Mongo/Atlas includes some sample data. I recommend that we use that: https://docs.atlas.mongodb.com/sample-data/#std-label-load-sample-data . This should provide us with a larger dataset, make things a bit more relevant, and allow us to skip some header material and get to substance more quickly.

They do have data examples; it's a good idea to use them. They are not necessarily that big, but we can give it a try.

I think that we should write this for an audience who has familiarity with Mongo. As a result, we can probably skip some of the explanatory material around connecting. This is like when people who don't know Python well give tutorials and say "Python has a package manager pip, which ..." and explain things in depth. Most Python people find this less than useful.

This makes sense; I wasn't exactly sure which audience we were targeting. But if we assume familiarity with Mongo, then this is a valid point.

We might want to think up a reason why people would want to use Dask and Mongo together. I did a brief search for PySpark on the Mongo Blog and came up with a few things that seem interesting:
https://www.mongodb.com/blog/post/exploring-data-with-mongo-db-atlas-databricks-and-google-cloud
https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup (links to parts 2 and 3 exist)
This is probably going too deep though. Maybe just convert to a dask dataframe (see the Bag.to_dask_dataframe method) and then do a groupby aggregation.

I agree that we need a reason. So far, if I read all the data in, say, the Airbnb sample data set and compare the time using pure pymongo with the time using read_mongo().compute(), there is no difference, and in some cases using dask is slower. I believe the advantage would come from doing operations in dask.bag land or from converting that to a dask dataframe. For the latter (bag -> ddf), this requires work from the user in passing a meta and columns. Going back to the Airbnb data sample, I tried to do to_dataframe without specifying anything and I got some errors.

So, if I were to do this I might try the following:

Dask and MongoDB

Example dataset, AirBnB listings

https://docs.atlas.mongodb.com/sample-data/sample-airbnb/#std-label-sample-airbn

We've set up MongoDB with Atlas, and loaded the AirBnB listings dataset. If you don't know how to do this, check out these links

* link 1

* link 2

Let's connect to our database and look at a sample element

# a tiny bit of Python code

Size

Unfortunately, this dataset is large

Here is the thing: it's not that big; reading the whole thing using pymongo takes 9 seconds. It's about 5,500 documents (~90 MB total). All the sample data sets are relatively small, so you can load them on an M0 instance, which has a storage limit of 500 MB.

# tiny query to check count

It's OK, Mongo includes the great aggregation pipeline query language to help perform queries natively in Mongo. However, some folks would prefer using Python, either because they want to use native Python functionality or because they like tools like Pandas. They could pull out the dataset with functions like ..., but if it becomes too large then they have a problem.

Use Dask to read from Mongo

Dask-mongo solves this problem

# tiny dask-mongo code to get out a bag
# tiny code to do some simple operation

Dask dataframes

Text around why people like Pandas,

# code to switch to a dask dataframe, and then some sample queries

I like the structure, but with the examples I've been able to test, the code doesn't tell the same story.

Deployment

You may have noticed that we used Coiled above. Atlas and Coiled make it easy to run Mongo and Dask respectively in the cloud. We did take some care to ensure that the regions were the same. We're also grateful that concerns like cloud provisioning and security are handled for us automatically. This makes cloud storage and cloud computing resources significantly more accessible.

Summary

Some final thoughts ...

ncclementi commented 3 years ago

Some meta thoughts on my approach above. I apologize if you know all of this stuff already or disagree. I get pretty pedantic when it comes to reader attention span (see https://matthewrocklin.com/blog/work/2020/07/13/brevity)

Not a problem, I think I now have a clear idea of where we are going with this.

I think that we need to give them something flashy quickly to hook them. In this case that's the AirBnB sample data (it looks like this). I think that it is important that this is above the fold.

We show value, but then present a problem, size.

We then break out the solution, Dask. We want to get to this state quickly. This is why the reader is here, so we want to give them what they want soon. This is like Bond movies always starting with a chase scene.

Here is where I'm struggling: I haven't found an example that conveys this. Maybe I'm missing something, but by simply reading the whole data set (Airbnb), dask-mongo doesn't seem to be doing a better job than pure pymongo. Using pymongo:

%%time
my_records = list(collection.find()) 
CPU times: user 836 ms, sys: 337 ms, total: 1.17 s
Wall time: 9.37 s

Using dask-mongo:

read_bag = read_mongo(
    connection_kwargs={"host": host},
    database="sample_airbnb",
    collection="listingsAndReviews",
    chunksize=500,
)

%%time
records = read_bag.compute()
CPU times: user 1.82 s, sys: 195 ms, total: 2.02 s
Wall time: 9.19 s

mrocklin commented 3 years ago

there is no difference and in some cases using dask is slower

Yeah, to be clear I wouldn't expect full dataset reads to be the biggest draw here. This also isn't a behavior that we want to encourage. We want data to stay remote if possible. I also suspect that we're entirely bound by your internet connection here.

I recommend pulling into a dask dataframe and doing a groupby aggregation. We can do that with mongo too, but it's hard.

Going back to the Airbnb data sample, I tried to do to_dataframe without specifying anything and I got some errors.

You may have to select only a subset of the columns, or do some preprocessing in order to flatten things down a bit.

Here is an example with github data: https://gist.github.com/mrocklin/02ac8bbb0c671459644faed4146820c1
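For instance, something in this spirit (the field names are guesses at the AirBnB schema, not what the gist uses):

def flatten(doc):
    # pull a nested field up to the top level and keep only a few columns
    return {
        "name": doc.get("name"),
        "country": doc.get("address", {}).get("country"),
        "n_reviews": len(doc.get("reviews", [])),
    }

ddf = bag.map(flatten).to_dataframe()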

ncclementi commented 3 years ago

there is no difference and in some cases using dask is slower

Yeah, to be clear I wouldn't expect full dataset reads to be the biggest draw here. This also isn't a behavior that we want to encourage. We want data to stay remote if possible. I also suspect that we're entirely bound by your internet connection here.

Perfect, this makes much more sense now.

I recommend pulling into a dask dataframe and doing a groupby aggregation. We can do that with mongo too, but it's hard.

Going back to the Airbnb data sample, I tried to do to_dataframe without specifying anything and I got some errors.

You may have to select only a subset of the columns, or do some preprocessing in order to flatten things down a bit.

Here is an example with github data: https://gist.github.com/mrocklin/02ac8bbb0c671459644faed4146820c1

Thanks for sharing that example, this is very useful. I'll try to do something like that.

mrocklin commented 3 years ago

They do have data examples; it's a good idea to use them. They are not necessarily that big, but we can give it a try.

I think that ideally we would have something large, but if that's going to take time then let's just skip this and move on. I recommend that we go ahead with what we have and leave finding something larger to the evangelism team.

My hope is that we'll be able to finish up a basic notebook that Sales can use by end of day, and then move on to bigger and better things :) (BigQuery?)

ncclementi commented 3 years ago

They do have data examples; it's a good idea to use them. They are not necessarily that big, but we can give it a try.

I think that ideally we would have something large, but if that's going to take time then let's just skip this and move on. I recommend that we go ahead with what we have and leave finding something larger to the evangelism team.

Unfortunately, if we rely on the free Mongo cluster (M0 tier), this won't be possible. Having something larger would mean finding bigger data and hosting it in a different Atlas instance.

mrocklin commented 3 years ago

Yeah, agreed. Sorry, my previous comment was intended to say "let's just bail on a big data example and go with what we have"

koverholt commented 3 years ago

Thank you both for iterating on this example notebook. I ran the latest version and it worked perfectly on my Atlas account.

My only request would be to add a couple of cells to create a Coiled cluster and run the computations on that. 😄
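Something like the following pair of cells might do it (the cluster size and options are just illustrative):

import coiled
from dask.distributed import Client

# ideally in the same cloud region as the Atlas instance
cluster = coiled.Cluster(n_workers=5)
client = Client(cluster)

# then re-run the read_mongo / dataframe computations, which will now run on the cluster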

mrocklin commented 3 years ago

More feedback!

ncclementi commented 3 years ago

Feedback taken, will update soon. A few things to comment back on:

* Currently we're commenting out the `b.take(1)`. Is this because it's large and ungainly?

Yes it's super long and complicated to read.

* I would merge all of cells 14-17.  Also, I recommend using `dask.dataframe.DataFrame.to_bag`

Using dask.dataframe.DataFrame.to_bag doesn't work in this case: to_bag returns a list of tuples with the values of each row in the dataframe, while we need something with the following format to work nicely with Mongo:

[
    {"name": "Alice", "fruit": "apricots"},
    {"name": "Bob", "fruit": ["apricots", "cherries"]},
    {"name": "John", "age": 17, "sports": "cycling"},
]

Hence the complicated workaround to go from dataframe to bag. If there is a simpler version, I'm happy to change it.
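For reference, one way to express that workaround (a rough sketch, not necessarily exactly what the notebook does):

columns = list(ddf.columns)

# to_bag() yields one tuple of values per row; zip each tuple back up with the column names
records = ddf.to_bag().map(lambda row: dict(zip(columns, row)))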

mrocklin commented 3 years ago

Let's raise an issue to add this to dask bag. Dictionaries should be a common output format. Can I ask you to raise an issue at github.com/dask/dask/issues/new ?

ncclementi commented 3 years ago

Let's raise an issue to add this to dask bag. Dictionaries should be a common output format. Can I ask you to raise an issue at github.com/dask/dask/issues/new ?

Yes, it's already on my TODO list :). Will get on it soon.

mrocklin commented 3 years ago

(I also realize that it's late. Please take my comments here as suggestions for stuff to do tomorrow rather than today)

ncclementi commented 3 years ago

If the demo is good as is and we don't need any more changes, should we go ahead and merge it? @jrbourbeau do you have any comments on the demo (content/style/etc.) or something you would like to add?