coiled / dask-mongo

BSD 3-Clause "New" or "Revised" License
19 stars 9 forks source link

Example in readme should not use chunksize=2 #25

Open ShaneHarvey opened 2 years ago

ShaneHarvey commented 2 years ago

The example in readme should not use chunksize=2:

# Read Dask Bag from Mongo database
b = dask_mongo.read_mongo(
    database="your_database",
    collection="your_collection",
    connection_kwargs={"host": "localhost", "port": 27017},
    chunksize=2,
)

IIUC a chunksize of 2 will cause read_mongo to execute 1 query for every 2 documents in the result set, with 1000 docs there will be 500 queries. Let's use a realistic value of chunksize (maybe 10,000?) to avoid users copy/pasting a poor default value.

ShaneHarvey commented 2 years ago

It would be even better to be able to configure the npartitions directly rather than the chunksize. The we can remove the duplicate query for the $count. Something like:

# Read Dask Bag from Mongo database
b = dask_mongo.read_mongo(
    database="your_database",
    collection="your_collection",
    connection_kwargs={"host": "localhost", "port": 27017},
    npartitions=16,
)
phobson commented 1 year ago

Hey @ShaneHarvey thanks for bringing this to our attention. In general, I agree that specifying npartitions would be better, but the example in the README is very small and self-contained. Do you think that it would be better to generate a bigger example dask bag at the risk of making the example less concrete and grok-able? Or perhaps we could add comments next to the chunksize and npartitions kwargs

ShaneHarvey commented 1 year ago

Adding an inline comment would be a great start, maybe something like chunksize=2, # Increase this value for better performance.. I still worry about using a poor default value even if it's a toy example.

Support for the npartitions feature would be a nice addition since I believe it would scale better in most cases. When npartitions is added, we could update the readme to use npartitions=16. Sure, the same problem of using a poor default value exists here (maybe npartitions=4 or npartitions=50 is better for a given query) but I imagine npartitions=16 is much better than chunksize=2 in general.