Open ShaneHarvey opened 2 years ago
It would be even better to be able to configure the npartitions directly rather than the chunksize. The we can remove the duplicate query for the $count. Something like:
# Read Dask Bag from Mongo database
b = dask_mongo.read_mongo(
database="your_database",
collection="your_collection",
connection_kwargs={"host": "localhost", "port": 27017},
npartitions=16,
)
Hey @ShaneHarvey thanks for bringing this to our attention. In general, I agree that specifying npartitions
would be better, but the example in the README is very small and self-contained. Do you think that it would be better to generate a bigger example dask bag at the risk of making the example less concrete and grok-able? Or perhaps we could add comments next to the chunksize
and npartitions
kwargs
Adding an inline comment would be a great start, maybe something like chunksize=2, # Increase this value for better performance.
. I still worry about using a poor default value even if it's a toy example.
Support for the npartitions feature would be a nice addition since I believe it would scale better in most cases. When npartitions is added, we could update the readme to use npartitions=16. Sure, the same problem of using a poor default value exists here (maybe npartitions=4 or npartitions=50 is better for a given query) but I imagine npartitions=16 is much better than chunksize=2 in general.
The example in readme should not use chunksize=2:
IIUC a chunksize of 2 will cause read_mongo to execute 1 query for every 2 documents in the result set, with 1000 docs there will be 500 queries. Let's use a realistic value of chunksize (maybe 10,000?) to avoid users copy/pasting a poor default value.