When inserting partitions, Dask parallelizes the writing of each partition across its workers. Additionally, the writing of variables within a partition is parallelized on the worker responsible for inserting that partition, using multiple threads. If you're using a single Dask worker, partition insertion will happen sequentially. We'll update the documentation to make this clearer.
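As a rough illustration of this model (a sketch using dask.distributed directly; the worker and thread counts below are illustrative assumptions, not the library's defaults):

```python
# Sketch of the parallelization model described above; the counts are
# illustrative assumptions, not defaults.
import dask.distributed

cluster = dask.distributed.LocalCluster(
    n_workers=4,           # up to 4 partitions inserted concurrently
    threads_per_worker=4,  # up to 4 variables written concurrently per partition
)
client = dask.distributed.Client(cluster)

# With n_workers=1, partitions are inserted one after another, although
# the variables within each partition can still be written in parallel
# by the worker's threads.
```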
According to the documentation:

> A `partition_size` set to 1 means that we map the update function over each partition.

However, this does not seem to be the case. The following example uses a local cluster to parallelize a dummy function. Without the `partition_size` argument, the work is properly spread over 6 workers. However, when `partition_size` is set to 1, it runs sequentially on a single worker (see the stand-in sketched below):
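A minimal, runnable stand-in for that example, using plain dask.distributed in place of the library's update function (the `dummy` callback, `run_batch` helper, and timings are illustrative assumptions):

```python
import time

import dask.distributed


def dummy(partition):
    time.sleep(1)  # simulate per-partition work; makes the scheduling visible
    return partition


if __name__ == "__main__":
    client = dask.distributed.Client(
        dask.distributed.LocalCluster(n_workers=6, threads_per_worker=1)
    )
    partitions = list(range(6))

    # One task per partition: spread across the 6 workers, ~1 s in total.
    client.gather(client.map(dummy, partitions))

    # What partition_size=1 appears to do instead: one batch holding all
    # partitions, executed sequentially on a single worker, ~6 s in total.
    def run_batch(batch):
        return [dummy(p) for p in batch]

    client.gather(client.submit(run_batch, partitions))
```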
It seems that `partition_size` is instead used as the number of batches rather than the number of partitions in each batch (link: batch sequence). The sketch below contrasts the two interpretations:
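A short sketch of the two readings (the helper names are hypothetical; the actual batching is in the linked code):

```python
# Two readings of partition_size, with hypothetical helper names.
def batches_of_size(seq, partition_size):
    # Documented reading: each batch holds `partition_size` partitions,
    # so partition_size=1 yields one task per partition.
    for i in range(0, len(seq), partition_size):
        yield seq[i:i + partition_size]


def n_batches(seq, partition_size):
    # Observed reading: the sequence is split into `partition_size`
    # batches, so partition_size=1 yields a single sequential batch.
    step = -(-len(seq) // partition_size)  # ceiling division
    for i in range(0, len(seq), step):
        yield seq[i:i + step]


partitions = list(range(6))
print(list(batches_of_size(partitions, 1)))  # [[0], [1], [2], [3], [4], [5]]
print(list(n_batches(partitions, 1)))        # [[0, 1, 2, 3, 4, 5]]
```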
What is the intended use of the `partition_size` argument?