balanz24 opened 1 month ago
Thanks for doing this work @balanz24!
It would be interesting to see if the results changed with larger datasets on the same quadratic means test. In particular, does the optimal value of `split_every` increase once the number of tasks exceeds the number of workers (1000 on AWS Lambda)?
Making this easy to use for Cubed users, or integrating it as a plugin would be a great addition.
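To make the worker-saturation question concrete, here is a minimal sketch (the function name and the counting convention are my own and may not match Cubed's internal reduction exactly) of how many tasks each combine stage of a tree reduction uses for a given `split_every`:

```python
import math

def tasks_per_stage(n_chunks, split_every):
    """Number of tasks at each stage of a tree reduction that
    combines `split_every` partial results per task."""
    stages = []
    n = n_chunks
    while n > 1:
        n = math.ceil(n / split_every)
        stages.append(n)
    return stages

# With 4096 input chunks, the first combine stage alone uses more
# tasks than the 1000 concurrent workers available on AWS Lambda.
print(tasks_per_stage(4096, 4))  # → [1024, 256, 64, 16, 4, 1]
print(tasks_per_stage(4096, 8))  # → [512, 64, 8, 1]
```

For small `split_every` values the early stages exceed the 1000-worker concurrency limit, which is the regime the question above is asking about.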
During this week I've been testing the model with larger datasets and the results look promising.
Particularly, I've used a >300 GB dataset, setting `optimize_graph=False` to avoid fusing operations, in order to have stages with more than 1000 workers, as you suggested. The predictions obtained are farther from the real values compared to smaller datasets, but the trend remains the same. The model is able to find the optimal `split_every`, which is indeed increasing (around 6 to 8 in this case).
The next steps would be (we can discuss them in the next meeting):
FYI we're gonna skip the meeting this coming Monday - see https://discourse.pangeo.io/t/new-working-group-for-distributed-array-computing/2734/56?u=tomnicholas
> Particularly, I've used a >300 GB dataset, setting `optimize_graph=False` to avoid fusing operations, in order to have stages with more than 1000 workers, as you suggested.
I wouldn't set `optimize_graph=False`, as this will disable all optimization. What I was suggesting was to scale up so that the number of chunks in the input was over 1000, so that all workers were used. Even with fusion there would still be over 1000 tasks at the first stage of the computation.
This is a possible solution to #418
Our model aims to predict the optimal `split_every` value that makes the reduction as fast as possible. This parameter affects the input data size of each function, the total number of stages, and the number of functions per stage. Evaluation has only been done in Lithops, but it should be extended to other backends.
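As a rough illustration of the three quantities `split_every` controls (function and parameter names here are my own, not taken from the model's code):

```python
import math

def reduction_profile(n_chunks, chunk_mb, split_every):
    """Per-function input size, number of stages, and functions per
    stage for a tree reduction with fan-in `split_every`."""
    per_function_mb = split_every * chunk_mb  # data each function reads
    funcs_per_stage = []
    n = n_chunks
    while n > 1:
        n = math.ceil(n / split_every)
        funcs_per_stage.append(n)
    return per_function_mb, len(funcs_per_stage), funcs_per_stage

# 64 chunks of 100 MB, combined 4 at a time:
print(reduction_profile(64, 100, 4))  # → (400, 3, [16, 4, 1])
```

Increasing `split_every` makes each function read more data but shortens the tree, which is exactly the trade-off the model has to balance.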
The model predicts 3 components of the execution time separately: invocation time, CPU time, and I/O time.
Invocation and CPU times are easy to predict using linear regression, as they increase linearly with the size of the dataset to reduce. As for the I/O time, it is predicted using the primula-plug approach presented in the Primula paper.
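The linear-regression part can be sketched as follows; the measurements below are made-up numbers for illustration, not values from the experiments:

```python
import numpy as np

# Hypothetical (dataset size in GB, measured CPU time in s) samples.
sizes_gb = np.array([1.0, 2.0, 4.0, 8.0, 15.0])
cpu_s = np.array([2.1, 4.0, 8.2, 16.1, 30.2])

# Fit time = a * size + b by least squares; invocation time would be
# fitted the same way against its own measurements.
a, b = np.polyfit(sizes_gb, cpu_s, 1)

def predict_cpu_time(size_gb):
    """Predicted CPU time for a dataset of the given size."""
    return a * size_gb + b
```

Since both components grow roughly linearly with the data reduced, a degree-1 fit is enough here; the I/O component needs the separate Primula-based treatment described above.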
Here we see a comparison of the real vs predicted times in a quadratic means test of 15 GB. This has been measured using Lithops on AWS Lambda and S3.
As we can see, the model is able to predict the optimal `split_every=4`, which gives the lowest execution time. Some observations on the results:
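The selection step can be sketched like this; the cost constants and the assumption that stages run sequentially with fully parallel functions are mine, stand-ins for the fitted component models:

```python
import math

def predicted_total_time(n_chunks, split_every, invoke_s=0.5, per_input_s=0.3):
    """Sum of per-stage times: each stage pays one invocation latency
    plus work proportional to the `split_every` inputs each function
    reads and combines (functions within a stage run in parallel)."""
    total, n = 0.0, n_chunks
    while n > 1:
        n = math.ceil(n / split_every)
        total += invoke_s + split_every * per_input_s
    return total

# Pick the fan-in with the lowest predicted total time.
best = min(range(2, 17), key=lambda s: predicted_total_time(256, s))
print(best)  # → 4
```

With these illustrative constants the predicted minimum happens to fall at a small fan-in too: larger `split_every` values cut the number of stages, but each function then spends more time reading and combining inputs.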