cubed-dev / cubed

Bounded-memory serverless distributed N-dimensional array processing
https://cubed-dev.github.io/cubed/
Apache License 2.0

Predictive model to find optimal reduction parameters #459

Open balanz24 opened 1 month ago

balanz24 commented 1 month ago

This is a possible solution to #418

Our model aims to predict the split_every value that makes the reduction as fast as possible. This parameter affects the input data size of each function, the total number of stages, and the number of functions per stage.
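To make the trade-off concrete, here is a minimal sketch (not Cubed's actual scheduling code) of how split_every shapes a tree reduction: each stage combines up to split_every intermediate chunks, so a larger value means fewer stages but more input data per function.

```python
import math

def reduction_schedule(n_chunks, split_every):
    """Illustrative: number of tasks at each stage of a tree reduction.

    Each stage combines up to `split_every` intermediate chunks, so the
    chunk count shrinks by that factor until one result remains.
    """
    stages = []
    while n_chunks > 1:
        n_chunks = math.ceil(n_chunks / split_every)
        stages.append(n_chunks)
    return stages

# e.g. 1000 input chunks reduced with split_every=4
print(reduction_schedule(1000, 4))  # [250, 63, 16, 4, 1]
```

With split_every=4 a 1000-chunk input needs 5 stages; with split_every=10 it would need only 3, at the cost of each function reading more data.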

Evaluation has only been done with Lithops, but should be extended to further backends.

The model predicts 3 components of the execution time separately:

Invocation and CPU times are easy to predict using linear regression, since they increase linearly with the size of the dataset being reduced. The I/O time is predicted using the primula-plug approach presented in the Primula paper.
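As a rough illustration of the linear-regression part (the measurements below are made-up numbers, not results from this PR), one component model could be fitted like this:

```python
import numpy as np

# Illustrative only: fit a linear model t = a * size + b for one time
# component (e.g. CPU time) from hypothetical profiling measurements.
sizes_mb = np.array([64, 128, 256, 512, 1024], dtype=float)
cpu_times_s = np.array([0.9, 1.7, 3.2, 6.1, 12.0])

a, b = np.polyfit(sizes_mb, cpu_times_s, deg=1)

def predict_cpu_time(size_mb):
    """Predicted CPU time (s) for a task processing `size_mb` of input."""
    return a * size_mb + b
```

The same fit would be repeated for invocation time; the I/O component needs the Primula-style model instead, since object-store throughput does not scale linearly with task count.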

Here we see a comparison of the real vs predicted times in a quadratic means test on 15 GB of data, measured using Lithops on AWS Lambda and S3.

As we can see, the model is able to predict the optimal split_every=4, which gives the lowest execution time.
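Putting the three components together, the selection step can be sketched as follows. This is a hypothetical outline, not the PR's implementation: predict_invocation, predict_cpu, and predict_io stand in for the fitted per-component models described above.

```python
import math

def predict_total_time(n_chunks, chunk_mb, split_every,
                       predict_invocation, predict_cpu, predict_io):
    """Sum predicted stage times over the whole reduction tree."""
    total = 0.0
    while n_chunks > 1:
        n_tasks = math.ceil(n_chunks / split_every)
        input_mb = chunk_mb * min(split_every, n_chunks)
        # each stage waits for its slowest task: invocation + compute + I/O
        total += (predict_invocation(n_tasks)
                  + predict_cpu(input_mb)
                  + predict_io(input_mb, n_tasks))
        n_chunks = n_tasks
    return total

def best_split_every(n_chunks, chunk_mb, models, candidates=range(2, 17)):
    """Pick the candidate split_every with the lowest predicted total time."""
    return min(candidates,
               key=lambda s: predict_total_time(n_chunks, chunk_mb, s, *models))
```

The interesting behaviour is the tension this captures: a small split_every keeps per-task inputs small but multiplies stages (and invocation overhead), while a large one does the opposite.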

Some observations on the results:

tomwhite commented 1 month ago

Thanks for doing this work @balanz24!

It would be interesting to see if the results changed with larger datasets on the same quadratic means. In particular, does the optimal value of split_every increase once the number of tasks exceeds the number of workers (1000 on AWS Lambda)?

Making this easy to use for Cubed users, or integrating it as a plugin would be a great addition.

balanz24 commented 1 month ago

During this week I've been testing the model with larger datasets and the results look promising.

In particular, I've used a >300 GB dataset, setting optimize_graph=False to avoid fusing operations, in order to have stages with more than 1000 workers, as you suggested. The predictions are farther from the real values than for smaller datasets, but the trend remains the same: the model is able to find the optimal split_every, which is indeed larger (around 6 to 8 in this case).

The next steps would be:

TomNicholas commented 1 month ago

> we can discuss it in the next meeting

FYI we're gonna skip the meeting this coming Monday - see https://discourse.pangeo.io/t/new-working-group-for-distributed-array-computing/2734/56?u=tomnicholas

tomwhite commented 1 month ago

> Particulary I've used a >300GB dataset, setting optimize_graph=False to avoid fusing operations in order to have stages with more than 1000 workers, as you suggested.

I wouldn't set optimize_graph=False as this will avoid doing any optimization. What I was suggesting was to scale up so the number of chunks in the input was over 1000, so that all workers were used. Even with fusion there would still be over 1000 tasks at the first stage of the computation.
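To estimate when that scale is reached, a quick back-of-the-envelope check (a sketch, not a Cubed API) is to count input chunks directly from the array and chunk shapes:

```python
import math

def n_chunks(shape, chunks):
    """Illustrative: number of chunks for an array of `shape` split into
    blocks of shape `chunks` (partial edge chunks counted)."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# e.g. a (40000, 40000) array with (1000, 1000) chunks
print(n_chunks((40000, 40000), (1000, 1000)))  # 1600 chunks > 1000 workers
```

Once the first stage has more tasks than the Lambda worker pool (1000), tasks queue behind workers, which is the regime where a larger split_every could start to pay off.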