cubed-dev / cubed

Bounded-memory serverless distributed N-dimensional array processing
https://cubed-dev.github.io/cubed/
Apache License 2.0

Predictive model to find optimal reduction parameters #459

Open balanz24 opened 1 month ago

balanz24 commented 1 month ago

This is a possible solution to #418

Our model aims to predict the split_every value that makes the reduction as fast as possible. This parameter affects the input data size of each function, the total number of stages, and the number of functions per stage.
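To make the trade-off concrete, here is a minimal sketch (not Cubed's actual scheduling code) of how split_every shapes a tree reduction: each stage combines up to split_every intermediate chunks, so a larger value means fewer stages but more input data per function.

```python
import math

def reduction_schedule(n_chunks, split_every):
    """Illustrative: number of tasks at each stage of a tree reduction.

    Each stage combines up to `split_every` intermediate chunks, so the
    chunk count shrinks by that factor until one result remains.
    """
    stages = []
    while n_chunks > 1:
        n_chunks = math.ceil(n_chunks / split_every)
        stages.append(n_chunks)
    return stages

# e.g. 1000 input chunks reduced with split_every=4
print(reduction_schedule(1000, 4))  # [250, 63, 16, 4, 1]
```

With split_every=4 a 1000-chunk input needs 5 stages; with split_every=10 it would need only 3, at the cost of each function reading more data.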

Evaluation has only been done with Lithops, but should be extended to further backends.

The model predicts 3 components of the execution time separately:

Invocation and CPU times are easy to predict using linear regression, since they increase linearly with the size of the dataset being reduced. The I/O time is predicted using the primula-plug approach presented in the Primula paper.
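As a rough illustration of the linear-regression part (the measurements below are made-up numbers, not results from this PR), one component model could be fitted like this:

```python
import numpy as np

# Illustrative only: fit a linear model t = a * size + b for one time
# component (e.g. CPU time) from hypothetical profiling measurements.
sizes_mb = np.array([64, 128, 256, 512, 1024], dtype=float)
cpu_times_s = np.array([0.9, 1.7, 3.2, 6.1, 12.0])

a, b = np.polyfit(sizes_mb, cpu_times_s, deg=1)

def predict_cpu_time(size_mb):
    """Predicted CPU time (s) for a task processing `size_mb` of input."""
    return a * size_mb + b
```

The same fit would be repeated for invocation time; the I/O component needs the Primula-style model instead, since object-store throughput does not scale linearly with task count.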

Here we see a comparison of the real vs predicted times in a quadratic means test on 15 GB of data, measured using Lithops on AWS Lambda and S3.

As we can see, the model is able to predict the optimal split_every=4, which gives the lowest execution time.
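Putting the three components together, the selection step can be sketched as follows. This is a hypothetical outline, not the PR's implementation: predict_invocation, predict_cpu, and predict_io stand in for the fitted per-component models described above.

```python
import math

def predict_total_time(n_chunks, chunk_mb, split_every,
                       predict_invocation, predict_cpu, predict_io):
    """Sum predicted stage times over the whole reduction tree."""
    total = 0.0
    while n_chunks > 1:
        n_tasks = math.ceil(n_chunks / split_every)
        input_mb = chunk_mb * min(split_every, n_chunks)
        # each stage waits for its slowest task: invocation + compute + I/O
        total += (predict_invocation(n_tasks)
                  + predict_cpu(input_mb)
                  + predict_io(input_mb, n_tasks))
        n_chunks = n_tasks
    return total

def best_split_every(n_chunks, chunk_mb, models, candidates=range(2, 17)):
    """Pick the candidate split_every with the lowest predicted total time."""
    return min(candidates,
               key=lambda s: predict_total_time(n_chunks, chunk_mb, s, *models))
```

The interesting behaviour is the tension this captures: a small split_every keeps per-task inputs small but multiplies stages (and invocation overhead), while a large one does the opposite.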

Some observations on the results:

tomwhite commented 1 month ago

Thanks for doing this work @balanz24!

It would be interesting to see if the results changed with larger datasets on the same quadratic means. In particular, does the optimal value of split_every increase once the number of tasks exceeds the number of workers (1000 on AWS Lambda)?

Making this easy to use for Cubed users, or integrating it as a plugin would be a great addition.

balanz24 commented 1 month ago

During this week I've been testing the model with larger datasets and the results look promising.

In particular, I've used a >300 GB dataset, setting optimize_graph=False to avoid fusing operations, in order to have stages with more than 1000 workers, as you suggested. The predictions are farther from the real values than for smaller datasets, but the trend remains the same: the model is able to find the optimal split_every, which is indeed larger (around 6 to 8 in this case).

The next steps would be:

TomNicholas commented 1 month ago

> we can discuss it in the next meeting

FYI we're gonna skip the meeting this coming Monday - see https://discourse.pangeo.io/t/new-working-group-for-distributed-array-computing/2734/56?u=tomnicholas

tomwhite commented 1 month ago

> Particulary I've used a >300GB dataset, setting optimize_graph=False to avoid fusing operations in order to have stages with more than 1000 workers, as you suggested.

I wouldn't set optimize_graph=False as this will avoid doing any optimization. What I was suggesting was to scale up so the number of chunks in the input was over 1000, so that all workers were used. Even with fusion there would still be over 1000 tasks at the first stage of the computation.
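To estimate when that scale is reached, a quick back-of-the-envelope check (a sketch, not a Cubed API) is to count input chunks directly from the array and chunk shapes:

```python
import math

def n_chunks(shape, chunks):
    """Illustrative: number of chunks for an array of `shape` split into
    blocks of shape `chunks` (partial edge chunks counted)."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# e.g. a (40000, 40000) array with (1000, 1000) chunks
print(n_chunks((40000, 40000), (1000, 1000)))  # 1600 chunks > 1000 workers
```

Once the first stage has more tasks than the Lambda worker pool (1000), tasks queue behind workers, which is the regime where a larger split_every could start to pay off.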