Peter9192 opened 3 years ago
Another use case: if I run multiple model instances on a 24-core node on Cartesius like so:
for model in models:
    model.initialize()
    while model.time < model.end_time:
        model.update()
What is the most straightforward way to execute this job in parallel, e.g. 1 model instance per core?
Could we simply use multiprocessing? Or is there a better way? And would it be possible to add an example of this use case to the documentation?
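A minimal sketch of the multiprocessing route, under the assumption that the work can be expressed as a plain function taking and returning picklable data (`run_model` and its config dict are hypothetical stand-ins, not ewatercycle API; as discussed below, real BMI model objects may not serialize this way):

```python
from multiprocessing import Pool

def run_model(config):
    """Hypothetical worker: run one model instance to completion.

    Takes and returns only plain Python data, because multiprocessing
    must pickle everything it sends between processes.
    """
    time, end_time, state = 0.0, config["end_time"], 0.0
    while time < end_time:
        state += config["forcing"]  # stand-in for model.update()
        time += 1.0
    return state

if __name__ == "__main__":
    configs = [{"end_time": 10.0, "forcing": f} for f in (0.5, 1.0, 2.0)]
    with Pool(processes=3) as pool:  # one worker process per model instance
        results = pool.map(run_model, configs)
    print(results)  # one final state per model
```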
Multiprocessing will still be confined to a single machine. If we implement something parallel, I would like it to also work distributed, with something like https://docs.dask.org/, https://ray.io or https://docs.celeryproject.org/
Running models in parallel works; however, it only works when no data has to be serialized and transferred to other processes (i.e. no dask or multiprocessing, but threading works). Edit: this might have been specific to the Wflow.jl case, as it depended on connecting to Julia.
However, since the models run inside docker/apptainer, that isn't a big hurdle.
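Since threading avoids the serialization problem (each thread just drives its own model, whose state stays inside the container), the pattern could be sketched with `concurrent.futures.ThreadPoolExecutor`; `ToyModel` and `run_to_end` here are hypothetical stand-ins for an ewatercycle model proxy:

```python
from concurrent.futures import ThreadPoolExecutor

class ToyModel:
    """Hypothetical stand-in for a BMI model proxy; a real model's state
    would live inside its docker/apptainer container."""
    def __init__(self, end_time):
        self.time = 0.0
        self.end_time = end_time
        self.state = 0.0

    def initialize(self):
        self.time = 0.0

    def update(self):
        self.state += 1.0
        self.time += 1.0

def run_to_end(model):
    model.initialize()
    while model.time < model.end_time:
        model.update()
    return model.state

models = [ToyModel(end_time=t) for t in (3, 5, 7)]
# Threads need no pickling: each thread only issues calls against its
# own model object, so nothing is serialized between processes.
with ThreadPoolExecutor(max_workers=len(models)) as pool:
    results = list(pool.map(run_to_end, models))
print(results)  # final state per model
```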
@Daafip would you be able to add an example if/when you have this working?
Looking at Data Assimilation (DA), running models in parallel (faster!) would be great. Currently working on this; the first step is implementing a set of classes to run DA. Example here.
I have a crude example using tqdm working in a notebook. Here many models compute their output in parallel.
From early testing I didn't find much benefit from that as of yet, but I haven't put in any real effort just yet.
I'm following the 'Parallel Programming with Python' NLeSc course next week & will after that take a good look.
> From early testing I didn't find much benefit from that as of yet
I would expect that in your case, as the runtime for the HBV model is very short. It should be different when models take longer to compute their .update(), for example distributed models. If you would like to test whether the parallelization actually works, you could add a sleep(5) statement inside the model code.
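The sleep trick above could be sketched like this (a dummy `slow_update` stands in for a slow `model.update()`; a shorter delay keeps the demo quick). Because sleeping releases the GIL, the threaded run should finish in roughly one update's time rather than four:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_update(model_id, delay=0.2):
    """Dummy stand-in for model.update(): sleeping releases the GIL,
    much like a model that is busy computing inside its container."""
    time.sleep(delay)
    return model_id

n_models = 4

start = time.perf_counter()
for i in range(n_models):          # serial baseline: ~n_models * delay
    slow_update(i)
serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=n_models) as pool:
    list(pool.map(slow_update, range(n_models)))  # sleeps overlap
parallel = time.perf_counter() - start

print(f"serial: {serial:.2f}s, parallel: {parallel:.2f}s")
```

If the parallelization works, `parallel` comes out well below `serial`; if it prints roughly the same number, the models are still running one after another.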
> would you be able to add an example if/when you have this working?
I'm currently looking into this. My proposed structure would look something like this, where the greyed-out part is only needed when running data assimilation, whilst the rest can be used when just running an ensemble of models.
> If you would like to test if the parallelization actually works you could add a sleep(5) statement inside the model code.
Added a test model instead; won't add it to PyPI as it's more for development purposes. Maybe slightly overkill, but it allows testing inside a docker container too, which adds some overhead. Can be found here
Got a 10x theoretical speed-up now using dask.delayed, example here.
Next is to try and speed up the rest of the data assimilation steps, as getting and setting the whole state vector could also be optimized.
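A minimal sketch of the `dask.delayed` pattern, with a hypothetical `run_member` standing in for one ensemble member's update loop (the scheduler and `num_workers` arguments are made explicit here, since the defaults can cap the parallelism):

```python
import dask

@dask.delayed
def run_member(member_id, forcing):
    # Hypothetical stand-in for one ensemble member's model.update() loop.
    return member_id + forcing

# Build the task graph lazily: nothing runs until compute() is called.
tasks = [run_member(i, 0.5) for i in range(10)]

# scheduler/num_workers control the parallelism; "threads" matches the
# finding above that threading works where pickling does not.
results = dask.compute(*tasks, scheduler="threads", num_workers=10)
print(results)  # one result per ensemble member
```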
Thanks for sharing the progress!
> Got a 10x theoretical speed up now using dask.delayed
Did you use Dask's default scheduler? There is also the "distributed" Dask client, which they generally recommend nowadays (see https://docs.dask.org/en/stable/scheduler-overview.html). The exact speed-up with Dask will depend on the scheduler and its configuration, but also on the system you're on. If you have 10 threads on your CPU, the maximum speed-up you can get is 10x.
> Did you use Dask's default scheduler?
Yeah, for now.
Thanks, I will look into the customisation; that will be necessary for larger applications, but this was more a proof of concept. Fair point on the 10 threads: I think it defaulted to 12, but I only did short tests, so the amount of overhead is still significant.
_Edit: was a quick fix; it was indeed limited by num_workers. Changed in example & implementation._
For DA (& other applications), getting and setting states also induces quite some runtime, so I will look into parallelising this too.
For the use case where we run two (or more) experiments with minor differences in one notebook, it would be really nice if they could execute in parallel. E.g.
There are different ways to accomplish this; it would be nice if we could offer an easy way for the user via the ewatercycle interface, but this example shows it should at least be flexible enough to allow custom statements in the second loop.
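The two-experiments use case could be sketched with two threads, each running its own loop; `run_experiment` and the extra-forcing tweak in the second experiment are hypothetical stand-ins for the "minor differences" between the runs:

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(end_time, extra_forcing=0.0):
    """Hypothetical experiment loop; the second experiment differs only
    by the custom statement applied inside its loop."""
    time, state = 0.0, 0.0
    while time < end_time:
        state += 1.0 + extra_forcing  # the custom statement for experiment 2
        time += 1.0
    return state

# Submit both experiments so they execute concurrently in one notebook.
with ThreadPoolExecutor(max_workers=2) as pool:
    baseline = pool.submit(run_experiment, 5.0)
    perturbed = pool.submit(run_experiment, 5.0, extra_forcing=0.5)
    results = (baseline.result(), perturbed.result())
print(results)
```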