Closed: scarlehoff closed this 4 years ago
Great, looks good. I will give it a try on other IT infrastructures.
In indaco I've had the same problems as with dom, so definitely not having this in the main package. The pickling is also very tricky and it only seems to work with tf > 2.2.
In any case, the way this needs to be done is by passing a dask cluster object to, for instance, the compile call. I'll add an example and then have the docs point to the list of systems supported by dask.
The advantage is that, by doing it that way, we are compatible with every queue system dask supports.
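To make the intended usage concrete, here is a minimal sketch of building the cluster object. The `cluster=` keyword on the compile call is hypothetical and only illustrates the idea of handing the dask handle to the library:

```python
# Build the dask cluster object that would be handed to the compile call.
# The LocalCluster is only for testing; on a batch system one of the
# dask-jobqueue clusters (SLURMCluster, PBSCluster, ...) plays the same role.
from dask.distributed import Client, LocalCluster
# from dask_jobqueue import SLURMCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
# cluster = SLURMCluster(queue="regular", cores=8, memory="16GB", walltime="02:00:00")
# cluster.scale(jobs=4)  # ask the scheduler for 4 worker jobs

client = Client(cluster)
print(client.dashboard_link)  # dask monitor panel

# The integrator would then receive this handle, e.g. (hypothetical keyword):
# integrator.compile(integrand, cluster=client)
```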
The latest commit is working in indaco. I have to say I'm very happy with dask: other than the expected pitfalls of passing objects around through sockets, everything works as advertised.
This is ready for review. If you have access to a non-slurm workload manager it would be helpful to have a second example; if not, I think this one is enough.
Very good, I have tried the local cluster (the dask monitor panel seems to work fine) and the PBSCluster; both cases are working fine. Just wondering if we have some multi-GPU nodes in some cluster (maybe marco?); if not, we can try to rent and configure slurm on some cloud machines.
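For anyone wanting to reproduce the PBS test, something along these lines should work (a rough sketch; the queue name and resources are placeholders that have to match the local PBS configuration):

```python
# Spin up dask workers as PBS jobs and connect a client to them.
from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(
    queue="workq",       # placeholder queue name
    cores=8,             # cores per job
    memory="16GB",       # memory per job
    walltime="01:00:00",
)
cluster.scale(jobs=2)    # submit two PBS jobs as dask workers
client = Client(cluster)
print(client.dashboard_link)  # the monitor panel mentioned above
```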
Even in that case you would want to send two jobs to that node. I haven't even tried to make dask + multi-GPU work at the same time because it seems redundant to me (and because it scares me tbh).
If you are happy with this, I'll merge.
Fine by me, and the instructions are clear.
This seems to work on a single computer. I'll try it on galileo as soon as I am able to.
As far as I understand, the point of the matter is to have a `run_event` per distributed system, with the dask client connecting to the appropriate one. The way it works is by sending one job per chunk of data, while the master node / central server collects all the data and gives you the results.
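A toy version of that scheme, with `run_event` and `generate_chunk` as stand-ins for the actual integrand machinery in the package, would look like:

```python
# One dask job per chunk of events, with the central client gathering the
# partial results and combining them.
import numpy as np
from dask.distributed import Client, LocalCluster

def run_event(chunk):
    """Stand-in for the per-chunk computation run on each worker."""
    return np.sum(np.sin(chunk))

def generate_chunk(n_events, seed):
    """Stand-in for generating one chunk of input data."""
    rng = np.random.default_rng(seed)
    return rng.random(n_events)

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=4))
    # one future (job) per chunk of data
    futures = [client.submit(run_event, generate_chunk(10_000, seed=i)) for i in range(8)]
    # the master node / central server collects all partial results
    partials = client.gather(futures)
    print(sum(partials))
```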
At first I thought "this is so simple we should use this instead of joblib", but then I realised it not only complicates picklability and device selection, but also that the `distributed` package from pip was not working in dom... (the one from Arch is), so for now I prefer to keep it as a completely separate option.