Closed: scarlehoff closed this 4 years ago
Great, looks good. I will give it a try on other IT infrastructures.
In indaco I've had the same problems as with dom, so definitely not having this in the main package. The pickling is also very tricky and it only seems to work with tf > 2.2.
In any case, the way this needs to be done is by passing a dask cluster object to, for instance, the compile call. I'll add an example and then have the docs point to the list of systems supported by dask.
The advantage is that, by doing it that way, we are compatible with every queue system dask supports.
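To make the intended usage concrete, here is a minimal sketch of building the cluster object. The `cluster=` keyword on the compile call is hypothetical and only illustrates the idea of handing the dask handle to the library:

```python
# Build the dask cluster object that would be handed to the compile call.
# The LocalCluster is only for testing; on a batch system one of the
# dask-jobqueue clusters (SLURMCluster, PBSCluster, ...) plays the same role.
from dask.distributed import Client, LocalCluster
# from dask_jobqueue import SLURMCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
# cluster = SLURMCluster(queue="regular", cores=8, memory="16GB", walltime="02:00:00")
# cluster.scale(jobs=4)  # ask the scheduler for 4 worker jobs

client = Client(cluster)
print(client.dashboard_link)  # dask monitor panel

# The integrator would then receive this handle, e.g. (hypothetical keyword):
# integrator.compile(integrand, cluster=client)
```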
The latest commit is working in indaco. I have to say I'm very happy with dask: other than the expected pitfalls of passing objects around through sockets, everything works as advertised.
This is ready for review. If you have access to a non-slurm workload manager it would be helpful to have a second example; if not, I think this one is enough.
Very good, I have tried the local cluster (the dask monitor panel seems to work fine) and the PBSCluster; both cases are working fine. Just wondering if we have some multi-GPU nodes in some cluster (maybe marco?); if not, we can try to rent and configure slurm on some cloud machines.
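For anyone wanting to reproduce the PBS test, something along these lines should work (a rough sketch; the queue name and resources are placeholders that have to match the local PBS configuration):

```python
# Spin up dask workers as PBS jobs and connect a client to them.
from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(
    queue="workq",       # placeholder queue name
    cores=8,             # cores per job
    memory="16GB",       # memory per job
    walltime="01:00:00",
)
cluster.scale(jobs=2)    # submit two PBS jobs as dask workers
client = Client(cluster)
print(client.dashboard_link)  # the monitor panel mentioned above
```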
Even in that case you would want to send two jobs to that node. I haven't even tried to make dask + multi-GPU work at the same time because it seems redundant to me (and because it scares me tbh).
If you are happy with this, I'll merge.
Fine by me, and the instructions are clear.
This seems to work on a single computer. I'll try it on galileo as soon as I am able to.
As far as I understand, the point of the matter is to have a `run_event` per distributed system, with the dask client connecting to the appropriate one. The way it works is by sending one job per chunk of data, while the master node / central server collects all the data and gives you the results.
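A toy version of that scheme, with `run_event` and `generate_chunk` as stand-ins for the actual integrand machinery in the package, would look like:

```python
# One dask job per chunk of events, with the central client gathering the
# partial results and combining them.
import numpy as np
from dask.distributed import Client, LocalCluster

def run_event(chunk):
    """Stand-in for the per-chunk computation run on each worker."""
    return np.sum(np.sin(chunk))

def generate_chunk(n_events, seed):
    """Stand-in for generating one chunk of input data."""
    rng = np.random.default_rng(seed)
    return rng.random(n_events)

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=4))
    # one future (job) per chunk of data
    futures = [client.submit(run_event, generate_chunk(10_000, seed=i)) for i in range(8)]
    # the master node / central server collects all partial results
    partials = client.gather(futures)
    print(sum(partials))
```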
At first I thought "this is so simple we should use this instead of joblib", but then I realised it not only complicates picklability and device selection, but also that the `distributed` package from pip was not working in dom... (the one from Arch is), so for now I prefer to keep it as a completely separate option.