Closed by renatobellotti 2 years ago
See autonomio/talos#55, autonomio/talos#100, and autonomio/talos#131 for more info on this. I think Horovod will end up being the solution we adopt for this.
+1 on Horovod. I am currently trying to figure out how to use it in hyperparameter search and will report back if I have something worth sharing.
Wonderful. This would be a great contribution!
Parallelism is currently the most common label among the open issues, so I think there will be some action on this front soon. Maybe something already for 0.6.4(ish).
Related: autonomio/talos#100 autonomio/talos#131
@mikkokotila I'm willing to help test new features if we get a branch going. I have multiple GPUs available to run model-data parallelism.
@awilliamson perfect and thanks. I will update here once I get started.
Have you had time to work on this yet?
@renatobellotti not yet, but I'm getting a new machine in the next week or so, which will be a great time to get started :)
Excellent, glad to hear that! :)
I'd like to make a suggestion: not everybody has access to GPU clusters and/or cloud infrastructure (privacy, independence of research, etc.). It would therefore be very valuable to take a flexible approach that allows parallel execution on both CPU and GPU clusters. It seems to be possible to run TensorFlow sessions in parallel; perhaps this could be combined with a task queue library or similar.
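To make the task-queue idea a bit more concrete, here is a minimal sketch using only Python's standard library. It is not how talos works internally: `expand_grid`, `train_and_evaluate`, and `run_parallel` are hypothetical names, and the scoring function is a toy stand-in so the example runs without TensorFlow. A real worker would presumably build and train its own model inside `train_and_evaluate`.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from itertools import product


def expand_grid(space):
    """Turn {"lr": [...], "units": [...]} into a list of parameter dicts."""
    keys = sorted(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]


def train_and_evaluate(params):
    """Hypothetical stand-in for building, training and scoring one model.

    Assumption: a real version would create its own model/session here,
    so that each worker owns its resources.
    """
    # Toy score so the sketch runs without any ML library installed.
    score = -abs(params["lr"] - 0.01) - abs(params["units"] - 64) / 1000
    return params, score


def run_parallel(space, executor):
    """Submit one task per parameter combination; return results, best first."""
    futures = [executor.submit(train_and_evaluate, p)
               for p in expand_grid(space)]
    results = [f.result() for f in as_completed(futures)]
    return sorted(results, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    space = {"lr": [0.001, 0.01, 0.1], "units": [32, 64]}
    # Two worker processes; each handles one parameter combination at a time.
    with ProcessPoolExecutor(max_workers=2) as pool:
        best_params, best_score = run_parallel(space, pool)[0]
    print(best_params, best_score)
```

Because `run_parallel` only relies on the `submit` interface, any `concurrent.futures`-style executor could be swapped in, which keeps the CPU-versus-GPU decision outside the search loop.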
I 100% agree with support for CPU multi-tenant parallelism. In fact, many common data science challenges run faster on CPU than on a high-end GPU (!?)
Another thought: the most flexible solution would probably be to implement the model training/evaluation tasks as jobs in a scheduling system, e.g. Slurm. The user could then specify whether CPU or GPU nodes should be used, and so on.
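As a rough illustration of what this could look like, the sketch below generates a Slurm job-array script with one array task per parameter combination. The `#SBATCH` options and `$SLURM_ARRAY_TASK_ID` are standard Slurm, but `build_sbatch_script` and the `run_trial.py` worker script are hypothetical names invented for this example; nothing here is an existing talos feature.

```python
def build_sbatch_script(n_trials, use_gpu=False, cpus_per_task=4,
                        command="python run_trial.py"):
    """Return the text of a Slurm job-array script with one task per trial.

    Assumption: `command` points at a hypothetical worker script that
    accepts `--index` and trains the parameter combination with that index.
    """
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=scan",
        f"#SBATCH --array=0-{n_trials - 1}",         # one array task per trial
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        "#SBATCH --output=scan_%A_%a.out",           # %A = job ID, %a = task index
    ]
    if use_gpu:
        lines.append("#SBATCH --gres=gpu:1")         # request one GPU per task
    lines += ["", f"{command} --index $SLURM_ARRAY_TASK_ID"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # CPU-only run over six parameter combinations.
    print(build_sbatch_script(n_trials=6, use_gpu=False))
```

The generated text could be written to a file and submitted with `sbatch`; the user's choice between CPU and GPU nodes then reduces to the `use_gpu` flag.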
Has there been any progress?
I think connecting to something like slurm would be super useful!
Merging with autonomio/jako#10
Is there a way to perform different scan tasks in parallel?