autonomio / talos

Hyperparameter Experiments with TensorFlow and Keras
https://autonom.io
MIT License

[FEATURE REQUEST] Parallel execution of the scan #332

Closed renatobellotti closed 2 years ago

renatobellotti commented 5 years ago

Is there a way to perform different scan tasks in parallel?

mikkokotila commented 5 years ago

See autonomio/talos#55, autonomio/talos#100, and autonomio/talos#131 for more info on this. I think Horovod will end up being the solution we adopt for this.

ktokolwiek commented 5 years ago

+1 on Horovod. I am currently trying to figure out how to use it in hyperparameter search and will report back if I have something worth sharing.

mikkokotila commented 5 years ago

Wonderful. This would be a great contribution!

bravenoob commented 5 years ago

https://stackoverflow.com/questions/57027924/talos-multi-gpu-feature

mikkokotila commented 5 years ago

Parallelism is currently the most common label in the open issues, so I think there will be some action on this front soon. Maybe something will already land in 0.6.4(ish).

Related: autonomio/talos#100 autonomio/talos#131

awilliamson commented 5 years ago

@mikkokotila I'm willing to help test new features if we get a branch going. I have multiple GPUs available to run model-data parallelism.

mikkokotila commented 5 years ago

@awilliamson perfect and thanks. I will update here once I get started.

renatobellotti commented 5 years ago

Have you had time to work on this yet?

mikkokotila commented 5 years ago

@renatobellotti not yet, but I'm getting a new machine in the next week or so, which will be a great time to get started :)

renatobellotti commented 5 years ago

Excellent, glad to hear that! :)

I'd like to make a suggestion: Not everybody has access to GPU clusters and/or cloud infrastructure (privacy, independence of research, etc.). Therefore, it'd be very valuable to go for a flexible approach that allows parallel execution on both CPU and GPU clusters. It seems to be possible to run TensorFlow sessions in parallel; perhaps this can be used in combination with a task queue library or something similar.
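
To make the idea concrete, here is a minimal sketch of evaluating each parameter combination in its own worker using only Python's standard library. It is not how talos implements anything; `expand_grid`, `toy_evaluate`, and `run_parallel` are hypothetical names made up for this illustration, and `toy_evaluate` is just a placeholder for building and training one model.

```python
import itertools
from concurrent.futures import (ProcessPoolExecutor, ThreadPoolExecutor,
                                as_completed)


def expand_grid(params):
    """Turn {'lr': [0.1, 0.01], 'units': [16, 32]} into a list of dicts."""
    keys = sorted(params)
    return [dict(zip(keys, values))
            for values in itertools.product(*(params[k] for k in keys))]


def toy_evaluate(config):
    """Hypothetical stand-in for training one model.

    A real version would build and fit a Keras model with `config`
    and return its validation score.
    """
    return -abs(config["lr"] - 0.01) - abs(config["units"] - 32) / 100.0


def run_parallel(evaluate, params, max_workers=2,
                 executor_cls=ProcessPoolExecutor):
    """Evaluate every configuration in a separate worker, best score first."""
    results = []
    with executor_cls(max_workers=max_workers) as pool:
        # Map each future back to the configuration it is evaluating.
        futures = {pool.submit(evaluate, cfg): cfg
                   for cfg in expand_grid(params)}
        for future in as_completed(futures):
            results.append((futures[future], future.result()))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Calling `run_parallel(toy_evaluate, {"lr": [0.1, 0.01], "units": [16, 32]})` from a script guarded by `if __name__ == "__main__":` would evaluate the four combinations in separate processes. Passing `executor_cls=ThreadPoolExecutor` swaps in threads, and a task queue library could take the place of the executor for multi-machine setups.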

mikkokotila commented 5 years ago

I 100% agree with support for CPU multi-tenant parallelism. In fact, many common data science challenges run faster on a CPU than on a high-end GPU (!?)

renatobellotti commented 4 years ago

Another thought that came to me: Probably, the most flexible solution could be to implement the model training/evaluation tasks as jobs in a scheduling system, e.g. Slurm. Then the user can provide information on whether CPU/GPU nodes should be used, etc.
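
As a rough sketch of what that could look like, the function below writes one Slurm batch script per parameter combination and lets the user choose between CPU and GPU nodes. Everything here is an assumption for illustration: `make_sbatch_script` is a made-up helper, and `train_one.py` is a hypothetical script that would read a JSON config, train one model, and store its score. Only standard `#SBATCH` options are used.

```python
import json
import shlex


def make_sbatch_script(config, job_id, use_gpu=False, cpus=4,
                       train_script="train_one.py"):
    """Return the text of a Slurm batch script for one configuration.

    `train_script` is a hypothetical training entry point; it is not
    part of talos.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name=scan_{job_id}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --output=scan_{job_id}.log",
    ]
    if use_gpu:
        # Request one GPU on the allocated node.
        lines.append("#SBATCH --gres=gpu:1")
    # Pass the configuration as a shell-quoted JSON string.
    payload = shlex.quote(json.dumps(config, sort_keys=True))
    lines.append(f"python {shlex.quote(train_script)} --config {payload}")
    return "\n".join(lines) + "\n"
```

Each generated script could be written to a file and submitted with `sbatch`, leaving queueing and node selection to the scheduler.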

renatobellotti commented 4 years ago

Has there been any progress?

FranzHahn commented 3 years ago

> Another thought that came to me: Probably, the most flexible solution could be to implement the model training/evaluation tasks as jobs in a scheduling system, e.g. Slurm. Then the user can provide information on whether CPU/GPU nodes should be used, etc.

I think connecting to something like Slurm would be super useful!

mikkokotila commented 2 years ago

Merging with autonomio/jako#10