Epistimio / orion

Asynchronous Distributed Hyperparameter Optimization.
https://orion.readthedocs.io

Work Queues #359

Open Delaunay opened 4 years ago

Delaunay commented 4 years ago

Why

The current system is rigid and does not give much control over how work is distributed, nor is it easy to change or build on top of.

Work queues would standardize how work is distributed and how results are collected, enabling us to make Orion db agnostic. This would allow us to leverage other software, such as Dask, to schedule and distribute the training of trials.
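As a sketch of what "db agnostic" could mean in practice: the scheduler would talk to an abstract queue interface, and concrete backends (a database, Redis, Dask, ...) would implement it. The names below (`MessageQueue`, `push`, `pop`, `InMemoryQueue`) are hypothetical illustrations, not Orion's actual API:

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Any, Optional


class MessageQueue(ABC):
    """Hypothetical backend-agnostic queue interface."""

    @abstractmethod
    def push(self, message: Any) -> None:
        """Enqueue a work item or a result."""

    @abstractmethod
    def pop(self) -> Optional[Any]:
        """Dequeue the next item, or return None if the queue is empty."""


class InMemoryQueue(MessageQueue):
    """Trivial in-process backend, useful for tests; a real backend
    would talk to a broker or database behind the same interface."""

    def __init__(self) -> None:
        self._items: deque = deque()

    def push(self, message: Any) -> None:
        self._items.append(message)

    def pop(self) -> Optional[Any]:
        return self._items.popleft() if self._items else None
```

With such an interface, swapping the storage layer means swapping the `MessageQueue` implementation, without touching the HPO or worker code.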

Example

HPO master

queue_uri = 'mq://20.30.120.30:8123'

# Launch workers somewhere; they could be on different nodes using a SLURM job array
workers = [Worker(queue_uri) for i in range(10)]

# Launch the HPO; this can be on another node, the user's own computer, etc.
hpo = HPO(queue_uri, space, function_to_optimize)

hpo_is_independent

Nomad HPO: the HPO itself is a simple task executed by a worker, instantiated only when needed

queue_uri = 'mq://20.30.120.30:8123'

# Tell a worker to start an HPO that will generate work items
put(queue_uri, 'HPO',  space, function_to_optimize)

# Launch workers somewhere; they could be on different nodes using a SLURM job array
workers = [Worker(queue_uri) for i in range(10)]

hpo_is_worker
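The two layouts above differ only in who enqueues the HPO task; the worker loop is the same. A runnable toy version of that loop, using Python's thread-safe in-process `queue.Queue` in place of a real message broker (all names here are illustrative, not Orion's API):

```python
import queue
import threading

work_queue = queue.Queue()    # trials to run
result_queue = queue.Queue()  # (params, objective) pairs

def function_to_optimize(x):
    # Toy objective: minimized at x == 3.
    return (x - 3) ** 2

def worker():
    # Pull trials until a sentinel (None) tells the worker to stop.
    while True:
        params = work_queue.get()
        if params is None:
            break
        result_queue.put((params, function_to_optimize(params)))

# "HPO master": suggest trials, then collect results.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for x in range(10):           # stand-in for space.suggest()
    work_queue.put(x)
for _ in threads:
    work_queue.put(None)      # one stop sentinel per worker

for t in threads:
    t.join()

results = [result_queue.get() for _ in range(10)]
best = min(results, key=lambda r: r[1])
```

Because workers only see the queue, they do not care whether trials were enqueued by a long-lived HPO master or by an HPO task that was itself pulled off the queue.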

bouthilx commented 4 years ago

Agreed! Once we tackle #358, this refactoring of the workers into a work queue should be much simpler.

guillaume-chevalier commented 4 years ago

I think you'll want to check this out: https://github.com/Neuraxio/Neuraxle/issues/221

guillaume-chevalier commented 4 years ago

@bouthilx @Delaunay we also have done a first pass of coding what I've referred to in the issue above: https://github.com/Neuraxio/Neuraxle/pull/242