Neeratyoy opened 8 months ago
This happens when the process is force-killed during the evaluation of a config, and is reproducible with a single process.
To reproduce:

1. Use Random Search.
2. Use a `run_pipeline(...)` function which takes a relatively long time compared to the algorithm overhead, e.g. `time.sleep(10)`.
3. Run `neps.api.run`. Arguments don't matter; this should reproduce regardless (a minimal sketch follows this list).
4. Force-kill the process after it logs `Start evaluating config ...`.
5. If that config's folder has no `result.yaml` file, you have successfully interrupted an evaluation; otherwise, refine steps 1 and 2 to increase your chance of terminating during evaluation.
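A minimal sketch of steps 1-3, assuming the standard `neps.run` entry point and `FloatParameter` search-space API; the `searcher="random_search"` string and directory name are assumptions, adjust to your version:

```python
import time
import neps

def run_pipeline(x):
    # Evaluation is slow relative to the optimizer overhead, so a
    # force-kill is very likely to land mid-evaluation.
    time.sleep(10)
    return x  # dummy loss

pipeline_space = dict(
    x=neps.FloatParameter(lower=0.0, upper=1.0),
)

neps.run(
    run_pipeline=run_pipeline,
    pipeline_space=pipeline_space,
    root_directory="neps_root",   # hypothetical directory name
    max_evaluations_total=20,
    searcher="random_search",     # assumed way to select Random Search
)
```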
Alternatively, you can skip steps 1-5 and manually delete a `result.yaml` file from any config folder to make NePS think that there is a pending config some mysterious other process is handling right now.
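For that shortcut, a sketch under the assumption that completed configs live at `<root_directory>/results/config_<id>/result.yaml` (adjust the path to your run's layout):

```python
from pathlib import Path

# Hypothetical run layout: <root_directory>/results/config_<id>/result.yaml
result = Path("neps_root/results/config_16/result.yaml")
result.unlink()  # NePS now treats config_16 as pending
```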
For potential reproducibility of the observed issue: I ran 20 (`max_evaluations_total=20`) evaluations distributed across 4 workers (launch pattern sketched below).
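The launch pattern was simply the same script started several times against a shared `root_directory` (NePS's usual multi-worker setup); a sketch with a hypothetical script name:

```python
import subprocess

# Start 4 workers; "run_neps.py" is a hypothetical name for the script above.
workers = [subprocess.Popen(["python", "run_neps.py"]) for _ in range(4)]
for worker in workers:
    worker.wait()
```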
The overall run completed fine, but I noticed certain anomalies, as described below:
- Config `21` was generated while config ID `16` was not re-evaluated or completed and remains `pending` forever.

Some more observations:
- With `max_evaluations_total=20` we should have config IDs from 1-20, with each of them having their own `result.yaml`.
- `config_16` does not have a `result.yaml`, whereas `config_21` does.
- With `max_evaluations_total=21`, the run now satisfies that extra required evaluation by sampling a new config, `config_22`.
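A quick check that surfaces the anomaly, assuming the same `config_<id>`/`result.yaml` layout as in the sketches above:

```python
from pathlib import Path

root = Path("neps_root/results")  # hypothetical root_directory from the sketch above
for cfg in sorted(root.glob("config_*"), key=lambda p: int(p.name.split("_")[1])):
    status = "done" if (cfg / "result.yaml").exists() else "PENDING"
    print(f"{cfg.name}: {status}")
# Here this prints 21 folders, with config_16 as the only PENDING one.
```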
Should a new worker re-evaluate pending configs as a priority? Also, under this scenario the generated config IDs range over `[1, n+1]` if `max_evaluations_total=n`.