automl / neps

Neural Pipeline Search (NePS): Helps deep learning experts find the best neural pipeline.
https://automl.github.io/neps/
Apache License 2.0

[Bug] Parallelism and seeding #108

Closed Neeratyoy closed 1 week ago

Neeratyoy commented 1 week ago

How the seed is set prior to the neps.run() call, and how the different neps.run() workers are then spawned, changes the effect of the seeding.
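To make that concrete, here is a minimal, NePS-free sketch of the two launch patterns being compared. The seed_everything helper is only a stand-in for whatever seeding hpo_target.py does before neps.run(); it is not code from the repository.

import multiprocessing
import random


def seed_everything(seed: int) -> None:
    # Stand-in for whatever seeding hpo_target.py does before neps.run().
    random.seed(seed)


def worker(worker_id: int) -> None:
    # Stand-in for a worker that would call neps.run(); here we only print
    # the first "sampled" value under whatever RNG state the process has.
    print(f"worker {worker_id}: first sample = {random.random():.6f}")


if __name__ == "__main__":
    # Pattern A (nohup-style): the script is executed 4 times independently,
    # so every process calls seed_everything(42) itself and all four workers
    # start from an identical RNG state.
    #
    # Pattern B (multiprocessing, shown below): the seed is set once in the
    # parent and the 4 workers are spawned afterwards. Whether the children
    # inherit the parent's RNG state depends on the start method ("fork"
    # copies it, "spawn" re-imports the module and does not).
    seed_everything(42)
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()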

Issue example

[Plot: parallel_debug — comparison of random_search, random_search_nohup, and random_search_multiprocessing]

In this plot, random_search is a single-worker run, while the other two are the same NePS setup launched in different ways to create workers. The two parallel runs behave vastly differently from each other; only random_search_multiprocessing shows an initial speedup, which is what one should expect at early budgets when parallelizing random search.

Desired setting

Both the random_search_* lines should be exactly the same and provide early speedups over random_search.

Reproducibility steps

The following steps reproduce the issue:

# Setup a Python3.10 environment through conda or venv

git clone https://github.com/Neeratyoy/neps_template.git
cd neps_template/
pip install -r requirements.txt

To run the single worker baseline:

# Single worker run

python hpo_target.py --algo run_rs

To run the same but with 4 workers:

# 4 worker run through multiple executions of the main script (desired use)

python hpo_target.py --algo run_rs_nohup & \
  python hpo_target.py --algo run_rs_nohup & \
  python hpo_target.py --algo run_rs_nohup & \
  python hpo_target.py --algo run_rs_nohup &

Now running the same using multiprocessing:

# 4 worker run where the main script relies on Python subprocess to spawn workers

python hpo_target.py --algo run_rs_multiprocessing --n_workers 4

NOTE: The only difference between run_rs, run_rs_nohup, and run_rs_multiprocessing is the output path; everything else, i.e., the NePS run itself, is identical.
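For reference, here is a minimal sketch of what a multiprocessing-based launcher along these lines could look like. This is not the actual run_rs_multiprocessing code from the template; the objective, the search space, and the neps.run() argument names (run_pipeline, pipeline_space, root_directory, max_evaluations_total, searcher) follow the public NePS examples and may differ from hpo_target.py.

import argparse
import multiprocessing

import neps


def run_pipeline(learning_rate: float, num_layers: int) -> float:
    # Placeholder objective; the real hpo_target.py trains a model here.
    return learning_rate * num_layers


# Illustrative space; the real search space lives in hpo_target.py.
pipeline_space = {
    "learning_rate": neps.FloatParameter(lower=1e-5, upper=1e-1, log=True),
    "num_layers": neps.IntegerParameter(lower=1, upper=5),
}


def worker(root_directory: str) -> None:
    # Every worker points at the same root_directory; NePS coordinates the
    # workers through the state it writes there.
    neps.run(
        run_pipeline=run_pipeline,
        pipeline_space=pipeline_space,
        root_directory=root_directory,
        max_evaluations_total=50,
        searcher="random_search",
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_workers", type=int, default=4)
    args = parser.parse_args()

    procs = [
        multiprocessing.Process(
            target=worker, args=("neps_output/random_search_multiprocessing",)
        )
        for _ in range(args.n_workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()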

To plot:

python plot_neps.py  \
  --root_directory neps_output/ \
  --log_y \
  --algos random_search random_search_nohup random_search_multiprocessing \
  --filename parallel_debug

eddiebergman commented 1 week ago

Some preliminary investigation, where I reduced the search space to just tiny models and only ran 8 configurations total.

As for the timing mismatch, I don't know why it shows up in your particular case, but the timings are quite noisy across configurations. For example, a config with many layers and neurons, if it gets repeated on all 4 workers, is going to be slower than 4 tiny configs on a single worker. Either way, all workers are getting work to do, so I don't think there's a real multiprocessing slowdown anywhere (other than syncing stages between workers).

Will do a bit more debugging.

eddiebergman commented 1 week ago

[Figure: Figure_1]

Still investigating the timing issue, but I imagine it's some measurement error, because from running both, multiprocessing is definitely faster. It might also have been conflated with my comment above.

As for the weird thing with the orange bar, that's an artifact of evaluations returning out of order. All configurations sampled had the correct hyperparameters in the correct order.

eddiebergman commented 1 week ago

Timing it manually, I found that 4 workers were faster, but of course not 4x faster. It seems that a single worker was keeping two full cores busy, with sporadic bursts on my other cores. With 4 parallel workers, all 8 of my cores stayed full.

My guess is that the bottleneck has something to do with data loading not being optimized for use in a parallel setting.
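If that guess holds, one thing worth trying would be to cap the CPU resources each worker's training loop grabs, so that 4 NePS workers don't oversubscribe the same cores. A rough sketch; the thread and DataLoader worker counts below are illustrative guesses, not measured values.

import os

# Limit OpenMP/MKL threads before torch is imported so the setting takes effect.
os.environ.setdefault("OMP_NUM_THREADS", "2")

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Keep intra-op threading per NePS worker small so 4 workers can share
    # 8 cores instead of each trying to saturate all of them.
    torch.set_num_threads(2)

    # Dummy data standing in for whatever hpo_target.py actually loads.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

    # A couple of DataLoader workers per NePS worker; more would again compete
    # for the same cores once several NePS workers run side by side.
    loader = DataLoader(dataset, batch_size=64, num_workers=2)

    for batch_x, batch_y in loader:
        pass  # the training step would go here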