bbudescu opened 1 week ago
Hi @bbudescu, thanks for raising this issue and providing potential solutions. As far as I can see, the solutions you proposed mostly relate to the RF. We are also planning to replace the current RF packages: https://github.com/automl/SMAC3/issues/1116
However, the new RF model might still rely on existing packages (e.g., sklearn). Since we are only a small team maintaining SMAC, we might not have enough manpower to customize a new random forest package. However, if you have any good ideas on how to implement this efficiently (in Python), or if you would like to create a PR for the RF replacement, we are happy to help you integrate it into SMAC.
Hi @dengdifan,
Thanks for your reply. Yes, I am aware of #1116; however, the main point I was trying to get across in this issue was making the ask operation asynchronous w.r.t. RF model training, i.e., being able to query the existing RF model for a new configuration to try at any time (even if the model is lagging behind, i.e., it hasn't been updated with the very latest results), rather than being forced to wait until training finishes. It's better to use the CPU cores than to keep them unoccupied more than half of the time: even if the stale model isn't looking in the best places, it still explores the configuration space, which is preferable to doing nothing at all.
Now, I haven't looked into the code, but I assume this doesn't depend on the choice of random forest package, only on running RF training in a separate thread. Perhaps one of my optional suggestions, namely the one about adding GPU support, might be relevant to the choice of RF implementation.
Motivation
Here's how the CPU load graph looks for a multi-objective optimization session using the multi-fidelity facade that ran for about 46 hours on a 64-core machine (no hyperthreading, `n_workers=64`), finishing almost 20k trials on a bit over 16k distinct configurations (two rungs). One can see that CPU utilization decreases to less than 50% after the first 12 hours. It then drops to under 40% after another 10 hours (by this time, 12.6k trials had finished in total).
Previous Discussion
I thought that another cause of this performance degradation might be Hyperband, and that using ASHA (https://github.com/automl/SMAC3/issues/1169) instead would help eliminate that hypothesis. However, after @eddiebergman's https://github.com/automl/SMAC3/issues/1169#issuecomment-2493662911, I understand the problem is caused by workers waiting to get another suggestion from the surrogate model.
Potential solution