bbudescu opened 1 week ago
Hi @bbudescu, thanks for raising this issue and providing potential solutions. As far as I can see, the solutions you proposed mostly relate to the RF. We are also planning to replace the current RF packages: https://github.com/automl/SMAC3/issues/1116
However, the new RF model might still rely on existing packages (e.g., sklearn). Since we are only a small team maintaining SMAC, we might not have enough manpower to customize a new random forest package. However, if you have any good ideas on how to implement this efficiently (in Python), or if you would like to create a PR for the RF replacement, we are happy to help you integrate it into SMAC.
Hi @dengdifan,
Thanks for your reply. Yes, I am aware of #1116; however, the main point I was trying to get across in this issue was making the ask operation asynchronous w.r.t. RF model training, i.e., being able to query the existing RF model for a new configuration to try at any time (even if the model is lagging behind, i.e., it hasn't been updated with the very latest results), rather than being forced to wait until training finishes. It's better to use the CPU cores than to keep them unoccupied more than half of the time: even if the stale model isn't looking in the best places, it still explores the configuration space, which is preferable to doing nothing at all.
Now, I haven't looked into the code, but I assume this doesn't depend on the choice of random forest package, only on running RF training in a separate thread. Perhaps one of my optional suggestions, namely the one about adding GPU support, might be relevant to the choice of RF implementation.
Motivation
Here's how the CPU load graph looks for a multi-objective optimization session using the multi-fidelity facade that ran for about 46 hours on a 64-core machine (no hyperthreading, `n_workers=64`), finishing almost 20k trials on a bit over 16k distinct configurations (two rungs). One can see that CPU utilization decreases to less than 50% after the first 12 hours. It then drops to under 40% after another 10 hours (by this time, 12.6k trials had finished in total).
Previous Discussion
I thought that another cause of this performance degradation might be Hyperband, and that using ASHA (https://github.com/automl/SMAC3/issues/1169) instead would help eliminate that hypothesis. However, after @eddiebergman's https://github.com/automl/SMAC3/issues/1169#issuecomment-2493662911, I understand the problem is caused by workers waiting to get another suggestion from the surrogate model.
Potential solution