automl / neps

Neural Pipeline Search (NePS): Helps deep learning experts find the best neural pipeline.
https://automl.github.io/neps/
Apache License 2.0

Multi fidelity is not "multi" for some fidelity boundaries #138

Open AwePhD opened 3 weeks ago

AwePhD commented 3 weeks ago

Hi,

The multi-fidelity by PriorBand is not applied correctly, probably because I did not configure the optimizer. I would like to know whether this is normal or whether I misunderstood something about the multi-fidelity setup in NePS.

Here is a Python script that reproduces the behaviour with neps 0.12.2:

from pathlib import Path
from neps import IntegerParameter, FloatParameter, run

MIN_EPOCH, MAX_EPOCH, TOTAL_BUDGET_EPOCHS = ..., ..., ...

pipeline_space = {
    # "epoch" is the fidelity parameter; p1 and p2 are regular hyperparameters with defaults
    "epoch": IntegerParameter(lower=MIN_EPOCH, upper=MAX_EPOCH, is_fidelity=True),
    "p1": IntegerParameter(lower=5, upper=15, default=10),
    "p2": FloatParameter(lower=.5, upper=5, default=3),
}

def run_pipeline(epoch, p1, p2) -> dict | float:
    # Toy objective: loss shrinks with more epochs; cost is the number of epochs spent
    loss = (p1 + p2) / epoch

    return {"loss": loss, "cost": epoch}

def main():
    run(
        run_pipeline=run_pipeline,
        pipeline_space=pipeline_space,
        root_directory=Path("/tmp/debug_neps_fidelity"),
        max_cost_total=TOTAL_BUDGET_EPOCHS,
    )

if __name__ == '__main__':
    main()

When I run with various values for the boundaries, I get different numbers of fidelity levels (they are called rungs, right?):

# 9 configs, all with 200 epochs
MIN_EPOCH, MAX_EPOCH, TOTAL_BUDGET_EPOCHS = 80, 200, 2_000
# 11 configs: 2 configs reach 3 trials, 2 reach 2 trials, and the rest have a single trial
MIN_EPOCH, MAX_EPOCH, TOTAL_BUDGET_EPOCHS = 1, 10, 40

The first example matters for debugging, when I take a tiny subset of my dataset: I need many epochs to overfit the model (300M+ parameters) on a few samples, and below 80 epochs the model does not learn at all. Is the problem that the fidelity starts at 80? Is this the expected (multi-)fidelity strategy? Is it because the ratio used to compute the number of rungs is max_fidelity/min_fidelity? (I read that in the PriorBand paper, if I remember correctly.)

Should I set the eta parameter? By the way, I am happy with the second example, which corresponds to an actual training on the whole dataset.
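
For reference, here is a quick back-of-the-envelope check of the rung count for both settings, assuming rungs are spaced geometrically by eta between the fidelity bounds as in HyperBand (my assumption; the exact rule NePS/PriorBand uses may differ):

import math

# Assumption: rungs are spaced by a factor of eta between min and max fidelity,
# as in HyperBand; NePS/PriorBand may use a slightly different rule.
def approx_num_rungs(min_fid: float, max_fid: float, eta: int = 3) -> int:
    return math.floor(math.log(max_fid / min_fid, eta)) + 1

print(approx_num_rungs(80, 200))  # 1  (log_3(2.5) ≈ 0.83 -> floor 0, +1)
print(approx_num_rungs(1, 10))    # 3  (log_3(10) ≈ 2.10 -> floor 2, +1)

With a lower bound of 80, the next rung below 200 would be around 200 / 3 ≈ 66, which already falls outside the bounds, so everything would run at 200 epochs.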

AwePhD commented 3 weeks ago

Another addition to these multi-fidelity settings: if I do not use max_cost_total and use the simpler max_evaluations_total instead, PriorBand only evaluates at the maximum fidelity (200).

from pathlib import Path
from neps import IntegerParameter, FloatParameter, run

WORKDIR_PATH = "/tmp/debug_neps_fidelity"

MIN_EPOCH, MAX_EPOCH = 80, 200
TOTAL_EVALUATIONS = 20

pipeline_space = {
    "epoch": IntegerParameter(lower=MIN_EPOCH, upper=MAX_EPOCH, is_fidelity=True),
    "p1": IntegerParameter(lower=5, upper=15, default=10),
    "p2": FloatParameter(lower=.5, upper=5, default=3),
}

def run_pipeline(epoch, p1, p2) -> dict | float:
    loss = (p1 + p2) / epoch

    return loss

def main():
    run(
        run_pipeline=run_pipeline,
        pipeline_space=pipeline_space,
        root_directory=Path(WORKDIR_PATH),
        max_evaluations_total=TOTAL_EVALUATIONS,
    )

if __name__ == '__main__':
    main()

Once again, I probably misunderstand something about the choice of fidelity. Maybe this behavior is OK.

eddiebergman commented 2 weeks ago

@Neeratyoy

Neeratyoy commented 1 week ago

Hi @AwePhD, sorry for the late response and thanks for the reproducible example.

Looking at the example and what you described, the behaviour you see is expected. The short answer is that when MAX_BUDGET is 200 and ETA is 3 (the default), the first approximated fidelity below the top is roughly 200 // 3 = 66, which is below MIN_BUDGET (= 80) in your first case. Hence, it defaults to just one rung, that is, MAX_BUDGET.

You can use this function to approximately check what will happen with HyperBand budgets given your fidelity bounds:

import math

def check_budget_levels(MIN_EPOCH, MAX_EPOCH, ETA=3):
    # Start from the maximum fidelity and repeatedly divide by ETA
    # until we drop below the minimum fidelity.
    _min = MAX_EPOCH
    counter = 0
    # Index of the highest fidelity level, from the eta-geometric spacing of rungs
    fid_level = math.ceil(math.log(MAX_EPOCH / MIN_EPOCH) / math.log(ETA))
    while _min >= MIN_EPOCH:
        print(f"Level: {fid_level} -> {_min}")
        _min = _min // ETA
        counter += 1
        fid_level -= 1
    return counter  # number of budget levels that fit within the bounds
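
For the two boundary settings from your examples, this prints:

check_budget_levels(80, 200)
# Level: 1 -> 200   (only one budget level, so everything runs at the top fidelity)

check_budget_levels(1, 10)
# Level: 3 -> 10
# Level: 2 -> 3
# Level: 1 -> 1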

Do you think it would be a better interface if we raised a warning and broke out when there is only one budget level available (like your case 1 with 80, 200) and we cannot really run HyperBand (multi-fidelity)?

AwePhD commented 6 days ago

Hi,

No problem for the delay at all.

Thanks for the function. I come from vision/language deep learning and have little familiarity with HPO research, though I have read some papers. I think it would be good to document the rung values used by the multi-fidelity optimizers, namely how they follow from ETA and the min and max values of the fidelity parameter. Also, in my opinion, the fidelity levels should be logged somewhere: in the console or in .optimizer_info.yaml. I know the eta parameter is stored there, but, as you showed, it takes a bit* of math to get from it to the fidelity levels.

Also, to answer more directly: when a multi-fidelity parameter is passed and only one fidelity level is available, I think a warning should be logged (a rough sketch of such a check is at the end of this comment). When a user marks a parameter as a fidelity, chances are they do not want the number of rungs to be 1. I do not think a breakout is necessary.

*: I know that this computation is written in the HyperBand paper and probably in other papers using multi-fidelity. But, in my opinion, a practical user of NePS who does not come from the HPO community should not have to read a paper to use the code. It is up to you, though; maybe you reasonably expect NePS users to have some knowledge of the HPO literature. Plus, other multi-fidelity methods may (come to) compute the number of rungs differently. IMO, for a practical HPO user, the first thing that should be accessible is the number of rungs and their values, followed by a place in the documentation explaining how they are computed from eta and the fidelity bounds (or anything else).
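
As a rough sketch of the kind of warning I have in mind (not NePS code, just an illustration reusing the geometric-rung assumption from above):

import math
import warnings

# Hypothetical check, not part of NePS: warn when the fidelity bounds and eta
# leave room for only a single rung, since multi-fidelity then degenerates
# to evaluating every configuration at the maximum fidelity.
def warn_if_single_rung(min_fid: float, max_fid: float, eta: int = 3) -> None:
    n_rungs = math.floor(math.log(max_fid / min_fid, eta)) + 1
    if n_rungs <= 1:
        warnings.warn(
            f"Fidelity bounds ({min_fid}, {max_fid}) with eta={eta} allow only one rung; "
            "all configurations will be evaluated at the maximum fidelity."
        )

warn_if_single_rung(80, 200)  # warns: only one rung fits
warn_if_single_rung(1, 10)    # silent: three rungs fit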