Long execution times when retrieving hyperdrive results

Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft

https://docs.microsoft.com/azure/machine-learning/service/

MIT License


Long execution times when retrieving hyperdrive results #1302

Open jarandaf opened 3 years ago

jarandaf commented 3 years ago

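from azureml.core import Experiment, Workspace

ws = Workspace.from_config()  # assumes a local config.json for the workspace
exp = Experiment(ws, 'hyperdrive')
exp_runs = exp.get_runs()
run_id = '#hyperdrive_run_id'
for run in exp_runs:
    if run.id == run_id:
        break

# this takes quite some time
hyperparams_set = run.get_hyperparameters()
metrics = run.get_metrics()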

We notice long execution times (on the order of 5 minutes) when retrieving hyperdrive results (hyperparameters and metrics). We log simple values and some lists during the hyperdrive step execution, which has no more than 300 child runs. Is this the expected behaviour?
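To see which of the two calls accounts for most of the wait, each one can be timed on its own. The following is a minimal sketch using only the standard library; the `timed` helper is a hypothetical name and not part of the azureml SDK.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), print the elapsed wall-clock time,
    and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


# Usage against the snippet above (needs a real HyperDrive run object):
# hyperparams_set, _ = timed("get_hyperparameters", run.get_hyperparameters)
# metrics, _ = timed("get_metrics", run.get_metrics)
```

Timing the calls separately shows whether the hyperparameter lookup or the metrics retrieval is the slow part.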

lostmygithubaccount commented 3 years ago

I generally find querying AML runs through the Python SDK fairly slow. I will assign the hyperdrive team to check whether anything specific is going on, but I suspect not.

jarandaf commented 3 years ago

Is there any other more performant way to query AML runs?
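One small thing that can be tightened regardless of the metrics slowdown: the loop in the original snippet lists the runs of the experiment until it finds a match, and if nothing matches it silently keeps the last run. A minimal sketch of a stricter lookup follows; `find_run` is a hypothetical helper, and the commented alternative assumes that constructing `Run(exp, run_id)` from `azureml.core` is acceptable for the use case (it returns a plain `Run`, which may lack HyperDrive-specific methods such as `get_hyperparameters()`).

```python
def find_run(runs, run_id):
    """Return the first run whose id equals run_id; raise if none matches."""
    for run in runs:
        if run.id == run_id:
            return run
    raise LookupError(f"No run with id {run_id!r}")


# With the azureml SDK, a run can also be fetched directly by ID, which
# avoids listing the experiment's runs (assumption: a plain Run is enough):
# from azureml.core import Run
# run = Run(exp, run_id)
```

This does not address the slow metrics retrieval itself; it only removes the run listing from the measurement.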

tunayokumus commented 3 years ago

We are also experiencing the same issue. Are there any updates or suggestions on this?

jarandaf commented 3 years ago

The latest conversations I had with the engineering team confirmed the issue, but there is no fix yet AFAIK. Long running times usually appear for array-like metrics (e.g. training loss over epochs). For single-value metrics, the following is a possible workaround; it runs much faster even when other array-like metrics have been logged:

metrics = {run.id:run.get_metrics('<metric_name>') for run in hdrun.get_children()}
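The same workaround can be wrapped in a small function. This is a sketch only: `collect_metric` is a hypothetical name, it relies solely on the `get_children()` and `get_metrics(name=...)` calls used in the one-liner, and the optional thread pool is an untested assumption (whether it helps depends on service-side latency and throttling, and on the run objects being safe to use from several threads).

```python
from concurrent.futures import ThreadPoolExecutor


def collect_metric(parent_run, metric_name, max_workers=1):
    """Return {child_run_id: metrics_dict} for a single named metric.

    With max_workers > 1 the per-child requests are issued from a
    thread pool; this is an assumption, not a measured improvement.
    """
    children = list(parent_run.get_children())

    def fetch(child):
        # Ask the service for one metric only, not every logged metric.
        return child.id, child.get_metrics(name=metric_name)

    if max_workers <= 1:
        return dict(fetch(child) for child in children)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch, children))


# Usage (needs a real HyperDrive parent run):
# metrics = collect_metric(hdrun, '<metric_name>')
```

Requesting a single metric by name is the point of the workaround; the dictionary keyed by child run ID matches the shape produced by the one-liner.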

tunayokumus commented 3 years ago

@jarandaf thanks a lot for the insights 👍 However, with the suggested workaround I got only slightly better performance (5 min vs 6 min) compared to hdrun.get_metrics(name="<metric_name>", recursive=True)

This is with a scalar-valued metric over 1000 child runs.

EDIT: By mistake I apparently applied a trick here and created the hdrun object with the constructor of the Run class. The child class HyperDriveRun for some reason does not accept these arguments, unlike its parent.

tunayokumus commented 3 years ago

Hi @mx-iao, I was advised to loop you in here. Do you have any solution or know anyone who might?