HPO "Trial X failed because of the following error: The value None could not be cast to float."

vitalyk-multinarity commented 1 year ago

Describe the bug

HPO application launches many experiments instead of 10.

To reproduce

I created an HPO app in the ClearML UI with a single parameter to optimize, with minimum value=0.0, max=1.0 and step=1.0. As far as I understand, this HPO should launch 10 experiments. In fact, it's launching tens of experiments (>30 and still running), many of them with the same value.
https://app.clear.ml/applications/hpo-app/info;experimentId=6ee65ed03395446a99f7105110f7208d
Logs and screenshots are attached.

Expected behaviour

The HPO app doesn't stop after 10 experiments.

Log - bfc4cf8c405c4a49b55aeb8cff0b918f.txt
Screenshot 2023-01-08 at 16 02 07
Screenshot 2023-01-08 at 16 01 45

Environment

Server type: app.clear.ml
ClearML SDK Version: Python modules "clearml 1.9.0 clearml-agent 1.5.1"
Log - 6ee65ed03395446a99f7105110f7208d.txt
Python Version: 3.9
OS: Linux in Docker

Related Discussion

https://clearml.slack.com/archives/CTK20V944/p1673177405220439

jkhenning commented 1 year ago

Hi @vitalyk-multinarity ,

Can you try with clearml v1.9.1rc0 ?

vitalyk-multinarity commented 1 year ago

Hi @jkhenning, as far as I can see, the behavior is the same (I upgraded clearml to 1.9.1rc0, restarted clearml-agent, cloned my HPO task and ran it).

But: 1) in my ClearML task I see "clearml 1.9.0" 2) in the 'hpo-app' INSTALLED PACKAGES I see "clearml==1.3.1"

I'm not sure why.

/home/ubuntu/.cache/pypoetry/virtualenvs/keyboard-tracking-l9nr52Rk-py3.9/bin/python -m pip list|grep clearml
clearml                 1.9.1rc0
clearml-agent           1.5.1

And this is the same venv I'm using:

export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=true
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/home/ubuntu/.cache/pypoetry/virtualenvs/keyboard-tracking-l9nr52Rk-py3.9/bin/python  clearml-agent daemon --queue default --detached
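One way to narrow down the version mismatch is to print, from inside the task script itself, which interpreter is running and which package versions it sees. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of ClearML.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The interpreter path shows whether the task really runs in the expected venv.
    print("interpreter:", sys.executable)
    for name in ("clearml", "clearml-agent"):
        print(name, installed_version(name))
```

If the interpreter path printed by the task differs from the poetry venv path above, the agent is not using that venv for the task.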

In general, as far as I understand, my issue is related to how the HPO orchestrator 'understands' sub-task results. So I'm attaching more details:

1) reporting values

clearml_logger = Logger.current_logger()
clearml_logger.report_single_value(name="mean_angular_distance[deg]", value=round(loc_kpi_results.get("mean_angular_distance[deg]"), 3))
clearml_logger.report_single_value(name="mean_translation_rms_mm", value=round(loc_kpi_results.get("mean_translation_rms_mm"), 3))
clearml_logger.report_single_value(name="predicted_ratio", value=round(loc_kpi_results.get("predicted_ratio"), 3))

2) screenshot of sub-task, including 'mean_translation_rms_mm' value used for optimization

vitalyk-multinarity commented 1 year ago

@jkhenning, Fixed this error by using 'report_scalar' instead of "report_single_value". So I guess we can close this issue.
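For anyone hitting the same error, the workaround described above might look roughly like the sketch below. It assumes `clearml` is installed and that the HPO objective is configured with a title and series equal to the metric name; that naming choice, the `iteration=0` default, and the helper names `rounded_metric` and `report_kpis` are illustrative assumptions, not taken from the original script.

```python
from typing import Mapping, Optional

KPI_NAMES = (
    "mean_angular_distance[deg]",
    "mean_translation_rms_mm",
    "predicted_ratio",
)


def rounded_metric(results: Mapping[str, Optional[float]], name: str, ndigits: int = 3) -> float:
    """Fetch a metric and round it, failing loudly if it is missing or None."""
    value = results.get(name)
    if value is None:
        raise ValueError(f"metric {name!r} is missing or None")
    return round(float(value), ndigits)


def report_kpis(results: Mapping[str, Optional[float]], iteration: int = 0) -> None:
    """Report each KPI as a scalar (title/series) instead of a single value."""
    from clearml import Logger  # imported here so the helper above works without clearml

    logger = Logger.current_logger()
    for name in KPI_NAMES:
        # Assumption: title and series both use the metric name; they must match
        # the objective title/series configured in the HPO app.
        logger.report_scalar(
            title=name,
            series=name,
            value=rounded_metric(results, name),
            iteration=iteration,
        )
```

Calling `report_kpis(loc_kpi_results)` at the end of the sub-task would replace the three `report_single_value` calls.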

Thank you, Vitaly

jkhenning commented 1 year ago

Thanks @vitalyk-multinarity. I think we'll take a look at this regardless to understand why this error shows in that scenario.

eugen-ajechiloae-clearml commented 1 year ago

Hi @vitalyk-multinarity! Regarding "HPO apps doesn't stop after 10 experiments": this is likely because you didn't set Configuration -> Limit Total HPO Experiments to 10 (please correct me if I'm wrong). Regardless, we might want to limit the number of experiments to the minimum of total_experiments and product((max_val - min_val) / step for each parameter to optimize).
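The proposed cap can be sketched as follows. This is only an illustration of the formula in the comment above, not ClearML's implementation; the function names and the `inclusive` flag (which counts both range endpoints) are assumptions added for the example.

```python
import math
from typing import Iterable, Tuple

# (min_val, max_val, step) for one parameter
ParamRange = Tuple[float, float, float]


def values_per_param(min_val: float, max_val: float, step: float, inclusive: bool = False) -> int:
    """Count the steps in a range: (max_val - min_val) / step, rounded down.

    A small epsilon guards against floating point error. With inclusive=True,
    both endpoints are counted, which adds one value.
    """
    steps = math.floor((max_val - min_val) / step + 1e-9)
    return steps + 1 if inclusive else steps


def experiment_cap(total_experiments: int, params: Iterable[ParamRange], inclusive: bool = False) -> int:
    """Return min(total_experiments, product of per-parameter value counts)."""
    grid_size = math.prod(values_per_param(lo, hi, step, inclusive) for lo, hi, step in params)
    return min(total_experiments, grid_size)
```

For example, a single parameter with min=0.0, max=1.0 and step=0.1 gives a grid of 10 under this formula, so `experiment_cap(100, [(0.0, 1.0, 0.1)])` is capped at 10.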

allegroai / clearml