allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.67k stars, 653 forks

HPO "Trial X failed because of the following error: The value None could not be cast to float." #873

Open vitalyk-multinarity opened 1 year ago

vitalyk-multinarity commented 1 year ago

Describe the bug

HPO application launches many experiments instead of 10.

To reproduce

Expected behaviour

Log - bfc4cf8c405c4a49b55aeb8cff0b918f.txt

The HPO app doesn't stop after 10 experiments.

Screenshot 2023-01-08 at 16 02 07
Screenshot 2023-01-08 at 16 01 45

Environment

https://clearml.slack.com/archives/CTK20V944/p1673177405220439

jkhenning commented 1 year ago

Hi @vitalyk-multinarity,

Can you try with clearml v1.9.1rc0?

vitalyk-multinarity commented 1 year ago

Hi @jkhenning, as far as I can see, the behavior is the same (I upgraded clearml to 1.9.1rc0, restarted clearml-agent, cloned my HPO task and ran it).

But:
1) in my ClearML task I see "clearml 1.9.0"
2) in the 'hpo-app' INSTALLED PACKAGES I see "clearml==1.3.1"

I'm not sure why.

/home/ubuntu/.cache/pypoetry/virtualenvs/keyboard-tracking-l9nr52Rk-py3.9/bin/python -m pip list|grep clearml
clearml                 1.9.1rc0
clearml-agent           1.5.1

And this is the same venv I'm using:

export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=true
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/home/ubuntu/.cache/pypoetry/virtualenvs/keyboard-tracking-l9nr52Rk-py3.9/bin/python  clearml-agent daemon --queue default --detached  
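
One way to narrow down a version mismatch like this is to print the interpreter path and package versions from inside the code that the agent actually runs, rather than from a shell. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and is not part of ClearML.

```python
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing.

    Versions are resolved for the interpreter executing this code, which may
    differ from the interpreter used when running `pip list` in a shell.
    """
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Show which Python binary is running, then the versions it can see.
    print("interpreter:", sys.executable)
    for name, version in installed_versions(["clearml", "clearml-agent"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this at the start of the task script and comparing the output with the `pip list` result above would show whether the task and the shell resolve the same environment.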

In general, as far as I understand, my issue is related to how the HPO orchestrator interprets the sub-task results. So I'm attaching more details:

1) reporting values

clearml_logger = Logger.current_logger()
clearml_logger.report_single_value(name="mean_angular_distance[deg]", value=round(loc_kpi_results.get("mean_angular_distance[deg]"), 3))
clearml_logger.report_single_value(name="mean_translation_rms_mm", value=round(loc_kpi_results.get("mean_translation_rms_mm"), 3))
clearml_logger.report_single_value(name="predicted_ratio", value=round(loc_kpi_results.get("predicted_ratio"), 3))

2) screenshot of sub-task, including 'mean_translation_rms_mm' value used for optimization

Screenshot 2023-01-09 at 9 56 17

vitalyk-multinarity commented 1 year ago

@jkhenning, I fixed this error by using 'report_scalar' instead of 'report_single_value', so I guess we can close this issue.

Thank you, Vitaly
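
The workaround described above can be sketched roughly as follows. This is an illustration, not the reporter's actual code: the helper `report_kpis` and the stand-in `RecordingLogger` are hypothetical, and using the KPI name as both title and series is an assumption. The only ClearML call relied on is `Logger.report_scalar(title, series, value, iteration)`. The thread does not establish why `report_single_value` led to the error, so the sketch only shows the replacement call and skips missing values instead of passing `None` along.

```python
from typing import Any, List, Mapping, Optional, Sequence


class RecordingLogger:
    """Hypothetical stand-in with the same report_scalar signature as
    clearml.Logger, so the sketch runs without a ClearML server."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def report_scalar(self, title: str, series: str, value: float, iteration: int) -> None:
        self.calls.append((title, series, value, iteration))


def report_kpis(
    logger: Any,
    kpis: Mapping[str, Optional[float]],
    names: Sequence[str],
    iteration: int = 0,
    ndigits: int = 3,
) -> List[str]:
    """Report each KPI as a scalar and return the names actually reported.

    KPIs that are absent or None are skipped, since round(None, 3) would
    raise a TypeError before anything is reported.
    """
    reported: List[str] = []
    for name in names:
        value = kpis.get(name)
        if value is None:
            continue
        # Assumption: title and series both use the KPI name.
        logger.report_scalar(
            title=name,
            series=name,
            value=round(float(value), ndigits),
            iteration=iteration,
        )
        reported.append(name)
    return reported


if __name__ == "__main__":
    demo_logger = RecordingLogger()
    demo_kpis = {"mean_translation_rms_mm": 1.23456, "predicted_ratio": None}
    print(report_kpis(demo_logger, demo_kpis, list(demo_kpis)))
    print(demo_logger.calls)
```

In a real task, `Logger.current_logger()` from `clearml` would be passed in place of `RecordingLogger()`, and the optimizer's objective metric title and series would need to match whatever title and series are reported here.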

jkhenning commented 1 year ago

Thanks @vitalyk-multinarity. I think we'll take a look at this regardless to understand why this error shows in that scenario.

eugen-ajechiloae-clearml commented 1 year ago

Hi @vitalyk-multinarity! Regarding the HPO app not stopping after 10 experiments: this is likely because you didn't set Configuration -> Limit Total HPO Experiments to 10 (please correct me if I'm wrong).

Regardless, we might want to limit the number of experiments to the minimum of total_experiments and product((max_val - min_val) / step for each parameter to optimize).
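
The proposed cap can be sketched as a small standalone function. This is not ClearML code: the name `max_distinct_experiments` and the `(min_val, max_val, step)` tuple layout are assumptions made for illustration. It follows the formula as written in the comment, so it does not add one for an inclusive upper endpoint; whether that is the right count depends on how each parameter range treats its maximum value.

```python
import math
from typing import Iterable, Tuple

ParamRange = Tuple[float, float, float]  # (min_val, max_val, step), hypothetical layout


def max_distinct_experiments(ranges: Iterable[ParamRange], total_experiments: int) -> int:
    """Return min(total_experiments, product of (max_val - min_val) / step).

    Each per-parameter count is floored to an integer and clamped to at
    least 1, so a degenerate range does not zero out the whole product.
    """
    combinations = 1
    for min_val, max_val, step in ranges:
        if step <= 0:
            raise ValueError("step must be positive")
        if max_val < min_val:
            raise ValueError("max_val must not be smaller than min_val")
        # Small epsilon guards against float error such as 0.3 / 0.1 = 2.999...
        steps = math.floor((max_val - min_val) / step + 1e-9)
        combinations *= max(steps, 1)
    return min(total_experiments, combinations)


if __name__ == "__main__":
    search_space = [(0, 10, 2), (1, 4, 1)]  # 5 * 3 = 15 combinations
    print(max_distinct_experiments(search_space, total_experiments=10))
    print(max_distinct_experiments(search_space, total_experiments=100))
```

With a cap like this, a search space smaller than the requested budget would stop once every combination has been tried, rather than continuing to launch experiments.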