automl / Auto-PyTorch

Automatic architecture search and hyperparameter optimization for PyTorch
Apache License 2.0

log-dir/results.json has some exceptions #61

Closed: maxmarketit closed this issue 3 years ago

maxmarketit commented 4 years ago

Here is part of my log file results.json.

[[48, 0, 0], 1200.0, {"submitted": 1592881287.3019679, "started": 1592881287.3047717, "finished": 1592882478.390403}, {"loss": 0.4213403548487583, "info": {"loss": 0.2595444439162671, "model_parameters": 496799.0, "train_metric_ql": 0.25954445002785426, "lr_scheduler_converged": 0.0, "lr": 0.004473769175123476, "val_metric_ql": 0.4213403548487583}}, null]
[[48, 0, 1], 1200.0, {"submitted": 1592882478.4058812, "started": 1592882478.4085217, "finished": 1592882508.2846193}, null, "Traceback (most recent call last):\n  File \"/home/ubuntu/anaconda3/envs/autopytorch/lib/python3.6/site-packages/hpbandster/core/worker.py\", line 206, in start_computation\n    result = {'result': self.compute(*args, config_id=id, **kwargs),\n  File \"/home/ubuntu/anaconda3/envs/autopytorch/lib/python3.6/site-packages/autoPyTorch-0.0.2-py3.6.egg/autoPyTorch/core/worker.py\", line 87, in compute\n    raise Exception(\"Exception in train pipeline. Took \" + str((time.time()-start_time)) + \" seconds with budget \" + str(budget))\nException: Exception in train pipeline. Took 29.862977743148804 seconds with budget 1200.0\n"]
[[48, 0, 2], 1200.0, {"submitted": 1592882508.302436, "started": 1592882508.3056898, "finished": 1592882515.6628084}, null, "Traceback (most recent call last):\n  File \"/home/ubuntu/anaconda3/envs/autopytorch/lib/python3.6/site-packages/hpbandster/core/worker.py\", line 206, in start_computation\n    result = {'result': self.compute(*args, config_id=id, **kwargs),\n  File \"/home/ubuntu/anaconda3/envs/autopytorch/lib/python3.6/site-packages/autoPyTorch-0.0.2-py3.6.egg/autoPyTorch/core/worker.py\", line 87, in compute\n    raise Exception(\"Exception in train pipeline. Took \" + str((time.time()-start_time)) + \" seconds with budget \" + str(budget))\nException: Exception in train pipeline. Took 7.344435214996338 seconds with budget 1200.0\n"]

Judging from where the null and the exception appear in each entry, I think the result (the loss) was null, so there may have been a problem computing "info": {"loss": , "model_parameters": , "train_metric_ql": , "lr_scheduler_converged": , "val_metric_ql": }}. If that is indeed the only problem (for example, the loss becoming NaN), I think it would be better to log what actually happened instead of logging several generic exceptions.
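To make the layout of the entries above easier to inspect, here is a minimal sketch that splits such records into successful and failed runs. It assumes each line is a JSON array of the form [config_id, budget, timestamps, result, exception], as in the excerpt. The sample records are shortened stand-ins for the real entries, and the function name `summarize_runs` is hypothetical, not part of Auto-PyTorch or HpBandSter.

```python
import json

# Shortened stand-ins for the entries above (assumed layout):
# [config_id, budget, timestamps, result, exception]
SAMPLE_LINES = [
    '[[48, 0, 0], 1200.0, {"submitted": 1.0, "started": 2.0, "finished": 3.0}, '
    '{"loss": 0.42, "info": {"loss": 0.26}}, null]',
    '[[48, 0, 1], 1200.0, {"submitted": 3.0, "started": 4.0, "finished": 5.0}, '
    'null, "Traceback (most recent call last):\\n  ...\\n'
    'Exception: Exception in train pipeline.\\n"]',
]


def summarize_runs(lines):
    """Split records into (succeeded, failed) lists. Hypothetical helper."""
    succeeded, failed = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        config_id, budget, _timestamps, result, exception = json.loads(line)
        if result is None:
            # The last line of the traceback string carries the message.
            message = exception.strip().splitlines()[-1] if exception else "unknown"
            failed.append((tuple(config_id), budget, message))
        else:
            succeeded.append((tuple(config_id), budget, result["loss"]))
    return succeeded, failed


succeeded, failed = summarize_runs(SAMPLE_LINES)
print(f"{len(succeeded)} succeeded, {len(failed)} failed")
for config_id, budget, message in failed:
    print(config_id, budget, message)
```

Since the function only iterates over lines, an open file handle for results.json can be passed in place of `SAMPLE_LINES`. Note that the stored traceback only contains the generic wrapper exception, so this shows which configurations failed but not why.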

LMZimmer commented 4 years ago

I think you are right. Is this occurring with tabular data? If so, this line will be changed.