Hi Andrew,
I am trying to run the multisession example and get some warnings initially, followed by a ValueError. Can you please recommend any debugging steps? I am not familiar with Ray Tune for training.
I get these warnings first and am not sure whether they indicate that anything is broken:
2024-01-26 11:26:57,399 INFO worker.py:1528 -- Started a local Ray instance.
C:\Users\anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\function_trainable.py:609: DeprecationWarning: checkpoint_dir in func(config, checkpoint_dir) is being deprecated. To save and load checkpoint in trainable functions, please use the ray.air.session API:
from ray.air import session
def train(config):
    ...
For more information please see https://docs.ray.io/en/master/tune/api_docs/trainable.html
  warnings.warn(
2024-01-26 11:26:59,210 WARNING trial_runner.py:1604 -- You are trying to access _search_alg interface of TrialRunner in TrialScheduler, which is being restricted. If you believe it is reasonable for your scheduler to access this TrialRunner API, please reach out to Ray team on GitHub. A more strict API access pattern would be enforced starting 1.12s.0
The following ValueError then terminates the script. If I use any other metric that is present in Result (for example, timestamp), the run proceeds past this step but eventually fails because tuning also depends on the 'cur_epoch' metric:
ValueError: Trial returned a result which did not include the specified metric(s) valid/recon_smth that tune.TuneConfig() expects. Make sure your calls to tune.report() include the metric, or set the TUNE_DISABLE_STRICT_METRIC_CHECKING environment variable to 1. Result: {'trial_id': 'dfd00_00000', 'experiment_id': 'e5eb8f5c73b546ee9bef65bb16997574', 'date': '2024-01-26_11-27-03', 'timestamp': 1706297223, 'pid': 95292, 'hostname': 'DESKTOP', 'node_ip': '127.0.0.1', 'done': True, 'config/datamodule': 'BMI_multisession_PCR', 'config/model': 'BMI_multisession_PCR', 'config/logger.wandb_logger.project': 'BMI', 'config/logger.wandb_logger.tags.0': 'BMI_multisession_PCR', 'config/logger.wandb_logger.tags.1': 'version_240126112654', 'config/model.lr_init': 0.001, 'config/model.dropout_rate': 0.3511779084499725, 'config/model.train_aug_stack.transforms.0.cd_rate': 0.5, 'config/model.kl_co_scale': 0.0001115416382089259, 'config/model.kl_ic_scale': 0.00010476283727212514, 'config/model.l2_gen_scale': 0.5024837234461056, 'config/model.l2_con_scale': 0.1221168826037272}
This is a conda setup on a Windows system, so I had to change some config file paths from relative to absolute.
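For reference, the metric check that the error describes can be reproduced by hand on the Result dict. This is only a minimal sketch: the helper name missing_metrics is hypothetical and not part of Ray, the result dict is an abbreviated copy of the one above, and the environment variable is the one named in the error message itself.

```python
import os

# Metrics the tuning setup depends on, taken from the error message and
# from the later failure on 'cur_epoch'.
EXPECTED_METRICS = ["valid/recon_smth", "cur_epoch"]


def missing_metrics(result, expected):
    """Hypothetical helper: list the expected metric names absent from a result."""
    return [name for name in expected if name not in result]


# Abbreviated copy of the Result dict printed in the ValueError.
result = {
    "trial_id": "dfd00_00000",
    "timestamp": 1706297223,
    "done": True,
    "config/model.lr_init": 0.001,
}

missing = missing_metrics(result, EXPECTED_METRICS)
print(missing)

# Workaround suggested by the error message. It only silences the strict
# check and does not make the metric appear, so it is presumably useful
# only for seeing where the run fails next.
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"
```

Both metrics come back as missing, and the result holds only bookkeeping keys and config values with 'done': True. That may mean the trial ended before reporting a single epoch, so the underlying failure could be in the trial itself rather than in the metric name.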