rohitrangwani commented 9 months ago

Hi Andrew,

I'm trying to run the multisession example. I get some warnings at startup and then a ValueError. Can you recommend any debugging steps? I am not familiar with Ray Tune for training.

These warnings appear first. I'm not sure whether they mean anything is broken:

2024-01-26 11:26:57,399 INFO worker.py:1528 -- Started a local Ray instance.

C:\Users\anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\function_trainable.py:609: DeprecationWarning: checkpoint_dir in func(config, checkpoint_dir) is being deprecated. To save and load checkpoint in trainable functions, please use the ray.air.session API:

from ray.air import session

def train(config):
    ...
    session.report({"metric": metric}, checkpoint=checkpoint)

For more information please see https://docs.ray.io/en/master/tune/api_docs/trainable.html

  warnings.warn(

2024-01-26 11:26:59,210 WARNING trial_runner.py:1604 -- You are trying to access _search_alg interface of TrialRunner in TrialScheduler, which is being restricted. If you believe it is reasonable for your scheduler to access this TrialRunner API, please reach out to Ray team on GitHub. A more strict API access pattern would be enforced starting 1.12s.0

This ValueError then terminates the script. If I use any other metric that is present in the Result (e.g. timestamp), the run gets past this step but eventually fails because of another dependency on the 'cur_epoch' metric for tuning:

ValueError: Trial returned a result which did not include the specified metric(s) valid/recon_smth that tune.TuneConfig() expects. Make sure your calls to tune.report() include the metric, or set the TUNE_DISABLE_STRICT_METRIC_CHECKING environment variable to 1. Result: {'trial_id': 'dfd00_00000', 'experiment_id': 'e5eb8f5c73b546ee9bef65bb16997574', 'date': '2024-01-26_11-27-03', 'timestamp': 1706297223, 'pid': 95292, 'hostname': 'DESKTOP', 'node_ip': '127.0.0.1', 'done': True, 'config/datamodule': 'BMI_multisession_PCR', 'config/model': 'BMI_multisession_PCR', 'config/logger.wandb_logger.project': 'BMI', 'config/logger.wandb_logger.tags.0': 'BMI_multisession_PCR', 'config/logger.wandb_logger.tags.1': 'version_240126112654', 'config/model.lr_init': 0.001, 'config/model.dropout_rate': 0.3511779084499725, 'config/model.train_aug_stack.transforms.0.cd_rate': 0.5, 'config/model.kl_co_scale': 0.0001115416382089259, 'config/model.kl_ic_scale': 0.00010476283727212514, 'config/model.l2_gen_scale': 0.5024837234461056, 'config/model.l2_con_scale': 0.1221168826037272}
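One way to narrow this down is to look at which keys the failing result actually carries. The sketch below is plain Python, not Ray code: `missing_metrics` is a hypothetical helper, and `result` is a shortened copy of the dict printed in the error above. The result has `'done': True` and no training metrics at all, which would be consistent with the trial ending before it reported anything, although the log does not confirm that.

```python
import os


def missing_metrics(result, expected):
    """Return the expected metric names that are absent from a result dict.

    Hypothetical helper for inspection only; it is not part of Ray Tune.
    """
    return [name for name in expected if name not in result]


# Shortened copy of the Result printed in the ValueError above.
result = {
    "trial_id": "dfd00_00000",
    "timestamp": 1706297223,
    "done": True,
    "config/model.lr_init": 0.001,
    "config/model.dropout_rate": 0.3511779084499725,
}

# Both metrics mentioned in this comment are absent from the result.
missing = missing_metrics(result, ["valid/recon_smth", "cur_epoch"])
print("missing metrics:", missing)

# Everything that is not a config entry; here that is only bookkeeping
# (trial_id, timestamp, done), with no training or validation metric.
non_config_keys = sorted(k for k in result if not k.startswith("config/"))
print("non-config keys:", non_config_keys)

# The error message names this variable. It must be set before Tune starts,
# and it only skips the check; it does not make the metric appear.
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"
```

If the non-config keys contain only bookkeeping fields, as they do here, disabling the strict check is unlikely to help on its own, and the more useful question is why the trial finished without reporting.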

This is a conda setup on a Windows system, so I had to change some relative config file paths to absolute paths.
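For reference, a minimal sketch of turning a relative config path into an absolute one with `pathlib`. The directory layout and file name are hypothetical, since the comment does not say which paths were changed, and `to_absolute` is an illustrative helper, not something from lfads-torch.

```python
from pathlib import Path


def to_absolute(relative_path, base_dir):
    """Resolve relative_path against base_dir and return an absolute Path."""
    return (Path(base_dir) / relative_path).resolve()


# Hypothetical layout: a script in <cwd>/scripts referring to ../configs/multi.yaml
base_dir = Path.cwd() / "scripts"
config_path = to_absolute("../configs/multi.yaml", base_dir)

# The ".." segment is collapsed and the result no longer depends on the
# working directory of whichever process reads it later.
print(config_path)
print(config_path.is_absolute())
```

Resolving against an explicit base directory, instead of relying on the current working directory, is one way to keep paths stable when the code that opens the file runs in a different process from the one that launched the script.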

arsedler9 commented 8 months ago

Hey @rohitrangwani, can you first try setting up a single run script using your configs and see if that runs without errors?

rohitrangwani commented 8 months ago

The single run script runs without any errors.

It is the multisession run that I am trying to get working.

arsedler9 / lfads-torch

Ray tune warnings and 'metric' not reported in result (ValueError) #15
