An error occurred while saving the training model

fanxuxiang commented 1 year ago

Hi! Thanks for sharing your excellent work. I am very interested in it. @Etienne-Meunier

But when executing the training command (python3 model_train.py --path_save_model train_me --base_dir /home/fxx/data/DAVIS-data --data_file DataSplit_me/DAVISD16Split), an error occurred. It seems to be a hyperparameter storage error raised while saving the model. I have tried several suggested fixes, such as those in https://github.com/pytorch/pytorch/issues/78720 and https://github.com/Lightning-AI/lightning/issues/9318, but none of them solved it. I am using the default DAVIS dataset and have not changed the body of the code. Has anyone encountered this problem, or does anyone know how to solve it?

Environment
PyTorch Lightning version: 1.5.10
PyTorch version: 1.8.0+cu111
Python version: 3.6.13
OS: Linux
CUDA/cuDNN version: 11.1
GPU models and configuration: GTX 3060
How you installed PyTorch (conda, pip, source): pip
torchmetrics version: 0.8.2

Additional Context
Traceback (most recent call last):
  File "model_train.py", line 59, in <module>
    trainer.fit(model, dm)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1188, in _run
    self._pre_dispatch()
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1224, in _pre_dispatch
    self._log_hyperparams()
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1261, in _log_hyperparams
    self.logger.save()
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 50, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/loggers/csv_logs.py", line 211, in save
    self.experiment.save()
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/loggers/csv_logs.py", line 87, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/pytorch_lightning/core/saving.py", line 389, in save_hparams_to_yaml
    yaml.dump(v)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/__init__.py", line 253, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/__init__.py", line 241, in dump_all
    dumper.represent(data)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 27, in represent
    node = self.represent_data(data)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 199, in represent_list
    return self.represent_sequence('tag:yaml.org,2002:seq', data)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 92, in represent_sequence
    node_item = self.represent_data(item)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 343, in represent_object
    'tag:yaml.org,2002:python/object:'+function_name, state)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 346, in represent_object
    return self.represent_sequence(tag+function_name, args)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 92, in represent_sequence
    node_item = self.represent_data(item)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 343, in represent_object
    'tag:yaml.org,2002:python/object:'+function_name, state)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/fxx/anaconda3/envs/flownet2/lib/python3.6/site-packages/yaml/representer.py", line 330, in represent_object
    dictitems = dict(dictitems)
ValueError: dictionary update sequence element #0 has length 1; 2 is required
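
The traceback fails inside yaml.dump while representing a list that contains a Python object (represent_list followed by represent_object), which suggests that one of the hyperparameters holds a value that is not a plain number, string, list, or dict. One possible workaround, sketched here under assumptions and not taken from the repository, is to convert such values to plain types before they are logged. The helper names to_yaml_safe and sanitize_hparams are hypothetical, as is the Dummy class; where the cleaned dictionary is handed to the logger depends on the training script and is not shown.

```python
from argparse import Namespace

# Types that a YAML dumper can always write as plain scalars.
PRIMITIVES = (str, int, float, bool, type(None))


def to_yaml_safe(value):
    """Recursively convert a value to plain types (hypothetical helper)."""
    if isinstance(value, PRIMITIVES):
        return value
    if isinstance(value, (list, tuple)):
        return [to_yaml_safe(v) for v in value]
    if isinstance(value, dict):
        return {str(k): to_yaml_safe(v) for k, v in value.items()}
    # Anything else (transforms, modules, callables, ...) becomes its repr string.
    return repr(value)


def sanitize_hparams(hparams):
    """Return a plain dict of hyperparameters that is safe to serialize."""
    items = vars(hparams) if isinstance(hparams, Namespace) else dict(hparams)
    return {key: to_yaml_safe(val) for key, val in items.items()}


class Dummy:
    """Stand-in for an arbitrary object stored among the hyperparameters."""

    def __repr__(self):
        return "Dummy()"


hparams = Namespace(lr=1e-3, sizes=(128, 224), transform=[Dummy()])
clean = sanitize_hparams(hparams)
print(clean)
```

The object is kept as a readable string instead of being dropped, so the logged file still records which value was used.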

Etienne-Meunier-Inria commented 1 year ago

Hi! Thank you for your message. This looks like an error related to PyTorch Lightning. Can you try running the code with the package versions given in the "Environment" section of the README?

pytorch_lightning==1.2.8
pandas==0.24.1
flowiz
wandb==0.10.26
ipdb==0.13.5
torch==1.8.1
torchvision==0.9.1
seaborn

fanxuxiang commented 1 year ago

Yes, thank you. The error occurs while saving the training logs, and skipping the saving of some parameters temporarily avoids it. Also, can the algorithm be accelerated with a GPU during inference? It takes seconds for me to test on a single optical flow of 1960 * 1020.

Etienne-Meunier-Inria commented 1 year ago

Happy you managed to deal with the error. At inference, segmentation is just the forward pass of the backbone (in our case a classical U-Net); you don't need to compute the loss or the motion models. Thus, you can use GPU acceleration as you usually do with PyTorch models. To accelerate inference further, you can either reduce the input size or train a lighter backbone model.
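
As a rough illustration of the two suggestions above (GPU forward pass and reduced input size), here is a minimal sketch. The small convolutional network is a hypothetical stand-in for the trained backbone, not the repository's U-Net, and the channel counts, the segment function, and the 0.5 scale are assumptions chosen for the example. Whether the flow values themselves should be rescaled when the resolution changes depends on how the model was trained and is not handled here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for the trained backbone (NOT the repository's U-Net):
# 2 optical-flow channels in, 4 mask channels out.
backbone = nn.Sequential(
    nn.Conv2d(2, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 4, kernel_size=3, padding=1),
)

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
backbone = backbone.to(device).eval()


def segment(flow, scale=0.5):
    """Segment a flow batch of shape (B, 2, H, W) at reduced resolution.

    Returns per-pixel mask indices of shape (B, H, W) on the CPU.
    """
    flow = flow.to(device)
    height, width = flow.shape[-2:]
    with torch.no_grad():  # inference only: no loss, no gradients
        small = F.interpolate(flow, scale_factor=scale, mode="bilinear",
                              align_corners=False)
        logits = backbone(small)
        # Upsample the predictions back to the original resolution.
        logits = F.interpolate(logits, size=(height, width), mode="bilinear",
                               align_corners=False)
    return logits.argmax(dim=1).cpu()


flow = torch.randn(1, 2, 1020, 1960)  # one flow field, height 1020, width 1960
masks = segment(flow)
print(masks.shape)
```

Halving each spatial dimension reduces the number of pixels the backbone processes by a factor of four, at the cost of coarser mask boundaries.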

Etienne-Meunier-Inria / EM-Flow-Segmentation
