Yashaswini-Srirangarajan opened this issue 9 months ago
Hi @Yashaswini-Srirangarajan, I've noticed a lot of people have encountered this issue, including myself. The only fix was to change the 'test' split to 'val' in the config files. Check this for more details: https://github.com/OpenMotionLab/MotionGPT/issues/22#issuecomment-1872937170
However, this is a strange error: even after manually checking the data for non-finite values, and even after using a different dataset, it keeps resurfacing.
Asking @billl-jiang for any support with this issue and debugging. Cheers.
UPDATE:
- Fixed this problem by checking all the .npy files for NaN values and other anomalies against their corresponding names in the .txt split files (train, val and test); a sketch of this check is shown after this list.
- Once you've found the faulty files, remove them from texts, new_joints and new_joint_vecs, and also remove their names from the .txt files.
- In the end, all your files and the names in the split lists should point to the same number of samples.
- Finally, and most importantly, delete the 'tmp' folder created during the training runs every time you alter the data.
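A minimal sketch of that check, assuming the usual HumanML3D layout (new_joints/, new_joint_vecs/ plus train/val/test .txt split lists); the data root below is only a placeholder:

```python
# Scan every motion file referenced by the split lists for non-finite values.
import os
import numpy as np

DATA_ROOT = "datasets/humanml3d"               # hypothetical root; change to your path
MOTION_DIRS = ["new_joints", "new_joint_vecs"]
SPLIT_FILES = ["train.txt", "val.txt", "test.txt"]

faulty = set()
for split in SPLIT_FILES:
    with open(os.path.join(DATA_ROOT, split)) as f:
        names = [line.strip() for line in f if line.strip()]
    for name in names:
        for motion_dir in MOTION_DIRS:
            path = os.path.join(DATA_ROOT, motion_dir, name + ".npy")
            if not os.path.exists(path):
                faulty.add(name)               # a missing file counts as faulty too
                continue
            data = np.load(path)
            if not np.isfinite(data).all():
                faulty.add(name)

print(f"found {len(faulty)} faulty samples: {sorted(faulty)}")
```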
Tried this approach as well, but I seem to be getting some other error, shown below. Had you faced this before? Thanks!
Trainable params: 267 M
Non-trainable params: 65.1 M
Total params: 332 M
Total estimated model params size (MB): 1.3 K
Sanity Checking ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 0:00:02 • 0:00:00 1.64it/s 2024-01-30 16:40:28,994 Sanity checking ok.
/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
2024-01-30 16:40:29,481 Training started
Epoch 0/999998 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 • 0:00:00 0.00it/s
Traceback (most recent call last):
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 94, in <module>
main()
File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 85, in main
trainer.fit(model, datamodule=datamodule)
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
self.fit_loop.run()
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
self.advance()
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 137, in run
self.on_advance_end(data_fetcher)
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 285, in on_advance_end
self.val_loop.run()
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 141, in run
return self.on_run_end()
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 253, in on_run_end
self._on_evaluation_epoch_end()
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 329, in _on_evaluation_epoch_end
call._call_lightning_module_hook(trainer, hook_name)
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/base.py", line 54, in on_validation_epoch_end
dico.update(self.metrics_log_dict())
File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/base.py", line 114, in metrics_log_dict
metrics_dict = getattr(
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torchmetrics/metric.py", line 610, in wrapped_func
value = _squeeze_if_scalar(compute(*args, **kwargs))
File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/t2m.py", line 195, in compute
metrics["FID"] = calculate_frechet_distance_np(gt_mu, gt_cov, mu, cov)
File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/utils.py", line 205, in calculate_frechet_distance_np
raise ValueError("Imaginary component {}".format(m))
ValueError: Imaginary component 1.836488313288817e+26
@Yashaswini-Srirangarajan I hit the same issue and using scipy==1.11.1 solved my problem, although I'm not sure which version is mathematically more correct. See: https://github.com/scipy/scipy/issues/19415 https://github.com/mseitzer/pytorch-fid/issues/103
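For context on where the imaginary component comes from: the FID computation takes a matrix square root of the product of the two covariance matrices, and scipy.linalg.sqrtm can return a complex result whose imaginary part differs across scipy versions. A simplified sketch following the common pytorch-fid-style numerics (not necessarily the exact code in mGPT/metrics/utils.py):

```python
# Simplified Frechet-distance numerics in the style of the pytorch-fid
# reference implementation; shown only to illustrate where the complex
# values (and the ValueError above) can appear.
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    diff = mu1 - mu2
    # sqrtm of a product of covariance matrices may come back complex;
    # scipy versions differ in how large that imaginary part is.
    covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if not np.isfinite(covmean).all():
        offset = np.eye(sigma1.shape[0]) * eps
        covmean = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
    if np.iscomplexobj(covmean):
        if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
            m = np.max(np.abs(covmean.imag))
            # This is the kind of check that raises "Imaginary component ..." above.
            raise ValueError(f"Imaginary component {m}")
        covmean = covmean.real
    return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean)
```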
@zybermonk Thanks for the inputs. How did you debug for NaNs? It looks like none of my files in new_joint_vecs and new_joints have NaNs. Am I missing any step when generating the HumanML3D dataset? Thanks a lot!
Hi @Yashaswini-Srirangarajan, sorry for the late response. When you build HumanML3D, by default there will be a few files that contain faulty data. You can first notice this during the data building process itself, for example, while using the 3rd notebook of HumanML3D you can see the following output -
Evidently, the .npy files with suffixes 7975 contained NaN data when verified using np.isfinite() or similar.
Following this method, you need to verify all your .npy files in new_joints and new_joint_vecs, corresponding to the file names in the train, test and val .txt files.
You will find the following files also have faulty data, as encountered previously after using the 2nd notebook from HumanML3D.
The next step would be to delete these files from the .npy folders and also remove their names from the .txt files.
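A rough sketch of that cleanup, again assuming the standard texts/new_joints/new_joint_vecs folders; the faulty ID shown is only an example placeholder, use the ones flagged by your own check:

```python
# Delete the faulty samples everywhere they are referenced, so the .npy folders
# and the split .txt files end up describing the same set of sample IDs.
import os

DATA_ROOT = "datasets/humanml3d"   # hypothetical root; change to your path
faulty = {"007975"}                # example ID; use the ones flagged by your own check

for folder, ext in [("texts", ".txt"), ("new_joints", ".npy"), ("new_joint_vecs", ".npy")]:
    for name in faulty:
        path = os.path.join(DATA_ROOT, folder, name + ext)
        if os.path.exists(path):
            os.remove(path)

for split in ["train.txt", "val.txt", "test.txt"]:
    split_path = os.path.join(DATA_ROOT, split)
    with open(split_path) as f:
        names = [line.strip() for line in f if line.strip()]
    kept = [n for n in names if n not in faulty]
    with open(split_path, "w") as f:
        f.write("\n".join(kept) + "\n")

# Remember to also delete the 'tmp' cache folder before re-running training.
```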
I hit the same issue and using scipy==1.11.1 solved my problem, although I'm not sure which version is mathematically more correct
If anyone has any input on which version is more mathematically correct, that would be great.
Just adding to this question: changing these libraries also indirectly requires finding the right numpy version.
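If it helps, a quick way to confirm which combination is actually installed and that sqrtm behaves; the versions named in this thread (scipy 1.11.1 as a workaround, 1.13.0 once released) are only what people reported, not an official requirement:

```python
# Print the installed numpy/scipy versions and run sqrtm on a small
# symmetric positive-definite matrix as a basic smoke test.
import numpy as np
import scipy
from scipy.linalg import sqrtm

print("numpy:", np.__version__)
print("scipy:", scipy.__version__)

a = np.array([[2.0, 0.5], [0.5, 1.0]])
root = sqrtm(a)
print("sqrtm finite:", np.isfinite(root).all())
print("reconstruction error:", np.abs(root @ root - a).max())
```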
At least a partial fix has come through at https://github.com/scipy/scipy/pull/20212. We recommend trying again once SciPy 1.13.0 is released, to see whether the problems are gone.
@lucascolley, this fix now works for me :) thanks!!
fantastic - 1.13.0 should be out within the next few weeks
It was just released.
Running python -m train --cfg configs/config_h3d_stage1.yaml --nodebug after setting up the dataset proceeds through training for 9 epochs and then runs into the error below.