[Training] Stops with an error : The algorithm failed to converge because the input matrix contained non-finite values.

Yashaswini-Srirangarajan commented 9 months ago

Running python -m train --cfg configs/config_h3d_stage1.yaml --nodebug after setting up the dataset trains for 9 epochs and then fails during validation with the error below.

Loading HumanML3D train ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.7/3.7 GB 0:00:00
Pointer Pointing at 0
Loading HumanML3D test ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 702.9/702.9 MB 0:00:00
Pointer Pointing at 0
┏━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name    ┃ Type        ┃ Params ┃
┡━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ metrics │ BaseMetrics │ 65.1 M │
│ 1 │ vae     │ VQVae       │ 19.4 M │
│ 2 │ lm      │ MLM         │  248 M │
│ 3 │ _losses │ ModuleDict  │      0 │
└───┴─────────┴─────────────┴────────┘
Trainable params: 267 M
Non-trainable params: 65.1 M
Total params: 332 M
Total estimated model params size (MB): 1.3 K
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2024-01-07 16:06:24,603 Sanity checking ok.
Epoch 9/999998 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 0:00:12 • 0:00:00 7.37it/s
2024-01-07 16:06:25,531 Training started
2024-01-07 16:06:37,118 Epoch 0:
2024-01-07 16:06:50,415 Epoch 1: loss_total 8.192e-01
2024-01-07 16:07:02,251 Epoch 2: loss_total 6.196e-01
2024-01-07 16:07:14,052 Epoch 3: loss_total 5.443e-01
2024-01-07 16:07:25,955 Epoch 4: loss_total 4.995e-01
2024-01-07 16:07:37,948 Epoch 5: loss_total 4.704e-01
2024-01-07 16:07:50,044 Epoch 6: loss_total 4.477e-01
2024-01-07 16:08:02,174 Epoch 7: loss_total 4.288e-01
2024-01-07 16:08:14,288 Epoch 8: loss_total 4.166e-01
Validation ━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/146 0:00:04 • 0:00:20 6.10it/s
Traceback (most recent call last):
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 94, in <module>
    main()
  File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 85, in main
    trainer.fit(model, datamodule=datamodule)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 137, in run
    self.on_advance_end(data_fetcher)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 285, in on_advance_end
    self.val_loop.run()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 134, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 391, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 403, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/base.py", line 28, in validation_step
    return self.allsplit_step("val", batch, batch_idx)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/mgpt.py", line 454, in allsplit_step
    metric).update(rs_set["joints_rst"],
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torchmetrics/metric.py", line 470, in wrapped_func
    raise err
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torchmetrics/metric.py", line 460, in wrapped_func
    update(*args, **kwargs)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/mr.py", line 96, in update
    self.PAMPJPE += torch.sum(calc_pampjpe(rst[i], ref[i]))
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/utils.py", line 397, in calc_pampjpe
    preds_tranformed, PA_transform = batch_compute_similarity_transform_torch(
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/utils.py", line 296, in batch_compute_similarity_transform_torch
    U, s, V = torch.svd(K)
torch._C._LinAlgError: linalg.svd: (Batch element 0): The algorithm failed to converge because the input matrix contained non-finite values.

How do we fix this?

zybermonk commented 8 months ago

Hi @Yashaswini-Srirangarajan, I noticed that a lot of people have run into this issue, including myself. The only fix I found was to change the 'test' split to 'val' in the config files. See this comment for more details: https://github.com/OpenMotionLab/MotionGPT/issues/22#issuecomment-1872937170

However, this is a strange error: even after manually checking the data for non-finite values, and even after switching to a different dataset, it keeps resurfacing.
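Since the traceback points at rs_set["joints_rst"] going into torch.svd, the non-finite values may come from what the model produces at validation time rather than from the files on disk. A minimal sketch of a guard that reports which batch elements are non-finite before a decomposition is attempted. It uses NumPy as a stand-in (torch.isfinite offers the same check for tensors), and nonfinite_batch_indices is a hypothetical helper, not part of MotionGPT:

```python
import numpy as np


def nonfinite_batch_indices(batch):
    """Return the indices of batch elements that contain NaN or +/-inf."""
    arr = np.asarray(batch, dtype=np.float64)
    # Reduce over every axis except the leading batch axis.
    reduce_axes = tuple(range(1, arr.ndim))
    bad = ~np.isfinite(arr).all(axis=reduce_axes)
    return np.flatnonzero(bad).tolist()


# Hypothetical batch of three 2x2 matrices; the second one contains a NaN.
K = np.ones((3, 2, 2))
K[1, 0, 0] = np.nan
print(nonfinite_batch_indices(K))  # [1]
```

Logging these indices together with the sample names of the batch would show whether the same files fail every time or whether the values only appear after some training.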

Asking @billl-jiang for any support with this issue and debugging. Cheers.

zybermonk commented 8 months ago

UPDATE:

Fixed this problem by checking all the .npy files for NaN values and other anomalies, matching them against their corresponding names in the .txt files (train, val and test).
Once you have found the faulty files, remove them from texts, new_joints and new_joint_vecs, and remove their names from the .txt files.
In the end, all your files and the names in the .txt files should point to the same number of samples.
Finally, and most importantly, delete the 'tmp' folder created during the training runs every time you alter the data.
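The check described above can be sketched roughly as follows. This assumes the layout named in this thread (new_joints and new_joint_vecs folders next to train.txt, val.txt and test.txt) and that each .txt file lists one sample id per line; the function names are hypothetical, and treating an empty array as an anomaly is an assumption:

```python
from pathlib import Path

import numpy as np


def find_bad_ids(root, folders=("new_joints", "new_joint_vecs")):
    """Return sorted sample ids whose .npy file is unreadable, empty, or non-finite."""
    bad = set()
    for folder in folders:
        for path in sorted(Path(root, folder).glob("*.npy")):
            try:
                data = np.load(path)
            except (OSError, ValueError):
                bad.add(path.stem)  # corrupt or truncated file
                continue
            if data.size == 0 or not np.isfinite(data).all():
                bad.add(path.stem)
    return sorted(bad)


def bad_ids_per_split(root, splits=("train", "val", "test")):
    """Map each split name to the bad sample ids that its .txt file lists."""
    bad = set(find_bad_ids(root))
    result = {}
    for split in splits:
        lines = Path(root, f"{split}.txt").read_text().splitlines()
        result[split] = [s.strip() for s in lines if s.strip() in bad]
    return result
```

Calling bad_ids_per_split on the dataset root gives, per split, the names that would need to be removed from that .txt file.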

Yashaswini-Srirangarajan commented 8 months ago

@zybermonk Thanks for the inputs. How did you check for NaNs? It looks like none of my files in new_joint_vecs and new_joints contain NaNs. Am I missing a step in generating the HumanML3D dataset? Thanks a lot!

Yashaswini-Srirangarajan commented 8 months ago

I tried this approach as well, but I now seem to be getting a different error, shown below. Have you faced this before? Thanks!


Trainable params: 267 M                                                         
Non-trainable params: 65.1 M                                                    
Total params: 332 M                                                             
Total estimated model params size (MB): 1.3 K                                   
Sanity Checking ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 0:00:02 • 0:00:00 1.64it/s
2024-01-30 16:40:28,994 Sanity checking ok.
/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
2024-01-30 16:40:29,481 Training started
Epoch 0/999998 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 • 0:00:00 0.00it/s 
Traceback (most recent call last):
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 94, in <module>
    main()
  File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 85, in main
    trainer.fit(model, datamodule=datamodule)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 137, in run
    self.on_advance_end(data_fetcher)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 285, in on_advance_end
    self.val_loop.run()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 329, in _on_evaluation_epoch_end
    call._call_lightning_module_hook(trainer, hook_name)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/base.py", line 54, in on_validation_epoch_end
    dico.update(self.metrics_log_dict())
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/base.py", line 114, in metrics_log_dict
    metrics_dict = getattr(
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torchmetrics/metric.py", line 610, in wrapped_func
    value = _squeeze_if_scalar(compute(*args, **kwargs))
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/t2m.py", line 195, in compute
    metrics["FID"] = calculate_frechet_distance_np(gt_mu, gt_cov, mu, cov)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/utils.py", line 205, in calculate_frechet_distance_np
    raise ValueError("Imaginary component {}".format(m))
ValueError: Imaginary component 1.836488313288817e+26

SuperIRabbit commented 7 months ago

@Yashaswini-Srirangarajan I hit the same issue and using scipy==1.11.1 solved my problem, although I'm not sure which version is mathematically more correct. See: https://github.com/scipy/scipy/issues/19415 https://github.com/mseitzer/pytorch-fid/issues/103
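For anyone who wants to cross-check an FID value independently of the SciPy version: the "Imaginary component" error is raised inside calculate_frechet_distance_np, and Fréchet distance implementations commonly obtain the trace term from a matrix square root such as scipy.linalg.sqrtm (whether this repo does exactly that is an assumption here). The sketch below is not the repository's code. It swaps in a different technique and computes the trace of the square root from the eigenvalues of the covariance product, using NumPy only; frechet_distance_eig is a hypothetical name:

```python
import numpy as np


def frechet_distance_eig(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians without a matrix square root."""
    mu1 = np.atleast_1d(mu1).astype(float)
    mu2 = np.atleast_1d(mu2).astype(float)
    cov1 = np.atleast_2d(cov1).astype(float)
    cov2 = np.atleast_2d(cov2).astype(float)
    if not all(np.isfinite(a).all() for a in (mu1, mu2, cov1, cov2)):
        raise ValueError("non-finite statistics; check the features fed to the metric")
    # For positive semi-definite covariances the eigenvalues of cov1 @ cov2 are
    # real and non-negative in exact arithmetic. Round-off can add tiny negative
    # or imaginary parts, so keep the real part and clip at zero.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)


# Two unit-covariance Gaussians whose means differ by 1 along one axis.
print(frechet_distance_eig([0.0, 0.0], np.eye(2), [1.0, 0.0], np.eye(2)))  # 1.0
```

Note that the second log above reports only one training batch, so the statistics may have been estimated from very few samples. A covariance estimated that way is rank-deficient, which is a plausible (but unconfirmed) reason for the square root to become ill-conditioned.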

zybermonk commented 7 months ago

> @zybermonk Thanks for the inputs. How did you check for NaNs? It looks like none of my files in new_joint_vecs and new_joints contain NaNs. Am I missing a step in generating the HumanML3D dataset? Thanks a lot!

Hi @Yashaswini-Srirangarajan, sorry for the late response. When you build HumanML3D, a few files will contain faulty data by default. You can first notice this during the data-building process itself; for example, the third HumanML3D notebook shows the following output:

Evidently, the .npy files with suffix 7975 contained NaN data when checked with np.isfinite() or similar. In the same way, you need to verify all the .npy files in new_joints and new_joint_vecs that correspond to the file names in the train, test and val .txt files.

You will find that the following files also contain faulty data, as encountered earlier when running the second HumanML3D notebook:

The next step is to delete these files from the .npy folders, and to delete their file names from the .txt files.

Most importantly, as I mentioned previously, make sure you delete the tmp folder before running your code on the newly edited dataset.
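The clean-up steps above could be scripted along these lines. This is a sketch under assumptions, not a tested tool for this repo: it assumes one sample id per line in the split files, that texts holds one .txt file per sample id, and that the caller knows where the tmp folder lives (its path is passed in as tmp_dir, because its location is not stated in this thread). prune_dataset is a hypothetical name, and dry_run=True only lists what would change:

```python
import shutil
from pathlib import Path


def prune_dataset(root, bad_ids, tmp_dir=None, dry_run=True):
    """Drop bad_ids from the split lists and data folders; optionally clear a cache dir."""
    root, bad = Path(root), set(bad_ids)
    actions = []
    # 1. Remove the bad names from train/val/test .txt files.
    for split in ("train", "val", "test"):
        split_file = root / f"{split}.txt"
        if not split_file.exists():
            continue
        ids = [s.strip() for s in split_file.read_text().splitlines() if s.strip()]
        kept = [s for s in ids if s not in bad]
        if len(kept) != len(ids):
            actions.append(f"rewrite {split_file.name}: {len(ids)} -> {len(kept)} ids")
            if not dry_run:
                split_file.write_text("".join(k + "\n" for k in kept))
    # 2. Remove the matching files from the data folders.
    for folder, ext in (("new_joints", ".npy"), ("new_joint_vecs", ".npy"), ("texts", ".txt")):
        for sample_id in sorted(bad):
            path = root / folder / f"{sample_id}{ext}"
            if path.exists():
                actions.append(f"delete {folder}/{path.name}")
                if not dry_run:
                    path.unlink()
    # 3. Remove the cached tmp folder so stale data is not reused.
    if tmp_dir is not None and Path(tmp_dir).is_dir():
        actions.append(f"delete cache {tmp_dir}")
        if not dry_run:
            shutil.rmtree(tmp_dir)
    return actions
```

Running it once with dry_run=True and reading the returned list before repeating with dry_run=False keeps the deletion step reviewable.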

lucascolley commented 7 months ago

> I hit the same issue and using scipy==1.11.1 solved my problem, although I'm not sure which version is mathematically more correct

If anyone has any input on which version is more mathematically correct, that would be great.

zybermonk commented 7 months ago

> > I hit the same issue and using scipy==1.11.1 solved my problem, although I'm not sure which version is mathematically more correct
>
> If anyone has any input on which version is more mathematically correct, that would be great.

Just adding to this question: changing the SciPy version indirectly requires finding a matching numpy version as well.

lucascolley commented 7 months ago

At least a partial fix has come through at https://github.com/scipy/scipy/pull/20212. We recommend trying again once SciPy 1.13.0 is released, to see whether the problems are gone.

Yashaswini-Srirangarajan commented 6 months ago

> At least a partial fix has come through at scipy/scipy#20212. We recommend trying again once SciPy 1.13.0 is released, to see whether the problems are gone.

@lucascolley, this fix now works for me :) Thanks!!

lucascolley commented 6 months ago

fantastic - 1.13.0 should be out within the next few weeks

lucascolley commented 6 months ago

> fantastic - 1.13.0 should be out within the next few weeks

It was just released.

OpenMotionLab / MotionGPT

[Training] Stops with an error : The algorithm failed to converge because the input matrix contained non-finite values. #69