Loss skyrockets when training the landmark generator on the CREMA-D dataset

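When I train the landmark generator on CREMA-D instead of LRS2, I run into the problem shown in the log below: the running L1 loss climbs steadily from epoch to epoch (2.4995 at epoch 0 to 21.3905 at epoch 6), while the running velocity loss slowly decreases. I have tried modifying the hyperparameters and reducing the model complexity, but neither helped. Are there any suggestions for fixing this?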
Project_name: landmarkT5_d512_fe1024_lay4_head4
Init dataset, filtering very short videos.....
        Complete, with available vids:  7439 

Saved checkpoint: ./checkpoints/landmark_generation/Pro_landmarkT5_d512_fe1024_lay4_head4/landmarkT5_d512_fe1024_lay4_head4_epoch_0_checkpoint_step000000000.pth
Evaluating model for 25 epochs
/home/boot/miniconda3/envs/IP_LAP/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py:79: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
100%|██████████| 25/25 [00:45<00:00,  1.81s/it]
eval_L1_loss 1.0433004522323608 global_step: 0
eval_velocity_loss 3.7854259157180787 global_step: 0
epoch: 0 step: 54 running_L1_loss: 2.4995  running_velocity_loss: 7.0191 : 100%|██████████| 55/55 [01:02<00:00,  1.14s/it]
epoch: 1 step: 99 running_L1_loss: 4.1168  running_velocity_loss: 5.4583 :  82%|████████  | 45/55 [00:14<00:02,  3.35it/s]
Evaluating model for 25 epochs
100%|██████████| 25/25 [00:44<00:00,  1.77s/it]
eval_L1_loss 5.165444421768188 global_step: 100
eval_velocity_loss 3.0955049562454224 global_step: 100
epoch: 1 step: 109 running_L1_loss: 4.3274  running_velocity_loss: 5.3871 : 100%|██████████| 55/55 [01:01<00:00,  1.12s/it]
epoch: 2 step: 164 running_L1_loss: 6.9773  running_velocity_loss: 4.7538 : 100%|██████████| 55/55 [00:17<00:00,  3.17it/s]
epoch: 3 step: 219 running_L1_loss: 10.1619  running_velocity_loss: 4.3199 : 100%|██████████| 55/55 [00:17<00:00,  3.10it/s]
epoch: 4 step: 274 running_L1_loss: 13.6431  running_velocity_loss: 4.1921 : 100%|██████████| 55/55 [00:17<00:00,  3.23it/s]
epoch: 5 step: 329 running_L1_loss: 17.3799  running_velocity_loss: 4.0218 : 100%|██████████| 55/55 [00:17<00:00,  3.22it/s]
epoch: 6 step: 384 running_L1_loss: 21.3905  running_velocity_loss: 4.0268 : 100%|██████████| 55/55 [00:17<00:00,  3.19it/s]
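
One check that might help narrow this down is whether the preprocessed CREMA-D landmarks are on the same scale as the LRS2 ones, since an L1 loss that keeps growing can be a sign of targets outside the range the model expects. The sketch below is only an illustration under stated assumptions: it assumes landmarks are available as NumPy arrays of shape (frames, points, 2) holding normalized (x, y) coordinates in [0, 1]. The function name `summarize_landmarks` and this array layout are hypothetical and not taken from the IP_LAP code.

```python
import numpy as np


def summarize_landmarks(landmarks: np.ndarray) -> dict:
    """Return basic range statistics for a (frames, points, 2) landmark array.

    Hypothetical helper: the array layout and the [0, 1] range are assumptions,
    not something confirmed by the IP_LAP preprocessing code.
    """
    if landmarks.ndim != 3 or landmarks.shape[-1] != 2:
        raise ValueError(f"expected shape (frames, points, 2), got {landmarks.shape}")

    finite = np.isfinite(landmarks)
    all_finite = bool(finite.all())
    return {
        # True if any coordinate is NaN or +/-inf
        "has_nan_or_inf": not all_finite,
        # nanmin/nanmax ignore NaN so the range is still informative
        "min": float(np.nanmin(landmarks)),
        "max": float(np.nanmax(landmarks)),
        # True only if every value is finite and lies in [0, 1]
        "in_unit_range": bool(
            all_finite and landmarks.min() >= 0.0 and landmarks.max() <= 1.0
        ),
    }


# Synthetic data standing in for real landmark files (not actual dataset values).
normalized = np.full((4, 10, 2), 0.5)   # already normalized coordinates
pixel_space = normalized * 256.0        # same points expressed in pixels
print(summarize_landmarks(normalized))
print(summarize_landmarks(pixel_space))
```

Running the same summary over a few clips from each dataset and comparing the `min` and `max` values would show whether the two sets of landmarks differ in scale before changing any model hyperparameters.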