FangyunWei / SLRT


NLA-SLR: Pretrained model works fine, training gets stuck at 0.05% accuracy #57

Open foxcpp opened 4 months ago

foxcpp commented 4 months ago

Hello, I am not sure where to start with troubleshooting the following issue.

I am trying to train NLA-SLR on WLASL-2000, but during training the Video-64 top-1 per-class accuracy seems to be stuck at 0.05%, i.e. the model is not learning at all. I use configs/rgb_frame64.yaml without any changes, and the WLASL data is scaled to 256x256 with black padding.

I train on 2x A100 40GB with batch_size: 4. When I use prediction.py to test the pretrained Video-64 model, I obtain 51% accuracy, so the data must be fine. I also tried training on WLASL-100 and stopped at epoch 25, as there was no progress either (validation accuracy was stuck at 1%).
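For context, those stuck values match the chance level of a classifier that ignores its input, which is why it looks like nothing is being learned. A quick arithmetic check:

    # Chance-level top-1 accuracy for a balanced N-way classifier is roughly 1/N,
    # which is exactly where training appears to be stuck.
    for n_classes in (2000, 100):  # WLASL-2000 and WLASL-100
        print(f"WLASL-{n_classes}: chance level = {100.0 / n_classes:.2f}%")
    # WLASL-2000: chance level = 0.05%
    # WLASL-100: chance level = 1.00%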

I modified the code to output training accuracy, and the model appears to be overfitting badly, with training accuracy reaching 99%.

foxcpp commented 4 months ago

I made some adjustments to the default config: doubled the batch size and halved the learning rate. Around epoch 50 the model seems to start actually learning something useful, with validation accuracy going up to 27%. I will see if I can reproduce the paper's results this way. It still looks like a heavily overfit model.
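Roughly what I changed, as a minimal sketch; "batch_size" is taken from the config, but the exact nesting and the learning-rate key name inside configs/rgb_frame64.yaml may differ (the "optimization"/"learning_rate" path below is an assumption):

    import yaml

    # Sketch of the override: double batch_size, halve the learning rate.
    # Key paths are assumptions about where these values live in the config.
    with open("configs/rgb_frame64.yaml") as f:
        cfg = yaml.safe_load(f)

    cfg["batch_size"] = cfg.get("batch_size", 4) * 2
    opt = cfg.setdefault("optimization", {})
    opt["learning_rate"] = opt.get("learning_rate", 1e-3) / 2

    with open("configs/rgb_frame64_adjusted.yaml", "w") as f:
        yaml.safe_dump(cfg, f)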

foxcpp commented 4 months ago

On top of this, it seems to be necessary to disable the Ampere TF32 optimizations, otherwise even training accuracy is stuck at 1% and the model is completely broken:

    # Disable TF32 on Ampere GPUs before training; with it enabled, training gets stuck.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
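If anyone wants to check whether TF32 matmuls make a measurable difference on their own card, here is a small sanity-check sketch (requires an Ampere or newer CUDA GPU; not part of the repo):

    import torch

    # Compare a large matmul with TF32 off vs. on; on Ampere GPUs the TF32
    # result typically deviates from full FP32 at roughly the 1e-3 level.
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    torch.backends.cuda.matmul.allow_tf32 = False
    ref = a @ b  # full FP32 reference

    torch.backends.cuda.matmul.allow_tf32 = True
    tf32 = a @ b  # TF32 result

    rel_err = (ref - tf32).abs().max() / ref.abs().max()
    print(f"max relative error with TF32 enabled: {rel_err.item():.2e}")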

2000ZRL commented 3 months ago

Before training Video-64, you may try to pretrain each single stream (RGB and keypoints) separately. This progressive training strategy is very helpful.
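To illustrate the suggestion, a hedged sketch of what that staged setup could look like in plain PyTorch; the class, attribute, and checkpoint names below are placeholders, not the actual NLA-SLR modules or files:

    import torch
    import torch.nn as nn

    # Placeholder two-stream model: each stream is pretrained separately
    # (stage 1), then the fused model is initialized from those weights and
    # fine-tuned jointly (stage 2). Names are illustrative only.
    class TwoStreamNet(nn.Module):
        def __init__(self, rgb_stream, keypoint_stream, n_classes=2000):
            super().__init__()
            self.rgb_stream = rgb_stream
            self.keypoint_stream = keypoint_stream
            self.head = nn.LazyLinear(n_classes)  # fusion head, trained from scratch

        def forward(self, rgb, kp):
            feat = torch.cat([self.rgb_stream(rgb), self.keypoint_stream(kp)], dim=-1)
            return self.head(feat)

    # Stand-ins for the RGB and keypoint encoders trained alone in stage 1.
    rgb_backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))
    kp_backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))
    model = TwoStreamNet(rgb_backbone, kp_backbone)

    # Stage 2: load the separately pretrained stream weights before joint
    # fine-tuning (paths are hypothetical; strict=False leaves the new
    # fusion head randomly initialized).
    # rgb_ckpt = torch.load("ckpts/rgb_only_best.pth", map_location="cpu")
    # model.rgb_stream.load_state_dict(rgb_ckpt, strict=False)
    # kp_ckpt = torch.load("ckpts/keypoint_only_best.pth", map_location="cpu")
    # model.keypoint_stream.load_state_dict(kp_ckpt, strict=False)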

pooyafayyaz commented 1 month ago

I have the same issue with keypoints. Were you able to solve it @foxcpp? I used a smaller learning rate for the video stream and it worked, though the accuracy is still not high. For keypoints, however, it stays stuck at 0.05%.