HuiGuanLab / UmURL

This repository contains the implementation of our ACM MM 2023 paper, "Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding".
Apache License 2.0

KeyError: 'module.backbone.j_emb.t_embedding.0.weight' #2

Open 223d opened 4 months ago

223d commented 4 months ago

Hi,

While running the evaluation with ./script_action_recognition.sh ntu60_xsub ntu60 cross_subject, the following error occurred:

loading './checkpoints/ntu60_xsub/checkpoint_0450.pth.tar' for sanity check
Traceback (most recent call last):
  File "/home/inspur/ZLQ/UmURL-main/action_recognition.py", line 470, in <module>
    main()
  File "/home/inspur/ZLQ/UmURL-main/action_recognition.py", line 128, in main
    main_worker(0, ngpus_per_node, args)
  File "/home/inspur/ZLQ/UmURL-main/action_recognition.py", line 260, in main_worker
    sanity_check_encoder(model.state_dict(), args.pretrained)
  File "/home/inspur/ZLQ/UmURL-main/action_recognition.py", line 396, in sanity_check_encoder
    assert ((state_dict[k].cpu() == state_dict_pre[k_pre]).all()), \
KeyError: 'module.backbone.j_emb.t_embedding.0.weight'

Thank you!

ssk1997 commented 4 months ago

May I ask whether you trained on a single GPU? If so, the saved model parameters will not have the 'module.' prefix, while the code expects it because it defaults to multi-GPU training. I suspect this is the cause of the error. If the problem persists, please contact me. @223d

223d commented 4 months ago

Yes, my machine can only run on a single GPU. How should I solve this problem? Thank you!

ssk1997 commented 4 months ago

If the model was not trained with DistributedDataParallel, the 'module.' prefix is not needed during the sanity check. Please change k_pre = 'module.' + k to k_pre = k accordingly. @223d

https://github.com/HuiGuanLab/UmURL/blob/744278b8ad6ba3fc2290cd2469c4bf315c571935/action_recognition.py#L395
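
As an alternative to editing that line by hand, the key lookup could accept both naming schemes. The following is only a minimal sketch: resolve_pretrained_key is a hypothetical helper that is not part of the repository, and the two dictionaries are stand-ins for checkpoint state dicts rather than real UmURL weights.

```python
def resolve_pretrained_key(key, pretrained_state, prefix="module."):
    """Return the name under which `key` is stored in `pretrained_state`.

    Hypothetical helper: tries the DistributedDataParallel-style name
    ('module.' + key) first, then falls back to the bare name.
    """
    for candidate in (prefix + key, key):
        if candidate in pretrained_state:
            return candidate
    raise KeyError(f"neither '{prefix + key}' nor '{key}' found in checkpoint")


# Stand-in state dicts (values are placeholders, not real tensors).
ddp_checkpoint = {"module.backbone.j_emb.t_embedding.0.weight": 1.0}
single_gpu_checkpoint = {"backbone.j_emb.t_embedding.0.weight": 1.0}

key = "backbone.j_emb.t_embedding.0.weight"
print(resolve_pretrained_key(key, ddp_checkpoint))
print(resolve_pretrained_key(key, single_gpu_checkpoint))
```

With a helper like this, the sanity check would look up state_dict_pre[resolve_pretrained_key(k, state_dict_pre)] regardless of how the checkpoint was saved.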

223d commented 4 months ago

Hi, is this issue caused by an error in the checkpoint file's data from running on a single GPU?

A new problem has arisen:

=> loading './checkpoints/ntu60_xsub/checkpoint_0450.pth.tar' for sanity check
Traceback (most recent call last):
  File "/home/inspur/ZLQ/UmURL-main/action_recognition.py", line 470, in <module>
    main()
  File "/home/inspur/ZLQ/UmURL-main/action_recognition.py", line 128, in main
    main_worker(0, ngpus_per_node, args)
  File "/home/inspur/ZLQ/UmURL-main/action_recognition.py", line 260, in main_worker
    sanity_check_encoder(model.state_dict(), args.pretrained)
  File "/home/inspur/ZLQ/UmURL-main/action_recognition.py", line 396, in sanity_check_encoder
    assert ((state_dict[k].cpu() == state_dict_pre[k_pre]).all()), \
AssertionError: backbone.j_emb.t_embedding.0.weight is changed in linear classifier training.

ssk1997 commented 4 months ago

Have you modified any part of the code? The issue may be caused by incorrect requires_grad attributes on some parameters. If you want to run on a single GPU, I recommend leaving the code unchanged and keeping DDP training. You only need to change the command in the pretrain script to CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 pretrain.py.
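
To see whether requires_grad is the culprit, one could list every parameter outside the classifier head that is still trainable before linear evaluation starts. This is a rough sketch under assumptions: find_unfrozen_backbone_params is a hypothetical helper, the "fc." head prefix and the parameter names are illustrative guesses rather than UmURL's actual layer names, and SimpleNamespace objects stand in for real parameters so the snippet runs without PyTorch.

```python
from types import SimpleNamespace


def find_unfrozen_backbone_params(named_parameters, head_prefixes):
    """Return names of parameters outside the head that still require gradients.

    `named_parameters` is any iterable of (name, param) pairs where `param`
    has a `requires_grad` attribute, e.g. `model.named_parameters()`.
    """
    head_prefixes = tuple(head_prefixes)
    return [
        name
        for name, param in named_parameters
        if param.requires_grad and not name.startswith(head_prefixes)
    ]


# Stand-in parameters; names and the "fc." head prefix are assumptions.
params = [
    ("backbone.j_emb.t_embedding.0.weight", SimpleNamespace(requires_grad=True)),
    ("backbone.j_emb.t_embedding.0.bias", SimpleNamespace(requires_grad=False)),
    ("fc.weight", SimpleNamespace(requires_grad=True)),
]

leaky = find_unfrozen_backbone_params(params, head_prefixes=("fc.",))
print(leaky)  # ['backbone.j_emb.t_embedding.0.weight']
```

An empty list would suggest the backbone is frozen as linear evaluation expects; any name in the list is a parameter that the optimizer could still update, which would trip the sanity check.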

223d commented 4 months ago

Okay, thank you. I did modify the code, so I will retrain the model.