ahmetgunduz / Real-time-GesRec

Real-time Hand Gesture Recognition with PyTorch on EgoGesture, NvGesture, Jester, Kinetics and UCF101
https://arxiv.org/abs/1901.10323
MIT License

How the pre-trained model on Jester could be used to train EgoGesture? #50

Closed · wxjames closed this issue 4 years ago

wxjames commented 4 years ago

When I tried to use the classification model pre-trained on Jester to train on the EgoGesture dataset, it showed:

    RuntimeError: Error(s) in loading state_dict for DataParallel:
        size mismatch for module.fc.weight: copying a param with shape torch.Size([27, 2048]) from checkpoint, the shape in current model is torch.Size([83, 2048]).
        size mismatch for module.fc.bias: copying a param with shape torch.Size([27]) from checkpoint, the shape in current model is torch.Size([83]).

It seems this is because Jester and EgoGesture have different numbers of gesture classes. How should I change this parameter?

My script is:

    #!/bin/bash

    python main.py \
        --root_path ~/ \
        --video_path /home/wisccitl/Desktop/EgoGesture \
        --annotation_path Real-time-GesRec/annotation_EgoGesture/egogestureall_but_None.json \
        --result_path Real-time-GesRec/results \
        --resume_path Real-time-GesRec/models/jester_resnext_101_RGB_32.pth \
        --dataset egogesture \
        --sample_duration 32 \
        --learning_rate 0.01 \
        --model resnext \
        --model_depth 101 \
        --resnet_shortcut B \
        --batch_size 64 \
        --n_classes 83 \
        --n_finetune_classes 83 \
        --n_threads 16 \
        --checkpoint 1 \
        --modality RGB \
        --train_crop random \
        --n_val_samples 1 \
        --test_subset test \
        --n_epochs 100

ahmetgunduz commented 4 years ago

You have to provide the Jester model as pretrain_path, not as resume_path.
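
For context, the mismatch comes only from the final classification layer (fc), whose weight shape is [n_classes, 2048]. Independently of the pretrain_path mechanism, the generic PyTorch workaround for loading a checkpoint trained with a different number of classes is to drop the classifier weights and load the rest non-strictly. A minimal sketch, assuming the checkpoint stores its weights under a 'state_dict' key with 'module.'-prefixed names (as the error message suggests); load_pretrained_backbone is an illustrative helper, not part of the repo:

    import torch

    def load_pretrained_backbone(model, checkpoint_path):
        """Transfer only the backbone from a checkpoint trained with a
        different number of classes. `model` is assumed to be the 83-class
        network already wrapped in nn.DataParallel."""
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        state_dict = checkpoint.get('state_dict', checkpoint)

        # Drop the 27-class fc layer; every other tensor matches shape-for-shape.
        filtered = {k: v for k, v in state_dict.items()
                    if not k.startswith('module.fc.')}

        # strict=False tolerates the missing fc weights, which stay randomly
        # initialised and are learned during fine-tuning.
        model.load_state_dict(filtered, strict=False)
        return model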

wxjames commented 4 years ago

Hi Ahmet,

Thanks for your kind reply. But when I change resume_path to pretrain_path, it still shows:

    RuntimeError: Error(s) in loading state_dict for DataParallel:
        size mismatch for module.fc.weight: copying a param with shape torch.Size([27, 2048]) from checkpoint, the shape in current model is torch.Size([83, 2048]).
        size mismatch for module.fc.bias: copying a param with shape torch.Size([27]) from checkpoint, the shape in current model is torch.Size([83]).

My script is:

    #!/bin/bash

    python main.py \
        --root_path ~/ \
        --video_path /home/wisccitl/Desktop/EgoGesture \
        --annotation_path Real-time-GesRec/annotation_EgoGesture/egogestureall_but_None.json \
        --result_path Real-time-GesRec/results \
        --pretrain_path Real-time-GesRec/models/jester_resnext_101_RGB_32.pth \
        --dataset egogesture \
        --sample_duration 32 \
        --learning_rate 0.01 \
        --model resnext \
        --model_depth 101 \
        --resnet_shortcut B \
        --batch_size 64 \
        --n_classes 83 \
        --n_finetune_classes 83 \
        --n_threads 16 \
        --checkpoint 1 \
        --modality RGB \
        --train_crop random \
        --n_val_samples 1 \
        --test_subset test \
        --n_epochs 100

ahmetgunduz commented 4 years ago

I believe you also need to change either n_classes or n_finetune_classes to 27
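
To expand on that: with pretrain_path, this family of codebases typically builds the network with n_classes matching the checkpoint (27 for Jester), loads the weights, and then swaps the final fc layer for one with n_finetune_classes outputs (83 for EgoGesture). A minimal sketch of that pattern, not the repo's exact model.py; finetune_from_pretrained is an illustrative helper:

    import torch
    import torch.nn as nn

    def finetune_from_pretrained(model, pretrain_path, n_finetune_classes=83):
        """Sketch of the --pretrain_path / --n_finetune_classes mechanics.
        `model` is assumed to be built with --n_classes 27 (matching the
        Jester checkpoint) and wrapped in nn.DataParallel."""
        checkpoint = torch.load(pretrain_path, map_location='cpu')
        model.load_state_dict(checkpoint['state_dict'])  # shapes match: 27 classes

        # Replace the classifier with a fresh head for the EgoGesture classes.
        in_features = model.module.fc.in_features       # 2048 for ResNeXt-101
        model.module.fc = nn.Linear(in_features, n_finetune_classes)
        return model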

wxjames commented 4 years ago

Yeah, thanks so much, Ahmet. It works when I change n_classes to 27; I had also figured that out a few days ago. But I came across another problem: the accuracy of my model is very poor. I trained the model for 100 epochs using the pretrained Jester model as pretrain_path, and the training accuracy is about 0.006. Then I continued training for another 100 epochs using the saved model as resume_path, but the accuracy does not improve. The paper says "After pretraining on Jester dataset, training is completed after 5 more epochs", so I must be missing something. The same thing happens when training the detector: I trained it for 100 epochs without a pretrained model, but the test accuracy is only about 0.5. Could you please give me any suggestions?

wxjames commented 4 years ago

Train.log looks like this:

    epoch  loss               acc                   precision              recall                lr
    1      4.442830331317593  0.005688124306326304  0.0016544988632515136  0.004992128306654597  0.01
    2      4.443272197418551  0.006659267480577136  0.0017789094312178268  0.006099737649016919  0.01
    3      4.443384901929511  0.006104328523862375  0.0018375945540200429  0.00497895665099739   0.01
    4      4.441852415044617  0.005826859045504994  0.0013613769776548484  0.005415032942820301  0.01
    5      4.443037118287251  0.005341287458379578  0.0015028268158382163  0.004554899511051034  0.01

wxjames commented 4 years ago

> I believe you also need to change either n_classes or n_finetune_classes to 27

I still have not figured out the problem with the training accuracy.

ahmetgunduz commented 4 years ago

Did you change anything else? I do not see any problem with the parameters. Can you please check whether your dataset is properly installed and preprocessed?
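
If in doubt, a quick sanity check on the annotation file catches most preparation problems. A minimal sketch, assuming the JSON follows the 'labels'/'database' layout used by the annotation files in this codebase (adjust the keys and path if yours differ):

    import json
    from collections import Counter

    # Path taken from the command above; adjust to your setup.
    path = 'Real-time-GesRec/annotation_EgoGesture/egogestureall_but_None.json'

    with open(path) as f:
        data = json.load(f)

    labels = data.get('labels', [])
    database = data.get('database', {})
    subsets = Counter(v.get('subset', 'unknown') for v in database.values())

    print('number of classes:', len(labels))    # should be 83 for this annotation
    print('videos per subset:', dict(subsets))  # training / validation / testing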

wxjames commented 4 years ago

Hi Ahmet,

Thanks again for your answers. I think the dataset location and preprocessing should be fine, because the code runs well when I run testing with the pretrained EgoGesture model as resume_path. But when I do transfer learning with the pretrained Jester model, the accuracy just does not increase.

Maybe it is caused by this: when I first ran the code, I got the errors "RuntimeError: CUDA out of memory. Tried to allocate 784.00 MiB (GPU 0; 11.89 GiB total capacity; 10.58 GiB already allocated; 187.56 MiB free; 10.60 GiB reserved in total by PyTorch)" and "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn".

So I added two lines of code in train.py, like this:

    with torch.no_grad():                       # added line: disables autograd for the forward pass
        outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = Variable(loss, requires_grad=True)   # added line: re-wraps the already-detached loss

The first and last lines are the ones I added. After that, the code ran successfully, but I think this may be why the loss does not decrease.
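
For reference, wrapping the forward pass in torch.no_grad() detaches the outputs from the autograd graph, and re-wrapping the loss in Variable(..., requires_grad=True) only silences the error: loss.backward() then has no path back to the network weights, so they never update and the accuracy stays flat. A tiny self-contained illustration of the standard training step (dummy model and data; the same structure applies inside train.py), with memory pressure better handled by a smaller --batch_size:

    import torch
    import torch.nn as nn

    # Dummy model/data for illustration only.
    model = nn.Linear(10, 3)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    inputs = torch.randn(8, 10)
    targets = torch.randint(0, 3, (8,))

    outputs = model(inputs)              # no torch.no_grad() here
    loss = criterion(outputs, targets)   # loss already carries a grad_fn

    optimizer.zero_grad()
    loss.backward()                      # gradients flow back into the model
    optimizer.step()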

Below is the information about my CUDA setup. Could you help me with that? Or should I increase my GPU memory?

(attached screenshot: Screenshot from 2020-04-17 18-13-03)

wxjames commented 4 years ago

Could you please share your CUDA version and total memory?

wxjames commented 4 years ago

I finally figured it out. The accuracy did not improve because of those two added lines. To deal with the limited CUDA memory, I decreased the batch_size instead. Now the code runs well and the accuracy improves, although more training epochs are needed. Anyway, thanks for your suggestions, Ahmet. I am going to close this issue.
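
As a closing note, when a smaller --batch_size slows convergence, gradient accumulation is a common way to keep the effective batch size while fitting in GPU memory. A minimal sketch with dummy model and data, not part of this repo's train.py:

    import torch
    import torch.nn as nn

    # Dummy model/data for illustration only.
    model = nn.Linear(10, 3)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    accum_steps = 4  # e.g. 4 micro-batches of 16 behave roughly like batch_size 64
    optimizer.zero_grad()
    for step in range(accum_steps):
        inputs = torch.randn(16, 10)
        targets = torch.randint(0, 3, (16,))
        loss = criterion(model(inputs), targets) / accum_steps  # average over micro-batches
        loss.backward()                                         # gradients accumulate in .grad
    optimizer.step()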