Dear @NeverMoreLCH thank you for reaching out. Kindly provide me with a snapshot of your environment and the command you are using to train the models.
I will try and reproduce your issue on my end and give you a solution soon.
Dear @Soldelli , here is my conda environment. I use 'python train_net.py --config-file configs/activitynet.yml OUTPUT_DIR outputs/activitynet' and 'python train_net.py --config-file configs/tacos.yml OUTPUT_DIR outputs/tacos' to train the models, and I only change the batch size to 16 when training on TACoS.
Thanks! env_vlg.yaml.TXT
Dear @NeverMoreLCH I apologize for the delay. I verified that I could reproduce evaluation and training on my side using my environment. I was not able to easily install the environment you provided, but I will try my best to create one manually and match all the libraries.
I will do my best to help you. We can schedule a zoom call if you cannot solve the issues. If that is the case, please shoot me an email.
Dear @Soldelli , these are my training logs for ActivityNet: download link. In addition, I can confirm that the test results with your pre-trained models are the same as those in your repo.
Thanks for your reply!
Dear @NeverMoreLCH I notice from the logs that the loss is decreasing while the evaluation performance is stable and not changing. Moreover, that performance is on par with a random selection of the proposals. In my experience, after the first epoch you should reach at least 14% for R@1 IoU=0.7.
The relevant functions live in the lib/engine/trainer.py file. Did you make any changes to those functions? I suggest cloning the repo from scratch and trying to train before making any changes. According to my tests, it should work off the shelf. Feel free to reach out anytime.
Best, Mattia
Excellent work!
I trained VLG-Net with your code and the provided resources, and I have two issues I need your help with.
- I trained VLG-Net on ActivityNet1.3 with your code, but at inference I only get Rank@1,mIoU@0.5=15.94 and Rank@1,mIoU@0.7=4.88. I don't know what's wrong or how to fix it.
- When I train VLG-Net on the TACoS dataset, the training loss for the first two batches of epoch 1 looks normal, and then it becomes 'nan'.
Thanks!!!
Hello! I also ran into problem 2. Could you tell me how you finally solved it?
Hi @tujun233 I am currently out of office, so I cannot delve into the code right now.
- Did you try reducing the learning rate as suggested?
- If that does not help, try debugging with IPDB: set a breakpoint at the loss-calculation line and double-check where the NaN first appears (see the sketch below). For this, I suggest setting the number of workers to zero in the config file. This creates a data loader with a single thread and lets you set breakpoints in any part of the code (even inside the training loop or the model functions).
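A minimal sketch of that workflow; the model, data, and loss below are toy placeholders, not the repo's actual trainer objects:

```python
import torch
import torch.nn as nn

# Toy stand-ins: in practice, the model, data loader, and loss come from the
# repo's own build functions used inside lib/engine/trainer.py.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data_loader = torch.utils.data.DataLoader(
    torch.randn(64, 8),
    batch_size=4,
    num_workers=0,  # single-threaded loading, so breakpoints work anywhere
)

for iteration, batch in enumerate(data_loader):
    loss = model(batch).pow(2).mean()  # placeholder for the real loss computation
    if torch.isnan(loss):
        import ipdb; ipdb.set_trace()  # inspect the batch/activations at the first NaN
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```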
Hi! I tried reducing the learning rates as suggested, but it didn't work. Through my debugging, I found that the GloVe tokens are fed into a 5-layer LSTM in lstm_syntacGCN_encoder in /lib/modeling/language_modeling.py, and its outputs contain many NaNs. This happens when I train on the TACoS dataset with the batch size changed to 8 in the TACoS config file. I don't know why, and I will check whether it still happens with the original batch size of 32.
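For reference, a minimal sketch of one way to localize such NaNs with forward hooks; the toy encoder below only stands in for the actual model in /lib/modeling/language_modeling.py:

```python
import torch
import torch.nn as nn

def make_nan_hook(name):
    def hook(module, inputs, output):
        # LSTM modules return (output, (h_n, c_n)); only check tensor outputs.
        outs = output if isinstance(output, tuple) else (output,)
        for o in outs:
            if torch.is_tensor(o) and torch.isnan(o).any():
                raise RuntimeError(f"NaN in the output of module '{name}'")
    return hook

# Toy stand-in for the language encoder (embedding + 5-layer LSTM).
encoder = nn.Sequential(
    nn.Embedding(1000, 300),
    nn.LSTM(300, 256, num_layers=5),
)
for name, module in encoder.named_modules():
    module.register_forward_hook(make_nan_hook(name))

tokens = torch.randint(0, 1000, (12, 2))  # (seq_len, batch)
_ = encoder(tokens)  # raises as soon as any submodule emits NaN
```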
Dear @Soldelli , I cloned the repo from scratch and tried to train without making any changes, but I ran into the same two problems as @NeverMoreLCH .
Dear @tujun233 I apologize for the inconvenience. I will reach out with a solution as soon as I am back in the office.
Just to make sure: are you using the default config file and the same library versions reported in the repo? Also, which GPU are you training on?
I never faced the issue you are reporting, so I need to be able to reproduce it before I can find a solution. When I cloned the repo from scratch, I was able to train the model correctly.
Best, Mattia
Dear @Soldelli , I apologize for my delay. Yes, I am sure I am using the default config file and the same library versions as reported. But I don't have a V100; I train on a single RTX 2080 Ti. I am also trying my best to solve the problem on my end.
Hi! Good news: I have solved the problems. I installed pytorch 1.7.1 and torchvision 0.8.2 from the official website, and then the code works. Previously, to save time, I had installed pytorch 1.5.1 from https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/. Maybe something was wrong with that pytorch version. Thank you for your help! Best, tujun233
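P.S. For anyone hitting the same issue, an install from the official channel looks something like this (the CUDA toolkit version below is an assumption; pick the one matching your driver from the official instructions):

```
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.2 -c pytorch
```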
Dear @tujun233 that is great news indeed. You might need to run a hyperparameter search over the learning parameters now that you have reduced the batch size. Please feel free to reach out for any other doubts or issues.
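As a rough starting point (a common heuristic, not something verified for this repo; the reference values below are assumptions, not the actual defaults), you could scale the learning rate linearly with the batch size:

```python
# Linear-scaling heuristic: keep lr / batch_size roughly constant.
# Reference values are illustrative assumptions, not the repo's defaults.
reference_batch_size = 32
reference_lr = 1e-3

new_batch_size = 8
new_lr = reference_lr * new_batch_size / reference_batch_size
print(new_lr)  # 0.00025
```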
Cheers.