Dear @NeverMoreLCH thank you for reaching out. Kindly provide me with a snapshot of your environment and the command you are using to train the models.
I will try and reproduce your issue on my end and give you a solution soon.
Dear @Soldelli , here is my conda environment. I use 'python train_net.py --config-file configs/activitynet.yml OUTPUT_DIR outputs/activitynet' and 'python train_net.py --config-file configs/tacos.yml OUTPUT_DIR outputs/tacos' to train the models, and I only change the batch size to 16 when training on TACoS.
Thanks! env_vlg.yaml.TXT
Dear @NeverMoreLCH I apologize for the delay. I verified that I could reproduce evaluation and training on my side using my environment. I was not able to easily install the environment you provided, but I will try my best to create one manually and match all the libraries.
I will do my best to help you. We can schedule a zoom call if you cannot solve the issues. If that is the case, please shoot me an email.
Dear @Soldelli , these are my training logs for ActivityNet: download link. In addition, I can confirm that the test results with your pre-trained models are the same as those in your repo.
Thanks for your reply!
Dear @NeverMoreLCH I notice from the logs that the loss is decreasing while the evaluation performance is stable and not changing. Moreover, that performance is on par with a random selection of the proposals. In my experience, after the first epoch you should reach at least 14% for R@1 IoU=0.7.
The relevant functions live in the lib/engine/trainer.py file. Did you make any changes to those functions? I suggest cloning the repo from scratch and trying to train before making any changes. According to my tests, it should work off the shelf. Feel free to reach out anytime.
Best, Mattia
Excellent work!
I trained VLG-Net with your code and the provided resources, and I have two issues I need your help with.
- I trained VLG-Net on ActivityNet1.3 with your code, but at inference I only get Rank@1,mIoU@0.5=15.94 and Rank@1,mIoU@0.7=4.88. I don't know what's wrong or how to fix it.
- When I train VLG-Net on the TACoS dataset, the training loss for the first two batches of epoch 1 looks normal, and then it becomes 'nan'.
Thanks!!!
Hello! I also ran into problem 2. Could you tell me how you finally solved it?
Hi @tujun233 I am currently out of office, so I cannot delve into the code right now.
- Did you try reducing the learning rate as suggested?
- If that does not help, try debugging with IPDB: set a breakpoint at the loss-calculation line and double-check where the NaN first appears (see the sketch below). For this, I suggest setting the number of workers to zero in the config file. This creates a data loader with a single thread and lets you set breakpoints in any part of the code (even inside the training loop or the model functions).
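A minimal sketch of that workflow; the model, data, and loss below are toy placeholders, not the repo's actual trainer objects:

```python
import torch
import torch.nn as nn

# Toy stand-ins: in practice, the model, data loader, and loss come from the
# repo's own build functions used inside lib/engine/trainer.py.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data_loader = torch.utils.data.DataLoader(
    torch.randn(64, 8),
    batch_size=4,
    num_workers=0,  # single-threaded loading, so breakpoints work anywhere
)

for iteration, batch in enumerate(data_loader):
    loss = model(batch).pow(2).mean()  # placeholder for the real loss computation
    if torch.isnan(loss):
        import ipdb; ipdb.set_trace()  # inspect the batch/activations at the first NaN
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```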
Hi! I tried reducing the learning rates as suggested, but it didn't work. Through my debugging, I found that the GloVe tokens are fed into a 5-layer LSTM in lstm_syntacGCN_encoder in /lib/modeling/language_modeling.py, and its outputs contain many NaNs. This happens when I train on the TACoS dataset with the batch size changed to 8 in the TACoS config file. I don't know why, and I will check whether it still happens with the original batch size of 32.
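For reference, a minimal sketch of one way to localize such NaNs with forward hooks; the toy encoder below only stands in for the actual model in /lib/modeling/language_modeling.py:

```python
import torch
import torch.nn as nn

def make_nan_hook(name):
    def hook(module, inputs, output):
        # LSTM modules return (output, (h_n, c_n)); only check tensor outputs.
        outs = output if isinstance(output, tuple) else (output,)
        for o in outs:
            if torch.is_tensor(o) and torch.isnan(o).any():
                raise RuntimeError(f"NaN in the output of module '{name}'")
    return hook

# Toy stand-in for the language encoder (embedding + 5-layer LSTM).
encoder = nn.Sequential(
    nn.Embedding(1000, 300),
    nn.LSTM(300, 256, num_layers=5),
)
for name, module in encoder.named_modules():
    module.register_forward_hook(make_nan_hook(name))

tokens = torch.randint(0, 1000, (12, 2))  # (seq_len, batch)
_ = encoder(tokens)  # raises as soon as any submodule emits NaN
```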
Dear @Soldelli , I cloned the repo from scratch and tried to train without making any changes, but I ran into the same two problems as @NeverMoreLCH .
Dear @tujun233 I apologize for the inconvenience. I will reach out with a solution as soon as I am back in the office.
Just to make sure: are you using the default config file and the same library versions reported in the repo? Also, which GPU are you training on?
I never faced the issue you are reporting, so I need to be able to reproduce it before I can find a solution. When I cloned the repo from scratch, I was able to train the model correctly.
Best, Mattia
Dear @Soldelli , I apologize for my delay. Yes, I am sure I am using the default config file and the same library versions as reported. But I don't have a V100; I train on a single RTX 2080 Ti. I am also trying my best to solve the problem on my end.
Hi! Good news: I have solved the problems. I installed pytorch 1.7.1 and torchvision 0.8.2 from the official website, and then the code works. Previously, to save time, I had installed pytorch 1.5.1 from https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/. Maybe something was wrong with that pytorch version. Thank you for your help! Best, tujun233
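P.S. For anyone hitting the same issue, an install from the official channel looks something like this (the CUDA toolkit version below is an assumption; pick the one matching your driver from the official instructions):

```
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.2 -c pytorch
```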
Dear @tujun233 that is great news indeed. You might need to run a hyperparameter search over the learning parameters now that you have reduced the batch size. Please feel free to reach out for any other doubts or issues.
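As a rough starting point (a common heuristic, not something verified for this repo; the reference values below are assumptions, not the actual defaults), you could scale the learning rate linearly with the batch size:

```python
# Linear-scaling heuristic: keep lr / batch_size roughly constant.
# Reference values are illustrative assumptions, not the repo's defaults.
reference_batch_size = 32
reference_lr = 1e-3

new_batch_size = 8
new_lr = reference_lr * new_batch_size / reference_batch_size
print(new_lr)  # 0.00025
```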
Cheers.