ekosman / AnomalyDetectionCVPR2018-Pytorch

Pytorch version of - https://github.com/WaqasSultani/AnomalyDetectionCVPR2018

Training Loss Value #39

Closed Rinavp closed 1 year ago

Rinavp commented 3 years ago

Could you please provide the value of the training loss after the training was completed or maybe a plot of your training loss curve over the 20000 epochs? Thank you!

fatandmarsiano commented 3 years ago

[training loss plot]

It could probably be trained for more iterations; the loss hasn't plateaued yet.

On a related note, is there any reason why the epoch number is 20k? Each epoch only takes a few seconds of training, which is about the same as the duration of a single iteration and quite different from how batches and epochs usually work.

ekosman commented 3 years ago

This repo tries to faithfully implement the algorithm described in the original paper, so it uses the parameters mentioned there. You can change the epoch number if you like. Please report your results so that I can mention them in the repo description.

The epoch number is 20K because the number of iterations in the original paper is 20K. For the purpose of this implementation, each batch represents one forward/backward pass. I could certainly use a different implementation strategy, but I think this way works just fine.
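For illustration, a minimal sketch (hypothetical function and variable names, not the repo's actual code) of a loop in which each "epoch" is exactly one forward/backward pass over a single sampled batch:

```python
def train(model, data_iter, criterion, optimizer, num_iterations=20_000):
    """One 'epoch' here is a single forward/backward pass over one sampled batch."""
    model.train()
    for iteration in range(num_iterations):   # 20K "epochs" == 20K iterations
        features, labels = next(data_iter)    # one batch of bag features
        optimizer.zero_grad()
        scores = model(features)              # forward pass
        loss = criterion(scores, labels)      # e.g. the MIL ranking loss
        loss.backward()                       # backward pass
        optimizer.step()
```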

fatandmarsiano commented 3 years ago

Yeah, I limited the epoch number to 20k because it took ~22 hours of training on a V100 (against Colab's 24-hour instance limit), and about 3 times longer on my own machine. I'm planning to increase the epoch limit once I figure out how to resume training from a checkpoint. If you already have a resume-training routine ready, that would really help.
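For reference, a minimal checkpoint save/resume sketch using standard PyTorch calls; the file path and dictionary keys below are hypothetical and may differ from whatever the repo actually uses:

```python
import torch

CKPT_PATH = "exps/checkpoint.pth"  # hypothetical location

def save_checkpoint(model, optimizer, iteration, path=CKPT_PATH):
    torch.save({
        "iteration": iteration,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path=CKPT_PATH):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["iteration"]  # continue training from the next iteration
```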

Another follow-up question: if each batch (or epoch?) represents one forward/backward pass, how does the dataloader cycle through the entire dataset? Does it keep sampling randomly after each epoch until the maximum epoch number is reached?

And thanks for the response @ekosman, great reimplementation btw

ekosman commented 3 years ago

Thanks @fatandmarsiano. The PyTorch dataloader outputs one sample at a time. As in the original paper, each iteration (epoch/batch here) consists of two bags: one of 30 anomalous videos and one of 30 normal videos. Every time the loader is called it samples a video (a feature vector here); it alternates between the two states (normal/anomalous) and each time pulls a video from the corresponding subset. As a result, half of a batch will be normal videos and the other half anomalous.
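A rough sketch of that alternating sampling, with hypothetical class and variable names (not the repo's actual data_loader.py):

```python
import random
from torch.utils.data import Dataset, DataLoader

class AlternatingFeatureDataset(Dataset):
    """Hypothetical sketch: alternate between normal and anomalous feature vectors
    so a batch of 60 ends up with 30 of each, as described in the paper."""

    def __init__(self, normal_features, anomalous_features):
        self.normal = normal_features        # list of per-video feature tensors
        self.anomalous = anomalous_features

    def __len__(self):
        return 2 * max(len(self.normal), len(self.anomalous))

    def __getitem__(self, idx):
        if idx % 2 == 0:                     # even index -> anomalous video
            return random.choice(self.anomalous), 1
        return random.choice(self.normal), 0 # odd index  -> normal video

# loader = DataLoader(AlternatingFeatureDataset(normal, anomalous), batch_size=60)
```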

Rinavp commented 3 years ago

Hi, we used the weights from your file model_ep-20000.pth in the exps folder, but the AUC comes out to be 0.56. Is there a problem with the file? Do you have any suggestions? [screenshot attached]

ekosman commented 3 years ago

@Rinavp Looks like the model on the master branch is the old one, before updating the pre-computed features. I'll upload the current model soon, probably this weekend.

Rinavp commented 3 years ago

Thank you!

ekosman commented 3 years ago

Please check the v2 model @Rinavp

Rinavp commented 3 years ago

Thank you! Really appreciate you taking time out to answer all my questions!

Rinavp commented 3 years ago

Hi, we used version 2 of the trained model weights (after 20000 epochs) from the exps folder and still found the AUC to be only 0.68. [screenshot attached]

Does this model need to be trained further to get the AUC of 0.74 shown on the homepage? Were the optimizer, learning rate, and regularization parameters the same as the ones in the code on GitHub when you got an AUC of 0.74? Could you please provide some suggestions to improve the score? Thanks!
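For context, frame-level AUC is typically computed by concatenating per-frame ground-truth labels and predicted anomaly scores over the whole test set and passing them to scikit-learn's roc_auc_score. The helper below is only an illustrative sketch with dummy data, not the repo's evaluation script:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(video_labels, video_scores):
    """Hypothetical helper: concatenate per-video frame labels (0/1) and
    predicted anomaly scores across the test set, then compute ROC AUC."""
    labels = np.concatenate(video_labels)
    scores = np.concatenate(video_scores)
    return roc_auc_score(labels, scores)

# Example with dummy data: two videos of 4 frames each.
auc = frame_level_auc(
    video_labels=[np.array([0, 0, 1, 1]), np.array([0, 1, 1, 0])],
    video_scores=[np.array([0.1, 0.2, 0.8, 0.7]), np.array([0.3, 0.6, 0.9, 0.2])],
)
print(f"Frame-level AUC: {auc:.3f}")
```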

ekosman commented 3 years ago

@Rinavp I added the loss over 100K+ iterations to the README. I'm aware that the current model doesn't achieve the AUC of 0.75 reported in the original paper; this may be caused by different weights for the C3D model.

ghost commented 3 years ago

@ekosman Hi, I'm trying to load the new weights and I'm getting this: [error screenshot]. Any idea why?

Also, I guess there might be some issues with code versions. In video_demo.py you have: from data_loader import VideoIterTrain

but in data_loader.py, there is only: class VideoIter(data.Dataset):

So it's not possible to just run video_demo.py.

I tried renaming VideoIterTrain to VideoIter, but got errors with the arguments. So I scanned the repository history, found some older data_loader.py files that still contained VideoIterTrain, and after a few fixes I was able to run video_demo.py.

But now I have problems loading the new models.

Thanks.

ekosman commented 1 year ago

The code works flawlessly in all my tests. The codebase has recently gone through a big transformation and many bugs were fixed, so I'm closing this issue for now. Please open a new one if you encounter any other problems!