We use one video at each iteration because running the Viterbi algorithm on a video is slow; if we used all videos in one iteration, it would take a very long time. That said, updating more than one video per iteration is worth trying.
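To illustrate what "one video per iteration" means here, a minimal sketch (all names are illustrative stand-ins, not the actual repo code; `viterbi_decode` is a dummy placeholder for the slow Viterbi alignment):

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a linear frame classifier over random features.
num_classes, feat_dim = 48, 64
model = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
videos = [(torch.randn(300, feat_dim), [0, 5, 12]) for _ in range(10)]

def viterbi_decode(log_probs, transcript):
    # Placeholder only: a real implementation aligns frames to the
    # ordered action transcript by dynamic programming (the slow step);
    # here we just split the video evenly across the transcript.
    n = log_probs.shape[0]
    seg = n // len(transcript)
    labels = [a for a in transcript for _ in range(seg)]
    labels += [transcript[-1]] * (n - len(labels))
    return torch.tensor(labels)

for iteration in range(100):
    features, transcript = random.choice(videos)       # one video per iteration
    log_probs = F.log_softmax(model(features), dim=1)  # (frames, classes)
    pseudo = viterbi_decode(log_probs.detach(), transcript)
    loss = F.nll_loss(log_probs, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```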
@wlin93 Hello, did you run the "NeuralNetwork-Viterbi" paper's code? I downloaded the dataset they provided (Breakfast dataset) and ran their code, but only got a frame accuracy of 0.375708.
@JunLi-Galios Hello, I noticed that you train for 100K iterations while "NeuralNetwork-Viterbi" uses 10K; is your loss much harder to minimize? I am running your code and it takes a long time, maybe 3 days for me...
> @wlin93 Hello, did you run the "NeuralNetwork-Viterbi" paper's code? I downloaded the dataset they provided (Breakfast dataset) and ran their code, but only got a frame accuracy of 0.375708.
Hi, I recall that I could not reproduce the accuracy in the paper either. However, the authors claimed that they used a C++ implementation instead of the Python code in the repo.
@jszgz @wlin93 I think "NeuralNetwork-Viterbi" reports its maximum accuracy. You may refer to https://arxiv.org/pdf/1904.03116.pdf for statistics on the related work.
@jszgz We use more iterations because we don't use a buffer during training. In "NeuralNetwork-Viterbi", the authors use a buffer to store the pseudo ground truth from previous iterations and reuse it during training.
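Roughly, the buffer idea works like this (an illustrative sketch reusing the placeholder helpers from the sketch above, not the actual NNV code):

```python
buffer = []  # (features, pseudo_labels) pairs from previous iterations

def pick_training_example(use_buffer_prob=0.5):
    # NNV-style: with some probability, reuse pseudo ground truth that
    # was decoded in an earlier iteration instead of re-running Viterbi.
    if buffer and random.random() < use_buffer_prob:
        return random.choice(buffer)
    features, transcript = random.choice(videos)
    log_probs = F.log_softmax(model(features), dim=1)
    pseudo = viterbi_decode(log_probs.detach(), transcript)
    buffer.append((features, pseudo))  # store for later reuse
    return features, pseudo
```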
```
Evaluate 252 video files... frame accuracy: 0.503284
```
@JunLi-Galios Hello, I ran your code on the Breakfast dataset and got this accuracy. Is it segmentation accuracy or alignment accuracy? If it is segmentation accuracy, then you've done a good job. Also, would you consider providing the code for alignment accuracy and the other metrics in the paper?
@jszgz It's segmentation accuracy, which is the main metric for the Breakfast dataset. I will provide the other metrics shortly.
> @jszgz It's segmentation accuracy, which is the main metric for the Breakfast dataset. I will provide the other metrics shortly.
Thank you very much.
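For reference, frame accuracy (MoF, mean over frames) is just the fraction of correctly labelled frames over the whole test set; a minimal sketch:

```python
import numpy as np

def frame_accuracy(predictions, ground_truths):
    # MoF: correctly labelled frames / total frames, over all test videos.
    correct = sum(int(np.sum(p == g)) for p, g in zip(predictions, ground_truths))
    total = sum(len(g) for g in ground_truths)
    return correct / total

preds = [np.array([0, 0, 1, 1]), np.array([2, 2, 2])]
gts   = [np.array([0, 1, 1, 1]), np.array([2, 2, 0])]
print(frame_accuracy(preds, gts))  # 5 correct out of 7 frames ≈ 0.714
```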
@JunLi-Galios Hello, I tried to train on the GTEA dataset with your code and got a frame accuracy of 0.36 after 1600 iterations (since there are only 28 videos, 60 times fewer than Breakfast, and the learning rate is decreased after 60% of the iterations). Do I need to adjust some parameters or the optimizer?
@jszgz First, you need to tune the learning rate and the total number of iterations. Second, you need to tune frame_sampling, window_size and step, in both training and inference. The longer the videos are, the larger these hyper-parameters should be.
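As a purely hypothetical illustration of that advice (the parameter names come from the comment above; the values are untested guesses, not recommended settings):

```python
# Untested, illustrative values only.
config = dict(
    learning_rate=0.001,   # retune when the dataset changes
    num_iterations=10000,  # far fewer videos -> far fewer iterations
    frame_sampling=8,      # scale with video length
    window_size=15,        # scale with video length
    step=4,                # scale with video length
)
```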
@jszgz @wlin93 @JunLi-Galios We have done a reproducibility study and comparison of the NNV, CDFL and ISBA methods in our technical report: https://arxiv.org/abs/2005.09743 Here is Table 1 of the paper, comparing the reported accuracy with the average and standard deviation of the accuracy over 5 trials (across all splits of Breakfast):
Model | MoF Reported | MoF Avg (± Std)
---|---|---
ISBA | 38.4 | 36.4 (± 1.0)
NNV | 43.0 | 39.7 (± 2.4)
CDFL | 50.2 | 48.1 (± 2.5)
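For completeness, the "MoF Avg (± Std)" column is just the mean and (sample) standard deviation over the 5 trials; with made-up trial values:

```python
import numpy as np

trials = np.array([49.0, 46.2, 50.1, 47.8, 47.4])  # 5 hypothetical MoF values
print(f"{trials.mean():.1f} (± {trials.std(ddof=1):.1f})")  # 48.1 (± 1.5)
```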
@yassersouri Thanks for sharing your work.
Hi, thanks for the very refreshing paper. I notice that you followed the online learning pattern of the "NeuralNetwork-Viterbi" paper by Richard et al. However, your training loss is defined on one entire video sequence, while their cross-entropy loss is computed on a batch of 21-frame video chunks; I think that is why you did not sample frames from a buffer as they did. I wonder why you kept this online training pattern. Have you tried training on all the training videos instead of one video at each iteration? Regards
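To make the contrast concrete, here is a rough sketch of the two patterns (hypothetical code; the 21-frame chunk length follows the NNV description, and the full-video loss is simplified to cross-entropy, whereas CDFL's actual loss is the discriminative loss from the paper):

```python
import random
import torch
import torch.nn.functional as F

CHUNK = 21  # NNV trains on batches of 21-frame chunks

def nnv_style_loss(model, buffer, batch_size=32):
    # Cross-entropy over a batch of 21-frame chunks sampled from the
    # buffer of previously decoded videos (assumes videos > CHUNK frames).
    losses = []
    for _ in range(batch_size):
        feats, pseudo = random.choice(buffer)
        s = random.randrange(len(pseudo) - CHUNK)
        losses.append(F.cross_entropy(model(feats[s:s + CHUNK]),
                                      pseudo[s:s + CHUNK]))
    return torch.stack(losses).mean()

def full_video_loss(model, feats, pseudo):
    # Loss defined over the entire video sequence, so there is no
    # chunk sampling and hence no need for a buffer.
    return F.cross_entropy(model(feats), pseudo)
```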