alexanderrichard / NeuralNetwork-Viterbi


Results on 50 salads #3

Closed sagniklp closed 5 years ago

sagniklp commented 5 years ago

Hello Alex,

I trained on 50 Salads split 1 for 3K iterations and the test accuracy is 0.386362. Does that look reasonable to you? Or do I need to tune the parameters (assuming the repo's parameters are set for the Breakfast data)? Please advise. Thanks.

alexanderrichard commented 5 years ago

Hi, the results from the paper were based on a C++ implementation. The provided Python implementation is much easier to read and use but results in slightly lower numbers (about -3% on Breakfast). Still, 0.38 sounds too low for 50 Salads. 3K iterations might not be sufficient; I ran the experiments for 10K iterations in the paper.

sagniklp commented 5 years ago

Thanks. Are the parameters the same for both datasets?

alexanderrichard commented 5 years ago

Yes, they were robust on all evaluated datasets.

sagniklp commented 5 years ago

In the paper, you mention that a batch size of 1 works best, but this repo uses a batch size of 512. Is that a typo, or is there something specific going on?

alexanderrichard commented 5 years ago

There are two kinds of batch sizes.

The batch size of 1 in the paper refers to the video level: instead of forwarding 10 or 20 videos at once, forward only a single video, then apply the Viterbi decoding layer, then backpropagate.
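To make that concrete, here is a minimal sketch of one video-level step in PyTorch-style Python. The names `net`, `viterbi_decode`, and `loss_fn` are placeholders for illustration, not this repository's actual classes, and the memory-saving chunking described below is left out here:

```python
import torch

def video_level_step(net, optimizer, viterbi_decode, loss_fn, features, transcript):
    """One update with video-level batch size 1: exactly one video per step."""
    scores = net(features)                                 # forward a single video (n_frames x dim)
    with torch.no_grad():
        frame_labels = viterbi_decode(scores, transcript)  # align the transcript to the frames
    loss = loss_fn(scores, frame_labels)                   # train against the generated alignment
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```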

The batch size of 512 is at frame level. The implementation of the algorithm is as follows: do the forwarding and Viterbi alignment, but do not store all activations for the complete video (that would exceed GPU memory). Once the alignment is computed, we have a frame <-> label correspondence and can chunk the video into smaller sequences (we use 21 frames) and predict the label of the last frame in each of these short sequences. This trick is based on the observation that even LSTMs typically focus on only the most recent 20-30 frames (a similar trick was used in our 2017 CVPR paper, too). The nice thing now: we can batch 512 of these mini-sequences and process them together. Note that this is essentially an implementation trick to circumvent the memory issue for long video sequences.
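A rough sketch of this chunking trick, again with placeholder names rather than the repository's actual code, might look like this:

```python
import torch

def make_chunks(features, frame_labels, window=21):
    """Cut a video into overlapping windows; the target is the last frame's aligned label."""
    chunks, targets = [], []
    for t in range(window - 1, features.shape[0]):
        chunks.append(features[t - window + 1 : t + 1])  # 21-frame context ending at frame t
        targets.append(frame_labels[t])                  # label of the last frame in the window
    return torch.stack(chunks), torch.stack(targets)

def frame_level_updates(net, optimizer, loss_fn, features, frame_labels,
                        window=21, batch_size=512):
    """Process the chunks of one aligned video in mini-batches of 512."""
    chunks, targets = make_chunks(features, frame_labels, window)
    for i in range(0, chunks.shape[0], batch_size):
        scores = net(chunks[i:i + batch_size])           # forward 512 short sequences at once
        loss = loss_fn(scores, targets[i:i + batch_size])
        optimizer.zero_grad()
        loss.backward()                                  # activations kept only for 21 frames
        optimizer.step()
```

The point is that backpropagation only ever needs activations for a 21-frame window at a time, so memory stays bounded no matter how long the video is.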

Hope that helps :)

sagniklp commented 5 years ago

Got it, thank you. I was confused by the two meanings of batch size.

alexanderrichard commented 5 years ago

Hey! I don't know if it's still of interest, but there was a bug in the forwarding that might have been responsible for the low accuracy. I fixed it; sorry for the inconvenience.