arunos728 / MotionSqueeze

Official PyTorch Implementation of MotionSqueeze, ECCV 2020
BSD 2-Clause "Simplified" License

Unable to reproduce the stated accuracy #19

Closed TimandXiyu closed 2 years ago

TimandXiyu commented 2 years ago

It is rather ironic that the authors spent so many pages convincing readers that the model is state-of-the-art in both performance and speed, only for the released code to be very much a prototype: it contains multiple spelling and dimension-mismatch errors and, most importantly, yields accuracy that cannot even match a vanilla ResNet50 TSN.

I hope the authors take this seriously and fix the reproducibility issues, since I still believe they actually did the work but failed to release anything that really works on others' GPUs.

TimandXiyu commented 2 years ago

Tested both this repo's model code and my own working TSN-ResNet50. The model converges much more slowly than vanilla TSN-ResNet50, with a -5% to 0% change relative to it. It kind of works on mini-Kinetics, but only improves by 1-2 points, which can hardly beat older methods like TSM. It simply won't work on HMDB, where it reports 5% lower accuracy than TSN-ResNet50.

arunos728 commented 2 years ago

Hello TimandXiyu, we have just updated the code, so you can use MS-TSM-resnet18/50 from it. Judging from your results, it seems that 'Spatial Correlation Sampler' does not work in your setup. We have also provided another matching layer (Matching_layer_mm in resnet_TSM.py), which consists of matrix multiplication and index rearrangement. The outputs of Matching_layer_scs and Matching_layer_mm should be identical. You can check the result in SpatialCorrelationSampler_and_MatMul.ipynb. If the tensor outputs of the two matching layers are not the same, you should re-install Spatial Correlation Sampler for your environment, or just use Matching_layer_mm (line 310 instead of line 309 in resnet_TSM.py).
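For anyone who wants to sanity-check the equivalence without a working CUDA build, the matmul-based matching can be sketched in plain NumPy. This is an illustrative re-implementation only, not the repo's exact Matching_layer_mm (which also normalizes features and handles batch/time dimensions): it correlates every spatial position of one frame's feature map with every position of the next frame's via a single matrix multiplication.

```python
import numpy as np

def matching_layer_mm(f1, f2):
    """Correlation volume between two frame feature maps via matrix multiplication.

    f1, f2: (C, H, W) feature maps of consecutive frames.
    Returns (H*W, H, W): out[p, y, x] is the dot product between position p
    of f1 (flattened row-major) and position (y, x) of f2.
    Illustrative sketch only, not the repo's exact Matching_layer_mm.
    """
    C, H, W = f1.shape
    a = f1.reshape(C, H * W)   # (C, HW): each column is one position's feature vector
    b = f2.reshape(C, H * W)
    corr = a.T @ b             # (HW, HW): all-pairs dot products
    return corr.reshape(H * W, H, W)
```

If a locally built Spatial Correlation Sampler disagrees with a dense all-pairs computation like this on the overlapping displacements, the CUDA build is the likely culprit.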

I hope you can get good results from this modification. Please let me know if you need any help.

TimandXiyu commented 2 years ago

Updates on my testing:

So, I validated the reliability of my installed CUDA version of the scs module, and it turns out I installed it correctly: the output tensor matches the matrix-multiplication-based code.

Therefore, I still have little clue how to fix the reproducibility issues. I did try switching to the newly added matching layer just for verification, and it did not work well either. (I find it slightly better than the scs-based matching layer, but still nowhere near the stated accuracy.)

I do suspect that data loading could be an issue, because I was using my custom loader, which is segment-based, since this repo's dataloader offers little clue about how to configure the txt files. But based on how the loader is written, I have to assume that by default this repo's loader also loads in TSN fashion. And since HMDB's video samples are really short, I don't think loading in segment mode could cause such low accuracy anyway.

I might spend a few more hours visualizing parts of the flow computation module, but I do have a question about whether the conv layers in the flow refinement layer really make sense: the refinement layer steps from 16 to 32 to 64 and then suddenly to 512 channels (if inserted at res3). I don't know if this is expected, and I suspect such a sudden dimensional increase could ruin the refinement results.

Also, regarding the temperature factor: in main_kinetics you commented out the line that increases the factor and left it fixed at 100. Is that how it is supposed to be?

arunos728 commented 2 years ago

Hello TimandXiyu, we provide the answer below.

  1. Data loading: We use the segment sampling strategy (TSN) for Something-Something V1&V2, but the dense frame sampling strategy for Kinetics and HMDB51. Since Kinetics videos (avg. 10s) are much longer than Something V1&V2 videos (avg. 4s), flow maps between consecutive frames can be unreliable under the segment sampling strategy, which takes bigger time intervals. Also, since the variation in HMDB51 video duration is large even within the same class, we recommend the dense frame sampling strategy for HMDB51. Different sampling strategies for different datasets are often used by recent methods (TSM, MViT) to account for the characteristics of each dataset. We follow the TSM repo for the detailed data preparation instructions (time interval = 8 frames for 25 fps videos). You can use dense frame sampling in train_TSM_Kinetics.sh (argument: mode=0).

  2. HMDB51: Since HMDB51 is too small for training our module from scratch, we use Kinetics pre-trained weights when training MSNet on HMDB51. This setting is described in our paper (Sec. 4.2), and it is also a common setting among methods (e.g. I3D, R(2+1)D, STM) that use ImageNet weights plus additional parameters. If you have trained the model using only ImageNet pre-trained weights, you cannot expect a good result for MSNet, since the additional parameters are hard to train that way. Furthermore, MSNet does not show a large gain on appearance-centric action datasets such as Kinetics (Table 2 in our paper), since our method focuses on extracting motion features. If you want to see a large gain with a simple setting, we recommend training MSNet-R18 on Something-V1 with ImageNet pre-trained weights.

  3. Flow refinement: The (64 -> 512 channels) step is intentional, to reduce the computational cost (FLOPs, # params). Considering that other modules (e.g. SENet) use a 16x expansion ratio, an 8x channel expansion ratio does not seem that strange for our module design. We have also verified that the module does not show a major accuracy gain when using 128 channels instead of 64.

  4. Temperature factor: The temperature factor is fixed at 100 in all experiments. The 'increase' term (the commented-out line) is leftover code from temperature scaling experiments that were not included in the paper. You don't have to worry about it.
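For clarity, the two sampling strategies from point 1 can be sketched as follows (an illustration only, not the repo's dataset code):

```python
import numpy as np

def segment_indices(num_frames, num_segments, rng):
    """TSN-style segment sampling: one random frame from each equal chunk,
    so the clip spans the whole video but adjacent frames can be far apart."""
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return np.array([int(rng.integers(edges[i], max(edges[i] + 1, edges[i + 1])))
                     for i in range(num_segments)])

def dense_indices(num_frames, num_segments, interval, rng):
    """Dense sampling: frames at a fixed interval from a random start,
    so adjacent sampled frames stay close in time (more reliable flow)."""
    span = (num_segments - 1) * interval
    start = int(rng.integers(0, max(1, num_frames - span)))
    return np.minimum(start + interval * np.arange(num_segments), num_frames - 1)
```

With segment sampling, the gap between adjacent sampled frames grows with video length, which is why flow maps become unreliable on long Kinetics clips; with dense sampling, the gap is fixed at the chosen interval regardless of video length.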

TimandXiyu commented 2 years ago


Thanks for your answer; I will try your suggested methods. Btw, I was indeed loading a Kinetics-pretrained ResNet50 instead of an ImageNet-pretrained one, so I think I had followed the paper's training scheme. (But, of course, the MSNet-module-related weights are initialized using this repo's code.) I appreciate your attention to this issue; closing now.

arunos728 commented 2 years ago

I think you have already understood, but what I meant was that you should use Kinetics pre-trained MSNet weights when training MSNet on HMDB51, not Kinetics pre-trained TSM (or TSN) weights.

TimandXiyu commented 2 years ago


Okay, thanks for the correction. Since pre-training could take days, I was not assuming that was the case, and was doing experiments on a smaller dataset to develop my code quickly. If those weights are somehow still on your server, maybe you could share them, since I believe other issues are asking about this too. Of course, if not, that's totally fine; I'll set up experiments on full Kinetics.

arunos728 commented 2 years ago

Also, if you want to see more gains on Kinetics, I recommend using the code of SELFY, which is an extension of this work introduced at ICCV 2021. The code structure is nearly the same as MSNet's, so you should feel comfortable using it.

TimandXiyu commented 2 years ago

Thanks very much for the pointer. It was my fault not to check out more recent works on this.