apple / ml-cvnets

CVNets: A library for training computer vision networks
https://apple.github.io/ml-cvnets

Inconsistent Results for MobileViT v1 Models #68

Open hzphzp opened 1 year ago

hzphzp commented 1 year ago

Dear Developers,

I have been running the latest code for the MobileViT v1 models, and I have noticed some inconsistencies between my results, the numbers reported in previous papers, and what I expect based on my understanding of the models.

First, when I run the MobileViT v1 x_small model with the latest code, I get a MACs value of 1028.243M, which differs significantly from the 0.7G reported in the original paper and the 0.9G cited in other papers. I have attached a screenshot of the result for your reference.

However, when I revert to the cvnets-v0.1 commit and run the MobileViT v1 x_small model again, I get a MACs value of 986.269M, which is more consistent with some references in the literature. I have also attached a screenshot of this result for your reference.

Second, I also observed that when I run the MobileViT v1 small model with the latest code, I get an accuracy of 77.47 on the ImageNet-1K dataset, which is lower than the 78.4 reported in the paper. I have not modified the model or the configuration, so I would like to know whether there have been any code changes affecting the MobileViT v1 models.

I would greatly appreciate an explanation for these inconsistencies and, if possible, a pointer to any code changes that have been made to the MobileViT v1 models.

Thank you for your time and assistance in this matter. I am looking forward to your response.

Best regards, Zhipeng

sacmehta commented 1 year ago

In the paper, FLOPs were reported at 224x224 input resolution. Are you using the same size?

Regarding accuracy: EMA is useful for MobileViT models (as noted in the paper, all MobileViT models use EMA). Are you evaluating models with EMA?
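For readers unfamiliar with EMA: it keeps a "shadow" copy of the weights, updated after each optimizer step as an exponential moving average, and that shadow copy is what gets evaluated. A minimal sketch of the update rule (illustrative only, not the CVNets implementation; `ema_update` and the dict-of-floats representation are assumptions for clarity):

```python
# Illustrative EMA weight update: s = decay * s + (1 - decay) * p.
# Real implementations (e.g. in PyTorch) do this in-place over tensors;
# plain floats are used here only to keep the sketch self-contained.
def ema_update(shadow, params, decay=0.9995):
    """Blend the current parameters into the shadow (EMA) copy."""
    return {k: decay * shadow[k] + (1.0 - decay) * params[k] for k in shadow}
```

With decay close to 1, the shadow weights change slowly and smooth out noise from late training steps, which is why evaluating the EMA weights often gives a slightly higher accuracy than the raw weights.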

hzphzp commented 1 year ago

> In the paper, FLOPs were reported at 224x224 input resolution. Are you using the same size?
>
> Regarding accuracy: EMA is useful for MobileViT models (as noted in paper, all MobileViT models are with EMA). Are you evaluating models with EMA?

Hi, thanks for your reply.

Regarding the FLOPs results, I understand that the discrepancy with the paper could be due to the different image size used for testing. However, I am still curious why the results differ between the cvnets-v0.1 and cvnets-v0.2 commits, since both results above were obtained with a 256x256 test image size.

Regarding the accuracy issue, I can confirm that I did use EMA when running the MobileViT v1 model, and the reported accuracy was the best EMA accuracy. I did not modify the original model or configuration, and I followed the default training command provided in the README.

sacmehta commented 1 year ago

Thanks for the clarification.

Regarding FLOPs: There was an error in computing FLOPs for MobileViT blocks in v0.1, because of which FLOPs were over-estimated. This was fixed in v0.2, which is why you observe different FLOPs between v0.1 and v0.2. See here.
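Profiler discrepancies like this usually come down to how per-layer multiply-accumulates are tallied. For reference, the textbook MAC count for a 2D convolution (this is the standard formula, not the CVNets profiler; the function name is illustrative) is one multiply-accumulate per output element, per kernel tap, per input channel in the group:

```python
def conv2d_macs(c_in, c_out, k, h_out, w_out, groups=1):
    """MACs for a k x k Conv2d: H_out * W_out * C_out * (C_in / groups) * k^2.

    A grouped (e.g. depthwise, groups == c_in) convolution sees only
    c_in // groups input channels per output channel, hence the division.
    """
    return h_out * w_out * c_out * (c_in // groups) * k * k
```

Double-counting a term like this inside a composite block (attention unfolding/folding in MobileViT, for instance) is the typical way a profiler over-estimates, which matches the v0.1 behavior described above.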

Regarding accuracy: Accuracy differences were noted when we migrated from OpenCV (v0.1) to PIL (v0.2). As in other works, a longer warm-up helped here. For an example configuration, see this config.

The following schedule is recommended with the variable batch sampler. Here, the learning rate and total epochs are the same as in the paper, but with a longer warm-up schedule. Hope this helps.

```yaml
scheduler:
  name: "cosine"
  is_iteration_based: false
  max_epochs: 300
  warmup_iterations: 20000 # longer warm-up
  warmup_init_lr: 0.0002
  cosine:
    max_lr: 0.002
    min_lr: 0.0002
```
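Read literally, this schedule ramps the learning rate linearly from `warmup_init_lr` to `max_lr` over the warm-up iterations, then follows cosine annealing down to `min_lr`. A rough sketch of the resulting LR curve (illustrative only; `lr_at` and `iters_per_epoch` are assumptions for the sketch, not CVNets values):

```python
import math

def lr_at(iteration, warmup_iterations=20000, warmup_init_lr=0.0002,
          max_lr=0.002, min_lr=0.0002, max_epochs=300, iters_per_epoch=2500):
    """LR at a given iteration: linear warm-up, then cosine annealing."""
    if iteration < warmup_iterations:
        # Linear ramp from warmup_init_lr up to max_lr.
        frac = iteration / warmup_iterations
        return warmup_init_lr + frac * (max_lr - warmup_init_lr)
    # Cosine decay from max_lr down to min_lr over the remaining iterations.
    total_iters = max_epochs * iters_per_epoch
    progress = (iteration - warmup_iterations) / max(1, total_iters - warmup_iterations)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The longer warm-up mainly stretches the initial ramp, giving the transformer blocks more low-LR steps before the peak rate is reached.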
hzphzp commented 1 year ago

Thank you for your reply. I found the bug: I was still using "imagenet_opencv" as the data I/O function.

CharlesPikachu commented 8 months ago

I was also re-implementing MobileViT in SSSegmentation and found that setting EMA is important, while the number of warm-up iterations seems less important (at least for segmentation).