OpenGVLab / VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
https://arxiv.org/abs/2303.16727
MIT License

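Hello! I ran vit_b_k400_ft.sh from the V2 release, and the final test (final_test) takes about 20 hours, as shown below. I then ran final_test with VideoMAE and it takes about as long, but I remember testing taking only about two hours before. What is going on? Am I misremembering? I spent a whole afternoon modifying the V2 code and it is still like this, and I can't find the cause. #14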

Closed DragonWang-cell closed 1 year ago

DragonWang-cell commented 1 year ago

Test: [4430/9263] eta: 10:39:52 loss: 0.8962 (0.9741) acc1: 75.0000 (77.9959) acc5: 100.0000 (92.2083) time: 8.2166 (0.1927 -- 84.2684) data: 7.9541 (0.0002 -- 84.0091) max mem: 2736
Test: [4440/9263] eta: 10:39:05 loss: 0.8962 (0.9739) acc1: 75.0000 (78.0033) acc5: 100.0000 (92.2174) time: 8.7052 (0.2151 -- 64.5609) data: 8.4290 (0.0002 -- 64.2934) max mem: 2736
Test: [4450/9263] eta: 10:37:37 loss: 0.6310 (0.9729) acc1: 87.5000 (78.0162) acc5: 100.0000 (92.2293) time: 9.0514 (0.1833 -- 69.0573) data: 8.7904 (0.0002 -- 68.8211) max mem: 2736
Test: [4460/9263] eta: 10:35:15 loss: 0.6533 (0.9734) acc1: 75.0000 (78.0038) acc5: 100.0000 (92.2271) time: 4.6454 (0.1833 -- 69.0573) data: 4.4049 (0.0002 -- 68.8211) max mem: 2736
Test: [4470/9263] eta: 10:34:31 loss: 1.2881 (0.9741) acc1: 75.0000 (77.9719) acc5: 100.0000 (92.2249) time: 6.6842 (0.1785 -- 99.6060) data: 6.4549 (0.0001 -- 99.3840) max mem: 2736
Test: [4480/9263] eta: 10:33:26 loss: 0.8821 (0.9734) acc1: 75.0000 (77.9848) acc5: 100.0000 (92.2367) time: 10.2560 (0.1761 -- 99.6060) data: 10.0390 (0.0001 -- 99.3840) max mem: 2736
Test: [4490/9263] eta: 10:31:32 loss: 0.3921 (0.9726) acc1: 87.5000 (78.0004) acc5: 100.0000 (92.2456) time: 7.0091 (0.1761 -- 52.3157) data: 6.7684 (0.0001 -- 52.0523) max mem: 2736
Test: [4500/9263] eta: 10:29:47 loss: 0.3635 (0.9713) acc1: 87.5000 (78.0299) acc5: 100.0000 (92.2601) time: 5.0883 (0.1988 -- 45.4116) data: 4.8232 (0.0004 -- 45.1441) max mem: 2736
Test: [4510/9263] eta: 10:27:42 loss: 0.4875 (0.9712) acc1: 87.5000 (78.0232) acc5: 100.0000 (92.2606) time: 4.5622 (0.1988 -- 45.4116) data: 4.3024 (0.0002 -- 45.1441) max mem: 2736
Test: [4520/9263] eta: 10:26:07 loss: 0.9449 (0.9718) acc1: 62.5000 (78.0054) acc5: 100.0000 (92.2639) time: 5.0264 (0.1956 -- 34.2879) data: 4.7782 (0.0002 -- 33.9954) max mem: 2736
Test: [4530/9263] eta: 10:25:31 loss: 0.8519 (0.9710) acc1: 75.0000 (78.0236) acc5: 100.0000 (92.2754) time: 9.2248 (0.1768 -- 112.2758) data: 8.9004 (0.0001 -- 111.7937) max mem: 2736
Test: [4540/9263] eta: 10:23:28 loss: 0.2287 (0.9698) acc1: 100.0000 (78.0555) acc5: 100.0000 (92.2842) time: 7.8819 (0.1757 -- 112.2758) data: 7.4237 (0.0001 -- 111.7937) max mem: 2736
Test: [4550/9263] eta: 10:25:02 loss: 0.2287 (0.9689) acc1: 100.0000 (78.0653) acc5: 100.0000 (92.2984) time: 14.1884 (0.1757 -- 158.7735) data: 13.8075 (0.0001 -- 158.5454) max mem: 2736
Test: [4560/9263] eta: 10:23:18 loss: 0.6578 (0.9690) acc1: 75.0000 (78.0585) acc5: 100.0000 (92.3016) time: 15.1305 (0.1843 -- 158.7735) data: 14.9077 (0.0002 -- 158.5454) max mem: 2736

congee524 commented 1 year ago

Have you noticed that your data time looks abnormal?

data: 14.9077 (0.0002 -- 158.5454)

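This means that for this batch of data, reading the videos took 0.0002 s at the fastest and 158 s at the slowest, with an average of about 15 s. Take a look at why your video reading is stalling.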
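To make that reading of the log concrete, here is a minimal sketch that pulls the `time:` and `data:` fields out of one log entry and reports how much of the iteration time went to waiting on data. The helper names `parse_timing` and `data_fraction` are hypothetical and not part of VideoMAEv2. The regex only assumes the `value (min -- max)` layout visible in the log above, and it follows the reply in treating the leading value as an average.

```python
import re

# Matches the "time:" and "data:" fields exactly as they appear in the log:
#   "<name>: <value> (<min> -- <max>)"
FIELD = re.compile(r"(time|data): ([\d.]+) \(([\d.]+) -- ([\d.]+)\)")


def parse_timing(line):
    """Return {'time': (value, lo, hi), 'data': (value, lo, hi)} for one log entry."""
    return {
        name: (float(value), float(lo), float(hi))
        for name, value, lo, hi in FIELD.findall(line)
    }


def data_fraction(line):
    """Share of the reported iteration time that was spent loading data."""
    timing = parse_timing(line)
    return timing["data"][0] / timing["time"][0]


# Last entry from the log posted in this issue.
line = (
    "Test: [4560/9263] eta: 10:23:18 loss: 0.6578 (0.9690) "
    "acc1: 75.0000 (78.0585) acc5: 100.0000 (92.3016) "
    "time: 15.1305 (0.1843 -- 158.7735) "
    "data: 14.9077 (0.0002 -- 158.5454) max mem: 2736"
)

print(parse_timing(line)["data"])      # (14.9077, 0.0002, 158.5454)
print(round(data_fraction(line), 3))   # 0.985
```

For this entry roughly 98.5% of the iteration time is data loading, so the model forward pass is not the bottleneck.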

DragonWang-cell commented 1 year ago

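Yes, exactly. I spent the whole afternoon timing every module and checking the code from start to finish. Right after posting this I looked at htop and found a zombie process there. It does not show up in nvidia-smi, so I had overlooked it (so painful). I then ran final_test again and it was back to normal (sigh).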
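Since nvidia-smi only lists processes that hold a GPU context, a leftover CPU-side process is easier to spot from `ps` or htop. The sketch below is one hedged way to script that check on Linux. `parse_ps` and `leftover_processes` are hypothetical helper names, the `"python"` keyword is an assumption about what the stray process is called, and `ps -eo pid,stat,pcpu,comm` assumes the procps version of `ps`.

```python
import subprocess


def parse_ps(text, keyword="python"):
    """Parse `ps -eo pid,stat,pcpu,comm` output and keep rows whose command contains keyword."""
    rows = []
    for raw in text.strip().splitlines()[1:]:  # first line is the column header
        parts = raw.split(None, 3)
        if len(parts) < 4:
            continue  # skip malformed rows
        pid, stat, pcpu, comm = parts
        if keyword in comm:
            rows.append(
                {"pid": int(pid), "stat": stat, "cpu": float(pcpu), "comm": comm}
            )
    return rows


def leftover_processes(keyword="python"):
    """List live processes matching keyword, with their state and CPU usage."""
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,pcpu,comm"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_ps(out, keyword)
```

Calling `leftover_processes()` before starting final_test and looking for entries you did not launch is the intended use. In the `stat` column a leading `Z` marks a defunct (zombie) process and a leading `D` marks uninterruptible sleep, which usually means waiting on I/O.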