amazon-science / tubelet-transformer

This is an official implementation of TubeR: Tubelet Transformer for Video Action Detection
https://openaccess.thecvf.com/content/CVPR2022/supplemental/Zhao_TubeR_Tubelet_Transformer_CVPR_2022_supplemental.pdf
Apache License 2.0

Inference on AVA and JHMDB Needs Maintenance and Necessary Files #14

Open DanLuoNEU opened 1 year ago

DanLuoNEU commented 1 year ago

For the version I am using, AVA 2.1 inference needs several modifications:

  1. https://github.com/amazon-science/tubelet-transformer/blob/f610c97251e5539256095508570563ca2dc8c7a1/datasets/ava_frame.py#L135

    In the loadvideo function, the frame list should be built using the video name: video_frame_list = sorted(glob(video_frame_path + vid + '/*.jpg'))

  2. Change the annotation path here: https://github.com/amazon-science/tubelet-transformer/blob/f610c97251e5539256095508570563ca2dc8c7a1/evaluates/evaluate_ava.py#L36

  3. The fixes above reproduce the numbers listed in the README table, but TensorBoard still raises an EOFError. Add the following after https://github.com/amazon-science/tubelet-transformer/blob/f610c97251e5539256095508570563ca2dc8c7a1/eval_tuber_ava.py#L48 so the writer is closed on the rank-0 process:

if cfg.DDP_CONFIG.GPU_WORLD_RANK == 0:
    writer.close()
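The frame-listing fix in step 1 can be sketched as a standalone helper (the function name `load_frame_list` is hypothetical, and `video_frame_path` is assumed to end with a path separator, as in the snippet above):

```python
from glob import glob

def load_frame_list(video_frame_path, vid):
    # Mirrors the loadvideo fix above: collect this video's frames
    # by the video name, and sort them explicitly, because glob()
    # does not guarantee any ordering and the clip must be read in
    # temporal order.
    return sorted(glob(video_frame_path + vid + '/*.jpg'))
```

The explicit `sorted()` matters: on most filesystems `glob()` returns entries in arbitrary order, so without it the frames of a clip can be shuffled.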

AVA 2.2 inference results:

per_class [0.49119732        nan 0.32108856 0.58690862 0.1453127  0.25250868
 0.05269343 0.55119903 0.47336599 0.58118356 0.83511073 0.85809156
 0.4264426  0.79215918 0.7533182         nan 0.61339698        nan
        nan 0.04726829        nan 0.16529978        nan 0.23965087
        nan 0.04494236 0.306021   0.55275175 0.36725148 0.07057226
        nan        nan        nan 0.12159738        nan 0.03173127
 0.02196539 0.2641557         nan        nan 0.67544085        nan
 0.00367732        nan 0.01473403 0.03833153 0.03002702 0.37160171
 0.53368705        nan 0.21649021 0.1374056         nan 0.29578147
        nan 0.03978733 0.10253565 0.03219929 0.33915299 0.01752664
 0.28362901 0.3223239  0.14873739 0.52285939 0.14770317 0.11950478
 0.44886859 0.17733113 0.06789831 0.27917222        nan 0.46795067
 0.06238106 0.71983267        nan 0.05018591 0.31590126 0.09531384
 0.8376019  0.70844574]
{'PascalBoxes_Precision/mAP@0.5IOU': 0.30985340450933535, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/bend/bow (at the waist)': 0.4911973183134509, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/crouch/kneel': 0.3210885611841083, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/dance': 0.5869086163647963, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/fall down': 0.14531270272554303, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/get up': 0.25250867821227696, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/jump/leap': 0.05269343043207558, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/lie/sleep': 0.5511990313327797, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/martial art': 0.47336599427812304, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/run/jog': 0.5811835550049768, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/sit': 0.8351107282724392, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/stand': 0.8580915605931295, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/swim': 0.42644259946642094, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/walk': 0.7921591772441756, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/answer phone': 0.7533181965878357, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/carry/hold (an object)': 0.613396976906247, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/climb (e.g., a mountain)': 0.047268291513739374, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/close (e.g., a door, a box)': 0.16529978105316412, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/cut': 0.239650870599096, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/dress/put on clothing': 0.04494235744272522, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/drink': 0.30602100382076136, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/drive (e.g., a car, a truck)': 0.5527517520577403, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/eat': 0.3672514840844659, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/enter': 0.07057225556756908, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hit (an object)': 0.12159737681929804, 
'PascalBoxes_PerformanceByCategory/AP@0.5IOU/lift/pick up': 0.03173127096825363, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/listen (e.g., to music)': 0.021965385905557883, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/open (e.g., a window, a car door)': 0.2641556990694153, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/play musical instrument': 0.6754408509957595, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/point to (an object)': 0.0036773150722066972, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/pull (an object)': 0.01473402768023624, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/push (an object)': 0.038331529680086275, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/put down': 0.03002701544153771, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/read': 0.3716017145811048, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/ride (e.g., a bike, a car, a horse)': 0.5336870531261757, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/sail boat': 0.21649020512834088, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/shoot': 0.13740559748226708, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/smoke': 0.2957814682780021, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/take a photo': 0.03978732762876234, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/text on/look at a cellphone': 0.10253564997258985, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/throw': 0.03219929211064902, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/touch (an object)': 0.33915299353156436, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/turn (e.g., a screwdriver)': 0.017526643108955034, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/watch (e.g., TV)': 0.28362901476702795, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/work on a computer': 0.322323903124391, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/write': 0.1487373880589133, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/fight/hit (a person)': 0.5228593870747025, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/give/serve (an object) to (a person)': 0.14770317484649234, 
'PascalBoxes_PerformanceByCategory/AP@0.5IOU/grab (a person)': 0.11950477963584528, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hand clap': 0.44886858836133026, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hand shake': 0.17733112595251085, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hand wave': 0.06789830556787521, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hug (a person)': 0.27917221591712854, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/kiss (a person)': 0.4679506698404774, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/lift (a person)': 0.062381058259554645, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/listen to (a person)': 0.7198326661128859, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/push (another person)': 0.050185914377705816, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/sing to (e.g., self, a person, a group)': 0.31590125934914154, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/take (an object) from (a person)': 0.09531383956904724, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/talk to (e.g., self, a person, a group)': 0.8376018955287321, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/watch (a person)': 0.7084457445779531}
mAP: 0.30985
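As far as I can tell, the overall mAP above is the mean of the per-class APs with the nan entries (classes excluded from the AVA evaluation subset) ignored. A quick sanity check with NumPy (using only the first few per-class values from the dump above):

```python
import numpy as np

# First few per-class APs from the dump above; nan marks classes
# excluded from the evaluation subset.
per_class = np.array([0.49119732, np.nan, 0.32108856, 0.58690862])

# np.nanmean averages only the valid (non-nan) entries.
mean_ap = np.nanmean(per_class)
```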
DanLuoNEU commented 1 year ago

For JHMDB Inference

To avoid a shape-mismatch error when loading the pretrained DETR weights, modify the checkpoint-loading code so the query_embed weights match the built model's dimensions: https://github.com/amazon-science/tubelet-transformer/blob/f610c97251e5539256095508570563ca2dc8c7a1/utils/model_utils.py#L25

Replace pretrained_dict.update({k: v[:query_size]}) with:

if query_size == model.module.query_embed.weight.shape[0]: continue
if v.shape[0] < model.module.query_embed.weight.shape[0]:  # in case the pretrained model does not align
    query_embed_zeros = torch.zeros(model.module.query_embed.weight.shape)
    pretrained_dict.update({k: query_embed_zeros})
else:
    pretrained_dict.update({k: v[:model.module.query_embed.weight.shape[0]]})
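The alignment logic above can be illustrated on plain arrays (NumPy here for brevity; the real code operates on torch tensors, and the helper name `align_query_embed` is hypothetical):

```python
import numpy as np

def align_query_embed(pretrained_w, target_shape):
    # Illustrative version of the fix above:
    # - same number of queries: keep the pretrained weights as-is
    # - fewer queries in the checkpoint than the model expects:
    #   fall back to a zero-initialized embedding
    # - more queries: truncate to the model's query count
    if pretrained_w.shape[0] == target_shape[0]:
        return pretrained_w
    if pretrained_w.shape[0] < target_shape[0]:
        return np.zeros(target_shape)
    return pretrained_w[:target_shape[0]]
```

Zero-initializing when the checkpoint is too small simply discards the pretrained queries rather than partially reusing them, which matches the snippet above.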

I got a slightly different mAP from the one in the README table:

per_class [0.96529908 0.4870422  0.81740977 0.64671594 0.99981187 0.48678173
 0.72522214 0.70157535 0.99132313 0.99332738 0.92539198 0.63780982
 0.6607778  0.89695387 0.78694818 0.42965094 0.26324953 0.94429166
 0.27346689 0.68134081 0.87238637        nan        nan        nan]
{'PascalBoxes_Precision/mAP@0.5IOU': 0.7231798302410739, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Basketball': 0.9652990848728149, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/BasketballDunk': 0.4870421987013735, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Biking': 0.8174097664543525, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/CliffDiving': 0.6467159401389935, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/CricketBowling': 0.9998118686054533, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Diving': 0.48678173366600064, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Fencing': 0.7252221388068574, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/FloorGymnastics': 0.7015753486207187, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/GolfSwing': 0.9913231289322941, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/HorseRiding': 0.9933273801597415, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/IceDancing': 0.9253919821730238, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/LongJump': 0.637809816668955, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/PoleVault': 0.6607777957457814, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/RopeClimbing': 0.8969538737505489, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/SalsaSpin': 0.7869481765834933, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/SkateBoarding': 0.42965094009542815, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Skiing': 0.26324952994810963, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Skijet': 0.9442916605769802, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/SoccerJuggling': 0.27346688938240526, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Surfing': 0.681340807090747, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/TennisSwing': 0.8723863740884812, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/TrampolineJumping': nan, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/VolleyballSpiking': nan, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/WalkingWithDog': nan}
mAP: 0.72318
CKK-coder commented 1 year ago

> [quotes the JHMDB fix and mAP results from the previous comment]

Thank you for your correction. Did you find any code for video-mAP inference? I want to reproduce the video-mAP on UCF101-24.

FransHk commented 1 month ago

Thanks for taking the time to write this; it helped me greatly. It's a shame that the codebase for this model is such a mess as-is.