facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

How to reproduce video recognition Acc in the Table? #18

Closed ypflll closed 6 months ago

ypflll commented 6 months ago

Thank you very much for sharing your great work! I tried to reproduce the video recognition results but got very low accuracy. Could you advise me on what I might have missed, or kindly provide a script that reproduces the Acc in the Table?

I tested the model with this script: jepa/evals/video_classification_frozen/eval.py, with the training-related code removed.

Model & config:
Encoder: vith16.pth.tar
Classifier: vith16-k400-probe.pth.tar
Config: vith16_k400_16x8x3

Example data, first 5 videos of the k400 val-set, label is "abseiling":
0wR5jVB-WPk_000417_000427.mp4
3caPS4FHFF8_000036_000046.mp4
3yaoNwz99xM_000062_000072.mp4
6IbvOJxXnOo_000047_000057.mp4
6_4kjPiQr7w_000191_000201.mp4

Result:
Index: 0, Predict: 198
Index: 1, Predict: 198
Index: 2, Predict: 198
Index: 3, Predict: 211
Index: 4, Predict: 198

Label "abseiling" should be 0, according to willprice/KINETICS_LABELS.md. So the predictions are all wrong?

ypflll commented 6 months ago

I did more tests. It seems that the label ID definition is different from willprice/KINETICS_LABELS.md. For example, I tested on five videos labeled 'zumba' (ID 399), and the model predictions are: 381 381 381 310 381. ID 381 is "washing hair".

MidoAssran commented 6 months ago

Yes, these class label definitions are different. I believe these are the correct definitions for the probe:

K400_CLASS_TEMPLATES = [
    '0 weaving_basket',
    '1 playing_drums',
    '2 catching_or_throwing_softball',
    '3 riding_unicycle',
    '4 robot_dancing',
    '5 eating_cake',
    '6 cleaning_toilet',
    '7 biking_through_snow',
    '8 bee_keeping',
    '9 playing_keyboard',
    '10 skiing_slalom',
    '11 balloon_blowing',
    '12 feeding_birds',
    '13 trimming_or_shaving_beard',
    '14 playing_trombone',
    '15 parasailing',
    '16 exercising_with_an_exercise_ball',
    '17 massaging_feet',
    '18 bending_back',
    '19 smoking_hookah',
    '20 salsa_dancing',
    '21 hopscotch',
    '22 windsurfing',
    '23 testifying',
    '24 washing_feet',
    '25 playing_clarinet',
    '26 golf_putting',
    '27 washing_hair',
    '28 swimming_breast_stroke',
    '29 pushing_car',
    '30 riding_mountain_bike',
    '31 playing_chess',
    '32 vault',
    '33 cooking_on_campfire',
    '34 catching_fish',
    '35 fixing_hair',
    '36 texting',
    '37 skipping_rope',
    '38 changing_oil',
    '39 brushing_teeth',
    '40 pushing_cart',
    '41 eating_watermelon',
    '42 kicking_field_goal',
    '43 playing_poker',
    '44 training_dog',
    '45 making_jewelry',
    '46 springboard_diving',
    '47 playing_bass_guitar',
    '48 cutting_nails',
    '49 making_bed',
    '50 driving_car',
    '51 catching_or_throwing_frisbee',
    '52 petting_cat',
    '53 cleaning_pool',
    '54 tossing_salad',
    '55 sled_dog_racing',
    '56 cleaning_gutters',
    '57 slapping',
    '58 swing_dancing',
    '59 making_a_sandwich',
    '60 taking_a_shower',
    '61 cleaning_shoes',
    '62 digging',
    '63 eating_carrots',
    '64 hitting_baseball',
    '65 using_computer',
    '66 playing_didgeridoo',
    '67 surfing_water',
    '68 headbutting',
    '69 getting_a_tattoo',
    '70 juggling_fire',
    '71 tobogganing',
    '72 playing_saxophone',
    '73 beatboxing',
    '74 tickling',
    '75 shredding_paper',
    '76 drop_kicking',
    '77 riding_a_bike',
    '78 triple_jump',
    '79 cheerleading',
    '80 eating_spaghetti',
    '81 mopping_floor',
    '82 scuba_diving',
    '83 capoeira',
    '84 swimming_butterfly_stroke',
    '85 using_remote_controller_(not_gaming)',
    '86 throwing_ball',
    '87 riding_mule',
    '88 feeding_fish',
    '89 dying_hair',
    '90 grooming_dog',
    '91 kissing',
    '92 snowboarding',
    '93 hurling_(sport)',
    '94 juggling_balls',
    '95 hula_hooping',
    '96 snorkeling',
    '97 playing_squash_or_racquetball',
    '98 filling_eyebrows',
    '99 arranging_flowers',
    '100 sanding_floor',
    '101 playing_cello',
    '102 sweeping_floor',
    '103 waiting_in_line',
    '104 feeding_goats',
    '105 dribbling_basketball',
    '106 tying_tie',
    '107 assembling_computer',
    '108 headbanging',
    '109 doing_laundry',
    '110 snowmobiling',
    '111 hugging',
    '112 running_on_treadmill',
    '113 tasting_beer',
    '114 spraying',
    '115 playing_harmonica',
    '116 petting_animal_(not_cat)',
    '117 slacklining',
    '118 pumping_fist',
    '119 watering_plants',
    '120 push_up',
    '121 massaging_legs',
    '122 making_pizza',
    '123 cleaning_floor',
    '124 marching',
    '125 peeling_potatoes',
    '126 unloading_truck',
    '127 climbing_a_rope',
    '128 mowing_lawn',
    '129 climbing_tree',
    '130 counting_money',
    '131 busking',
    '132 making_sushi',
    '133 eating_chips',
    '134 making_a_cake',
    '135 playing_trumpet',
    '136 rock_scissors_paper',
    '137 flying_kite',
    '138 giving_or_receiving_award',
    '139 high_kick',
    '140 cracking_neck',
    '141 waxing_chest',
    '142 ice_skating',
    '143 singing',
    '144 doing_nails',
    '145 bowling',
    '146 faceplanting',
    '147 skiing_crosscountry',
    '148 cutting_watermelon',
    '149 playing_recorder',
    '150 cleaning_windows',
    '151 answering_questions',
    '152 stretching_leg',
    '153 shearing_sheep',
    '154 breading_or_breadcrumbing',
    '155 massaging_back',
    '156 planting_trees',
    '157 bartending',
    '158 stomping_grapes',
    '159 gargling',
    '160 folding_napkins',
    '161 breakdancing',
    '162 bench_pressing',
    '163 situp',
    '164 celebrating',
    '165 playing_cricket',
    '166 auctioning',
    '167 squat',
    '168 bandaging',
    '169 writing',
    '170 dancing_macarena',
    '171 dining',
    '172 playing_volleyball',
    '173 archery',
    '174 hoverboarding',
    '175 ice_fishing',
    '176 bending_metal',
    '177 playing_paintball',
    '178 parkour',
    '179 tasting_food',
    '180 swinging_legs',
    '181 riding_scooter',
    '182 canoeing_or_kayaking',
    '183 welding',
    '184 applying_cream',
    '185 yoga',
    '186 throwing_axe',
    '187 eating_burger',
    '188 frying_vegetables',
    '189 playing_ice_hockey',
    '190 opening_bottle',
    '191 skateboarding',
    '192 dancing_charleston',
    '193 cartwheeling',
    '194 ironing',
    '195 deadlifting',
    '196 blowing_out_candles',
    '197 playing_cymbals',
    '198 abseiling',
    '199 bookbinding',
    '200 throwing_discus',
    '201 sticking_tongue_out',
    '202 water_sliding',
    '203 eating_ice_cream',
    '204 grinding_meat',
    '205 blasting_sand',
    '206 making_snowman',
    '207 making_tea',
    '208 finger_snapping',
    '209 wrestling',
    '210 snowkiting',
    '211 rock_climbing',
    '212 dunking_basketball',
    '213 tapping_pen',
    '214 shaking_head',
    '215 peeling_apples',
    '216 holding_snake',
    '217 playing_bagpipes',
    '218 eating_doughnuts',
    '219 smoking',
    '220 washing_hands',
    '221 curling_hair',
    '222 shoveling_snow',
    '223 playing_organ',
    '224 waxing_eyebrows',
    '225 checking_tires',
    '226 bouncing_on_trampoline',
    '227 clapping',
    '228 chopping_wood',
    '229 tying_knot_(not_on_a_tie)',
    '230 surfing_crowd',
    '231 tying_bow_tie',
    '232 sharpening_knives',
    '233 tapping_guitar',
    '234 driving_tractor',
    '235 playing_kickball',
    '236 strumming_guitar',
    '237 riding_camel',
    '238 kicking_soccer_ball',
    '239 playing_cards',
    '240 blowing_nose',
    '241 juggling_soccer_ball',
    '242 presenting_weather_forecast',
    '243 whistling',
    '244 punching_person_(boxing)',
    '245 braiding_hair',
    '246 dancing_gangnam_style',
    '247 clay_pottery_making',
    '248 baking_cookies',
    '249 pull_ups',
    '250 building_shed',
    '251 moving_furniture',
    '252 playing_monopoly',
    '253 drinking_shots',
    '254 egg_hunting',
    '255 jumpstyle_dancing',
    '256 contact_juggling',
    '257 milking_cow',
    '258 barbequing',
    '259 tai_chi',
    '260 building_cabinet',
    '261 playing_xylophone',
    '262 blowing_glass',
    '263 climbing_ladder',
    '264 drumming_fingers',
    '265 paragliding',
    '266 shooting_goal_(soccer)',
    '267 changing_wheel',
    '268 brush_painting',
    '269 playing_tennis',
    '270 arm_wrestling',
    '271 using_segway',
    '272 decorating_the_christmas_tree',
    '273 sign_language_interpreting',
    '274 roller_skating',
    '275 playing_basketball',
    '276 news_anchoring',
    '277 cooking_sausages',
    '278 cutting_pineapple',
    '279 pumping_gas',
    '280 pushing_wheelchair',
    '281 extinguishing_fire',
    '282 water_skiing',
    '283 bobsledding',
    '284 sneezing',
    '285 lunge',
    '286 walking_the_dog',
    '287 swimming_backstroke',
    '288 shaving_legs',
    '289 shining_shoes',
    '290 tossing_coin',
    '291 sniffing',
    '292 hurdling',
    '293 setting_table',
    '294 jogging',
    '295 swinging_on_something',
    '296 javelin_throw',
    '297 high_jump',
    '298 golf_chipping',
    '299 reading_newspaper',
    '300 somersaulting',
    '301 tap_dancing',
    '302 unboxing',
    '303 flipping_pancake',
    '304 sailing',
    '305 doing_aerobics',
    '306 playing_flute',
    '307 belly_dancing',
    '308 dodgeball',
    '309 laughing',
    '310 krumping',
    '311 skydiving',
    '312 playing_guitar',
    '313 sharpening_pencil',
    '314 wrapping_present',
    '315 carving_pumpkin',
    '316 clean_and_jerk',
    '317 side_kick',
    '318 hammer_throw',
    '319 golf_driving',
    '320 folding_clothes',
    '321 crawling_baby',
    '322 passing_American_football_(not_in_game)',
    '323 bungee_jumping',
    '324 riding_mechanical_bull',
    '325 air_drumming',
    '326 reading_book',
    '327 massaging_persons_head',
    '328 drinking_beer',
    '329 scrambling_eggs',
    '330 folding_paper',
    '331 playing_controller',
    '332 hockey_stop',
    '333 getting_a_haircut',
    '334 riding_elephant',
    '335 front_raises',
    '336 pole_vault',
    '337 crossing_river',
    '338 picking_fruit',
    '339 blowing_leaves',
    '340 gymnastics_tumbling',
    '341 shuffling_cards',
    '342 eating_hotdog',
    '343 crying',
    '344 jetskiing',
    '345 diving_cliff',
    '346 laying_bricks',
    '347 ski_jumping',
    '348 drinking',
    '349 riding_or_walking_with_horse',
    '350 passing_American_football_(in_game)',
    '351 skiing_(not_slalom_or_crosscountry)',
    '352 playing_badminton',
    '353 trimming_trees',
    '354 exercising_arm',
    '355 yawning',
    '356 cooking_egg',
    '357 kitesurfing',
    '358 washing_dishes',
    '359 shot_put',
    '360 garbage_collecting',
    '361 grooming_horse',
    '362 playing_harp',
    '363 jumping_into_pool',
    '364 drawing',
    '365 dancing_ballet',
    '366 shaving_head',
    '367 opening_present',
    '368 catching_or_throwing_baseball',
    '369 recording_music',
    '370 spray_painting',
    '371 knitting',
    '372 stretching_arm',
    '373 snatch_weight_lifting',
    '374 carrying_baby',
    '375 playing_ukulele',
    '376 punching_bag',
    '377 shooting_basketball',
    '378 spinning_poi',
    '379 waxing_legs',
    '380 long_jump',
    '381 zumba',
    '382 playing_piano',
    '383 playing_accordion',
    '384 shaking_hands',
    '385 applauding',
    '386 motorcycling',
    '387 disc_golfing',
    '388 baby_waking_up',
    '389 trapezing',
    '390 plastering',
    '391 cooking_chicken',
    '392 tango_dancing',
    '393 brushing_hair',
    '394 waxing_back',
    '395 playing_violin',
    '396 ripping_paper',
    '397 country_line_dancing',
    '398 sword_fighting',
    '399 ice_climbing',
]
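
As a minimal sketch of how to read predictions against this list, the entries can be parsed into an index-to-name lookup. The names `K400_EXCERPT` and `parse_templates` are illustrative, and only four entries are copied from the list above to keep the example short; in practice the full `K400_CLASS_TEMPLATES` would be passed in.

```python
# Illustrative excerpt: four entries copied from the list above.
K400_EXCERPT = [
    '198 abseiling',
    '211 rock_climbing',
    '310 krumping',
    '381 zumba',
]


def parse_templates(templates):
    """Turn 'index name' strings into a {index: name} dictionary."""
    id_to_name = {}
    for entry in templates:
        index, name = entry.split(' ', 1)  # split on the first space only
        id_to_name[int(index)] = name
    return id_to_name


id_to_name = parse_templates(K400_EXCERPT)

# Predictions reported earlier in this thread for five "abseiling" videos.
predictions = [198, 198, 198, 211, 198]
decoded = [id_to_name[p] for p in predictions]
print(decoded)
```

Under this mapping, four of the five "abseiling" predictions reported earlier decode to the expected class, and the fourth decodes to rock_climbing.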

jez-moxmo commented 6 months ago

Thank you, MidoAssran. The encoder dicts should be updated to include label references, as I had the same issue with mismatched labelling. Inference is working perfectly now.
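
One way to keep labels consistent, sketched here under assumptions: if ground-truth labels come from a different class ordering (for example an alphabetical reference list), they can be remapped to the probe's indices by class name before computing accuracy. The function names and the three-class orderings below are hypothetical toy data, not the real 400-class lists, and the name normalization (lowercasing, spaces to underscores) is an assumption about how two such lists might differ.

```python
def normalize(name):
    """Assumed normalization so 'rock climbing' matches 'rock_climbing'."""
    return name.strip().lower().replace(' ', '_')


def build_remap(reference_names, probe_names):
    """Map each reference class ID to the probe ID of the same class name."""
    probe_index = {normalize(n): i for i, n in enumerate(probe_names)}
    return {
        ref_id: probe_index[normalize(n)]
        for ref_id, n in enumerate(reference_names)
    }


def top1_accuracy(predictions, reference_labels, remap):
    """Fraction of predictions equal to the remapped ground-truth label."""
    correct = sum(
        pred == remap[label]
        for pred, label in zip(predictions, reference_labels)
    )
    return correct / len(predictions)


# Hypothetical toy orderings (three classes only).
reference_names = ['abseiling', 'rock climbing', 'zumba']  # dataset order
probe_names = ['zumba', 'abseiling', 'rock_climbing']      # probe order

remap = build_remap(reference_names, probe_names)

# Five clips whose dataset label is 0 ('abseiling' in the reference order).
reference_labels = [0, 0, 0, 0, 0]
predictions = [1, 1, 1, 2, 1]  # probe-order IDs output by the model
accuracy = top1_accuracy(predictions, reference_labels, remap)
print(remap, accuracy)
```

Comparing raw IDs without the remap would score every clip in this toy example as wrong, which is the kind of mismatch described in this thread.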

uniquezhengjie commented 6 months ago

@ypflll How did you modify "jepa/evals/video_classification_frozen/eval.py"? I used the same weights but got incorrect predictions.

ypflll commented 6 months ago

@uniquezhengjie I'll share a script later

uniquezhengjie commented 6 months ago

Thank you, looking forward to your reply

ypflll commented 6 months ago

pred.zip @uniquezhengjie

uniquezhengjie commented 6 months ago

I tested using the script you provided and obtained the following results:

Model & config:
Encoder: vith16-384.pth.tar
Classifier: vith16-384-k400-probe.pth.tar
Config: vith16_384-k400_16x8x3

Example data, first 5 videos of the k400 val-set, label is "abseiling":
0wR5jVB-WPk_000417_000427.mp4
3caPS4FHFF8_000036_000046.mp4
3yaoNwz99xM_000062_000072.mp4
6IbvOJxXnOo_000047_000057.mp4
6_4kjPiQr7w_000191_000201.mp4

Result:
Index: 0, Predict: 257
Index: 1, Predict: 268
Index: 2, Predict: 288
Index: 3, Predict: 11
Index: 4, Predict: 153

ypflll commented 6 months ago

Result: Index: 0, Predict: 198 Index: 1, Predict: 198 Index: 2, Predict: 198 Index: 3, Predict: 211 Index: 4, Predict: 198

It seems that the results are random. I wonder whether the pretrained models are loaded correctly.

yuh-zha commented 4 months ago

Hi @uniquezhengjie, have you figured out why the model outputs random predictions? I also got some results that do not make sense. Is there any trick to loading the pretrained model and attentive probe?

yichi-yang commented 4 months ago

I also got random predictions with the code on main. Reverting https://github.com/facebookresearch/jepa/commit/787b04ae6c573be587d6afccea5eca6b9fde9039 fixed the issue for me. I'd guess this linear layer https://github.com/facebookresearch/jepa/blob/787b04ae6c573be587d6afccea5eca6b9fde9039/src/models/utils/modules.py#L137 was never used when the probe was trained.
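
A guess like this can be checked by comparing the parameter names the model expects with the names stored in the checkpoint: any parameter the checkpoint lacks keeps its random initialization after loading. The sketch below does that comparison in plain Python. All key names are hypothetical and do not come from the repository, and stripping a `module.` prefix assumes the checkpoint was saved from a DistributedDataParallel-wrapped model. In PyTorch, `model.load_state_dict(state_dict, strict=False)` returns the same two lists as `missing_keys` and `unexpected_keys`.

```python
def strip_prefix(key, prefix='module.'):
    """Drop a wrapper prefix such as the one DistributedDataParallel adds."""
    return key[len(prefix):] if key.startswith(prefix) else key


def compare_keys(model_keys, checkpoint_keys, prefix='module.'):
    """Return (missing, unexpected) parameter names.

    missing:    expected by the model but absent from the checkpoint,
                so those weights would stay randomly initialized.
    unexpected: present in the checkpoint but unknown to the model.
    """
    model = set(model_keys)
    checkpoint = {strip_prefix(k, prefix) for k in checkpoint_keys}
    return sorted(model - checkpoint), sorted(checkpoint - model)


# Hypothetical key names, for illustration only.
model_keys = [
    'pooler.query',
    'pooler.proj.weight',  # stands in for a layer added after training
    'pooler.proj.bias',
    'linear.weight',
]
checkpoint_keys = ['module.pooler.query', 'module.linear.weight']

missing, unexpected = compare_keys(model_keys, checkpoint_keys)
print('missing:', missing)
print('unexpected:', unexpected)
```

A non-empty `missing` list for the probe would be consistent with the random predictions reported above, although it would not by itself prove that this layer is the cause.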