dominickrei / pi-vit

[CVPR 2024] Code and models for pi-ViT, a video transformer for understanding activities of daily living

Preparing CSVs #2

Open emucllari1 opened 2 months ago

emucllari1 commented 2 months ago

Questions about the Preparing CSVs step (from the README):

  1. Are path_to_video_1 and path_to_video_1_skeleton relative to the Smarthome dataset root directory? Should path_to_video_1 be formatted as Smarthome/mp4/[mp4_file_name], and path_to_video_1_skeleton as Smarthome/skeletonv12/[json_or_h5_file_name]?
  2. How are label_1 and the subsequent labels determined? Is there a mapping file, or a specific naming convention for these labels based on the name of each .mp4 or .json file, like Cook.Cleandishes_p02_r00_v02_c03 and so on?

Thank you for your help!

dominickrei commented 2 months ago

Hi @emucllari1, thank you for the interest!

Are path_to_video_1 and path_to_video_1_skeleton relative to the Smarthome dataset root directory? Should path_to_video_1 be formatted as Smarthome/mp4/[mp4_file_name], and path_to_video_1_skeleton as Smarthome/skeletonv12/[json_or_h5_file_name]?

The video paths and skeleton paths should point to the mp4 and skeleton files provided in the Toyota Smarthome Trimmed dataset. E.g., path_to_video_1 = /path/to/smarthome/mp4/WatchTV_p25_r15_v01_c04.mp4 and, correspondingly, path_to_video_1_skeleton = /path/to/smarthome/skeletons/WatchTV_p25_r15_v01_c04_pose3d.json. The CSV does not need to contain the .h5 files.
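To make the path convention concrete, a CSV row can be assembled with a small helper like the one below. This is a minimal sketch, not the repository's own code: SMARTHOME_ROOT, the function name, and the label_map argument are illustrative assumptions.

```python
import os

# Hypothetical dataset location -- adjust to wherever Toyota Smarthome Trimmed is extracted.
SMARTHOME_ROOT = "/path/to/smarthome"

def make_csv_rows(video_names, label_map):
    """Build (video_path, skeleton_path, label) rows for the training CSV.

    video_names: sample names like "WatchTV_p25_r15_v01_c04".
    label_map: dict mapping a class name (the part before the first '_') to an int label.
    """
    rows = []
    for name in video_names:
        video_path = os.path.join(SMARTHOME_ROOT, "mp4", name + ".mp4")
        skeleton_path = os.path.join(SMARTHOME_ROOT, "skeletons", name + "_pose3d.json")
        label = label_map[name.split("_")[0]]  # class name precedes the first underscore
        rows.append((video_path, skeleton_path, label))
    return rows
```

Each returned tuple corresponds to one line of the CSV described in the README.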

How are label_1 and the subsequent labels determined? Is there a mapping file, or a specific naming convention for these labels based on the name of each .mp4 or .json file, like Cook.Cleandishes_p02_r00_v02_c03 and so on?

These labels are determined by the classes in the Toyota Smarthome Trimmed dataset. You can generate these by assigning a unique integer label to each class in the dataset. For convenience I provide the mappings I use below.

cs_label_mappings.csv cv_label_mappings.csv
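The "assign a unique integer label to each class" step can be sketched as follows. Note this is only the idea: the attached mapping files are authoritative, and the sorted ordering here is an assumption for determinism, not necessarily the ordering used in those files.

```python
def build_label_map(video_names):
    """Assign a unique integer to each class.

    The class name is the part of a sample name before the first underscore,
    e.g. "Cook.Cleandishes" in "Cook.Cleandishes_p02_r00_v02_c03".
    Classes are sorted so the mapping is deterministic across runs.
    """
    classes = sorted({name.split("_")[0] for name in video_names})
    return {cls: i for i, cls in enumerate(classes)}
```

Running this over the full list of trimmed sample names yields one integer per class, which can then be written out as a mapping CSV.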

emucllari1 commented 2 months ago

Hi @dominickrei, thank you for your assistance! I understand the idea behind cv_label_mappings.csv, but some classes in my data, such as Cook.Cleandishes and Maketea.Boilwater, are not in the .csv file. Would it be correct to derive these labels from the data files myself and then split the data into 70% train.csv, 15% test.csv, and 15% val.csv? Also, could you check cs_label_mappings.csv? It currently says "Not found".

dominickrei commented 2 months ago

Sorry about that, I updated my previous comment and fixed the download.

The reason some labels are missing from cv_label_mappings.csv is that the cross-view split of the data does not include all of the classes (some classes do not occur in the views used in the cross-view splits). If you check cs_label_mappings.csv, all of the classes are included.

As for splitting the data, you should follow the train/val/test splits proposed in the Toyota Smarthome: Real-World Activities of Daily Living paper (see Section 3.1).

emucllari1 commented 2 months ago

Thank you!

emucllari1 commented 2 months ago

Hi @dominickrei, thank you for your help! I just need to clarify one point. Section 3.1 of the paper "Toyota Smarthome: Real-World Activities of Daily Living" says: "Cross-subject evaluation In cross-subject (CS) evaluation, we split the 18 subjects into training and testing groups. In order to balance the number of videos for each category of activity in both training and testing, the training group consists of 11 subjects with IDs: 3, 4, 6, 7, 9, 12, 13, 15, 17, 19, 25. The remaining 7 subjects are reserved for testing." Does this mean that cross-subject evaluation on Toyota Smarthome uses only train and test data, with no validation data? And in Cook.Cleandishes_p02_r00_v02_c03.json, does p02 indicate the subject with ID 2? Am I correct?

dominickrei commented 2 months ago

Hi @emucllari1, for validation 10% of the training set was randomly sampled and used as validation (see Section 5.2 of the Toyota Smarthome paper). Also you are correct that p02 identifies the subject with ID 2.
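The protocol described above (test = held-out subjects, validation = a random 10% of the training samples) can be sketched like this. The training subject IDs come from the quoted Section 3.1 of the Smarthome paper; the function names, seed, and in-place shuffle are my own illustrative choices.

```python
import random
import re

# Training subject IDs from the Toyota Smarthome paper, Section 3.1.
TRAIN_SUBJECTS = {3, 4, 6, 7, 9, 12, 13, 15, 17, 19, 25}

def subject_id(name):
    """Extract the subject ID from a sample name like Cook.Cleandishes_p02_r00_v02_c03."""
    return int(re.search(r"_p(\d+)_", name).group(1))

def cross_subject_split(names, val_frac=0.1, seed=0):
    """Return (train, val, test): test = held-out subjects, val = random 10% of train."""
    train = [n for n in names if subject_id(n) in TRAIN_SUBJECTS]
    test = [n for n in names if subject_id(n) not in TRAIN_SUBJECTS]
    rng = random.Random(seed)  # fixed seed so the val split is reproducible
    rng.shuffle(train)
    n_val = int(len(train) * val_frac)
    return train[n_val:], train[:n_val], test
```

So p02 samples land in the test split (subject 2 is not in the training group), while p03 samples land in train/val.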

emucllari1 commented 2 months ago

Thank you!

basavaraj-hampiholi commented 1 month ago

Dear Dominick,

I have a couple of queries.

  1. For the train/val set: did you follow https://github.com/srijandas07/i3d_smarthome/blob/main/makecsv.py to segment each mp4 video sample into multiple clips? For example, the Cleandishes_p02_r00_v16_c06 video contains multiple instances like the ones below:

Cook.Cleandishes_p02_r00_v16_c06,0,128
Cook.Cleandishes_p02_r00_v16_c06,128,256
Cook.Cleandishes_p02_r00_v16_c06,256,384
Cook.Cleandishes_p02_r00_v16_c06,384,512
Cook.Cleandishes_p02_r00_v16_c06,512,640
Cook.Cleandishes_p02_r00_v16_c06,640,768

  2. Human region cropping: I am using the SSD detector from PyTorch (https://pytorch.org/hub/nvidia_deeplearningexamples_ssd/). The detection results are not very impressive. Could you please let me know which code (e.g., a Git repo) you referred to for detection?

Thanks in advance for your response.

Best, Basavaraj

dominickrei commented 1 month ago

Hi @basavaraj-hampiholi, thank you for your interest! To answer your questions:

  1. We do not use this code to segment the videos of Toyota Smarthome. If you download the trimmed version of the dataset, the videos will already be segmented and you will not need to process them.
  2. We suggest you use the newer YOLO models to do the cropping, here is an example using YOLOv8: https://dev.to/irubtsov/object-tracking-and-video-cropping-with-computer-vision-and-machine-learning-3ge2
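As a rough sketch of the cropping step: the geometry (expanding a detected person box by a margin and clipping it to the frame) is shown below as a plain helper, with the YOLOv8 detection call indicated only in comments since it needs the ultralytics package. Everything here is an illustrative assumption, not the authors' pipeline.

```python
def expand_and_clip(box, frame_w, frame_h, margin=0.1):
    """Expand an (x1, y1, x2, y2) person box by a relative margin and clip to the frame."""
    x1, y1, x2, y2 = box
    mw, mh = (x2 - x1) * margin, (y2 - y1) * margin
    return (max(0, int(x1 - mw)), max(0, int(y1 - mh)),
            min(frame_w, int(x2 + mw)), min(frame_h, int(y2 + mh)))

# With ultralytics installed, detection looks roughly like:
#   from ultralytics import YOLO
#   model = YOLO("yolov8n.pt")
#   results = model(frame, classes=[0])       # class 0 = "person" in COCO
#   box = results[0].boxes.xyxy[0].tolist()   # (x1, y1, x2, y2)
#   x1, y1, x2, y2 = expand_and_clip(box, frame.shape[1], frame.shape[0])
#   crop = frame[y1:y2, x1:x2]
```

The margin keeps the full body in the crop even when the detector's box is tight.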

basavaraj-hampiholi commented 3 weeks ago

Dear @dominickrei ,

Thank you very much for your response.

  1. May I know whether you used the test labels from https://github.com/srijandas07/i3d_smarthome/blob/main/labels/test_Labels_CS.csv? If we use this file instead of https://github.com/srijandas07/i3d_smarthome/blob/main/splits/test_CS.txt, the test accuracy differs. It would be helpful if you could clarify this.

Best Regards, Basavaraj

dominickrei commented 3 weeks ago

Hi @basavaraj-hampiholi, we did not take the test labels from either of those sources and instead generated the train/test labels using our own code (I have uploaded it here). This should generate something similar to the second file you shared.

The first file seems to contain repeated samples (e.g., Cook.Cleandishes_p02_r00_v16_c06 repeats on lines 5-16), which is likely an artifact of creating the trimmed dataset from the raw videos. This is the reason for the discrepancy in testing performance, since some samples contribute more to the accuracy than others.
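If you want to sanity-check a labels file for this kind of repetition, a minimal dedup pass looks like the following (the row format `name,label` is assumed for illustration):

```python
def dedupe_rows(rows):
    """Drop exact duplicate rows, keeping the first occurrence and preserving order."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row)  # rows are lists; tuples are hashable
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```

Comparing `len(rows)` before and after quickly reveals whether a test list contains duplicated samples that would skew accuracy.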