MiracleDance / PoseRAC

PoseRAC: Pose Saliency Transformer for Repetitive Action Counting
MIT License

Difference between `pose_train.csv` vs `video_train.csv` in annotation directory #7

Closed itzsid closed 1 year ago

itzsid commented 1 year ago

Hi,

I'm trying to understand the dataset and am wondering what the difference is between pose_train.csv and video_train.csv. pose_train.csv has 487 samples, compared to 758 samples in video_train.csv. Also, pose_train.csv has these categories:

['front_raise', 'pull_up', 'squat', 'bench_pressing', 'jump_jack',
       'situp', 'push_up', 'pommelhorse']

video_train has these categories:

['frontraise', 'pullups', 'squant', 'front_raise', 'bench_pressing',
       'jump_jack', 'situp', 'benchpressing', 'squat', 'pull_up',
       'push_up', 'jumpjacks', 'pushups', 'others', 'battle_rope',
       'pommelhorse']

So, how was the training subset in pose_train.csv selected?

MiracleDance commented 1 year ago

For problem 1: pose_train.csv has 487 samples as compared to 758 samples in video_train.csv

I explained this in Sec. 3 of my paper: the pose-level method does not need to predict the number of repetitions during training; it only learns the mapping between salient poses and actions. So we do not need to annotate every action event in the training set and can instead choose high-quality actions. As a result, the annotation cost is also lower than for video-level methods.

To sum up, it is fine to annotate all the actions that appear in a video, but it is not necessary.

We only use the keyframes where the salient poses are located for training, so even a subset of the actions is enough to train the network to learn this mapping. (For example, a video may contain 10 actions, but perhaps only 6 of them are needed for training.)

In the testing stage, we of course use all actions from the test set for a fair comparison.

For problem 2: different categories

Strictly speaking, the categories of pose_train.csv and video_train.csv are the same. That is to say, front_raise and frontraise are the same, jump_jack and jumpjacks are the same, pull_up and pullups are the same, etc.

The above is the explanation given by the authors of the RepCount dataset. Many people worked on the annotation together, so the category names ended up inconsistent.

When I annotated pose_train.csv, I simply merged these categories.
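As a rough illustration of such a merge (a sketch only, not the script actually used to build pose_train.csv), the variant names from the video_train.csv list above can be mapped onto the eight names used in pose_train.csv. The `ALIASES` table and the `canonical` helper are hypothetical, and treating `squant` as a misspelling of `squat` is an assumption.

```python
# Hypothetical alias table: variant spelling -> name used in pose_train.csv.
# Mapping "squant" to "squat" is an assumption (it looks like a typo).
ALIASES = {
    "frontraise": "front_raise",
    "pullups": "pull_up",
    "squant": "squat",
    "benchpressing": "bench_pressing",
    "jumpjacks": "jump_jack",
    "pushups": "push_up",
}

def canonical(name: str) -> str:
    """Return the merged category name; unknown names pass through unchanged."""
    return ALIASES.get(name, name)

# Category lists as quoted earlier in this thread.
video_categories = [
    "frontraise", "pullups", "squant", "front_raise", "bench_pressing",
    "jump_jack", "situp", "benchpressing", "squat", "pull_up",
    "push_up", "jumpjacks", "pushups", "others", "battle_rope",
    "pommelhorse",
]
pose_categories = [
    "front_raise", "pull_up", "squat", "bench_pressing", "jump_jack",
    "situp", "push_up", "pommelhorse",
]

merged = {canonical(c) for c in video_categories}
# Categories that remain in video_train.csv but not in pose_train.csv.
extra = merged - set(pose_categories)
print(sorted(extra))
```

After merging, the only categories left over relative to pose_train.csv are `battle_rope` and `others`.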

itzsid commented 1 year ago

Thanks for the quick reply @MiracleDance. How do you deal with the others category, which is part of the test set?

MiracleDance commented 1 year ago

In the test set, only one sample belongs to the others category, which is not representative, so it can simply be removed. Alternatively, just keep it; it will not be recognized as one of the eight categories we are dealing with.

In addition, the training set contains samples of the battle_rope category, but the test set contains none. These may be small flaws in the RepCount dataset, although the dataset is still very good. Therefore, the re-annotation in pose_train.csv has to take the overall information of the dataset into account.
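A minimal sketch of the "remove it directly" option, for readers who want to filter an annotation file down to the eight categories. Everything here is illustrative: the column names `type` and `name`, the file names, and the `keep_target_rows` helper are assumptions, not taken from the PoseRAC code.

```python
import csv
import io

# The eight categories PoseRAC handles on RepCount (as listed in this thread).
TARGET_CATEGORIES = frozenset({
    "front_raise", "pull_up", "push_up", "jump_jack",
    "pommelhorse", "squat", "situp", "bench_pressing",
})

def keep_target_rows(rows, category_field="type"):
    """Keep only rows whose category is one of the eight target categories."""
    return [row for row in rows if row[category_field] in TARGET_CATEGORIES]

# Made-up rows standing in for an annotation CSV; real files have more columns.
sample_csv = """type,name
situp,example_a.mp4
others,example_b.mp4
battle_rope,example_c.mp4
squat,example_d.mp4
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
kept = keep_target_rows(rows)
print([row["name"] for row in kept])
```

If the file still uses variant spellings such as `pullups`, the names would need to be merged first; otherwise those rows would be dropped as well.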

This task has not been extensively explored; a larger, more comprehensive dataset would probably help.

In general, for PoseRAC to solve the task of RepCount, the categories to be processed are:

front_raise, pull_up, push_up, jump_jack, pommelhorse, squat, situp, bench_pressing

itzsid commented 1 year ago

This makes sense. Thanks @MiracleDance.

itzsid commented 1 year ago

Apologies for re-opening this ticket. I have another question regarding the difference between pose_train.csv and video_train.csv: if I understand correctly, L1, L2, ... refer to the start and end locations. Is that correct? For the same file (example shown below), the values of L1, L2, ... are different. Why is there a difference?

In pose_train.csv:

175,situp,stu1_64.mp4,200,236,301,333,356,392,426,456,488,531,577,605,636,666,698,738,777,811,850,889,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

In video_train.csv:

254,situp,stu1_64.mp4,9,278,321,321,390,390,460,460,528,528,599,599,666,666,734,734,807,807,886,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

MiracleDance commented 1 year ago

Oh, the annotations in pose_train.csv and video_train.csv refer to different locations.

For the specific concepts, please refer to my paper.

I think Figure 2 in my paper already answers your question accurately. If you have further questions, please leave a message!

Here is Figure 2:

[image: Figure 2 of the PoseRAC paper]
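For anyone comparing the two files themselves, here is a small sketch that parses the two stu1_64.mp4 rows quoted in the question and prints their structure. The helper `frame_values` is hypothetical, the long run of trailing empty fields is shortened, and reading the first numeric field of the video_train.csv row as a repetition count is an assumption based on the row itself (9, followed by 18 values).

```python
def frame_values(row: str, skip: int = 3) -> list[int]:
    """Return the non-empty numeric fields after the first `skip` fields
    (index, category, file name in the rows quoted in this thread)."""
    fields = row.strip().split(",")[skip:]
    return [int(f) for f in fields if f != ""]

# Rows as quoted in the question, with the trailing empty fields shortened.
pose_row = ("175,situp,stu1_64.mp4,200,236,301,333,356,392,426,456,488,531,"
            "577,605,636,666,698,738,777,811,850,889" + "," * 10)
video_row = ("254,situp,stu1_64.mp4,9,278,321,321,390,390,460,460,528,528,"
             "599,599,666,666,734,734,807,807,886" + "," * 10)

pose = frame_values(pose_row)
video = frame_values(video_row)

# Assumption: the first numeric field of the video-level row is the count.
count, boundaries = video[0], video[1:]
segments = list(zip(boundaries[0::2], boundaries[1::2]))

# Under that assumption, each segment starts where the previous one ended.
contiguous = all(a[1] == b[0] for a, b in zip(segments, segments[1:]))

print(count, len(segments), contiguous)
print(len(pose), pose == sorted(set(pose)))
```

Under that reading, the video-level row holds 9 back-to-back (start, end) segments, while the pose-level row holds 20 distinct, strictly increasing frame indices with no shared boundaries, which fits the answer that the two files mark different locations.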

itzsid commented 1 year ago

Got it, thanks!