CAMMA-public / SelfPose3d

Official code for "SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation"

Reproducing pose estimation results on subset of panoptic datasets #7

Open LadnerJonas opened 1 month ago

LadnerJonas commented 1 month ago

Currently, I am trying to apply this framework to our 4D-OR dataset (https://github.com/egeozsoy/4D-OR, TU Munich, Germany). After setting up the corresponding dataset files and adapting the projection logic (to match our camera calibration), we are having trouble getting the posenet training to improve the pose detection. Since poses are harder to estimate on our dataset, we tried to reproduce the strong pose estimation results on a subset of the Panoptic dataset.
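(For context, by "projection logic" I mean the world-to-image mapping that projects 3D joints into each camera view. A minimal pinhole sketch of the kind of function we adapted, ignoring the distortion coefficients that real calibrations also carry; names and shapes here are our assumptions, not the repo's API:)

```python
import numpy as np

def project_points(x_world: np.ndarray, R: np.ndarray, t: np.ndarray,
                   K: np.ndarray) -> np.ndarray:
    """Project (N, 3) world-space joints to (N, 2) pixel coordinates.

    Minimal pinhole model; the Panoptic and 4D-OR calibrations also carry
    distortion coefficients, which are omitted here for brevity.
    """
    x_cam = R @ x_world.T + t            # world -> camera frame, (3, N)
    x_img = K @ (x_cam / x_cam[2:3, :])  # perspective divide + intrinsics
    return x_img[:2].T                   # drop the homogeneous row
```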

I freshly cloned this repository, adapted the paths, and otherwise only changed the selected training and validation datasets to TRAIN_LIST = ["160224_haggling1"] and VAL_LIST = ["160906_pizza1"].
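For reference, this is the only code change I made; I am assuming here that the sequence lists sit at the top of the two Panoptic dataset files (lib/dataset/panoptic(ssv).py), as in VoxelPose-style code:

```python
# Restrict training/validation to a single Panoptic sequence each.
# Assumed location: the sequence lists in lib/dataset/panoptic(ssv).py.
TRAIN_LIST = ["160224_haggling1"]
VAL_LIST = ["160906_pizza1"]
```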

We used the default configuration (without a pre-trained backbone) for the backbone, rootnet, and posenet training steps. We did not run the optional fine-tuning.

Unfortunately, the results with the provided configuration were not as good as expected. The human root joints are detected fairly accurately, but the pose estimation training does not seem to work as expected. After the full training, the debug images still look like this (last epoch, ~2500 images in): train_2300_3d.png

Heatmap: train_00002300_view_2_hm_pred.png

Ground truth: train_00002300_view_2_gt.jpg

@keqizero In case you need more information, I am happy to provide it. Thank you!

keqizero commented 1 month ago

Hi, thank you for the subset experiment.

Personally, I have never tried training with only one video. As a quick experiment, I used the pretrained backbone and root net model (as provided in this repo) to train a pose net model from scratch, using only the pseudo 2D poses of the "160224_haggling1" sequence, and then evaluated on 4 videos. I have attached the training log of the first 3 epochs along with visualization examples. The results look okay.

validation_002_00000000_3d, validation_002_00000322_3d

cam5_posenet_2024-10-01-12-03_train.log

Could you compare my training log with yours to see what may be causing the problem? I hope it helps.

LadnerJonas commented 1 month ago

Thank you for your response.

Our configuration (printed at the start of the log) is exactly the same, apart from the batch_size / GPU count. Here is my training log: training-log.txt

It seems like the loss decreases much more slowly and does not improve after epoch 0.

Could you double-check that the GitHub repository code is up to date with your local changes?

Could you also share your heatmaps and attention maps (from the evaluation)? As can be seen above, with the repository code only the root joints are detected.

Do I have to do anything besides adapting the selected training/eval datasets? To my understanding, since the 2D pseudo poses are already stored and read in the two Panoptic dataset files (lib/dataset/panoptic(ssv).py), nothing else has to be done?

keqizero commented 1 month ago

> Our configuration (printed at the start of the log) is exactly the same, apart from the batch_size / GPU count. Here is my training log: training-log.txt
>
> It seems like the loss decreases much more slowly and does not improve after epoch 0.
>
> Could you double-check that the GitHub repository code is up to date with your local changes?
>
> Could you also share your heatmaps and attention maps (from the evaluation)? As can be seen above, with the repository code only the root joints are detected.
>
> Do I have to do anything besides adapting the selected training/eval datasets? To my understanding, since the 2D pseudo poses are already stored and read in the two Panoptic dataset files (lib/dataset/panoptic(ssv).py), nothing else has to be done?

I think the root cause is the underfitting of the backbone.

I looked at your log and saw that the root net's performance is surprisingly low, so I ran a quick experiment: training the root net for 1 epoch with only one video, I obtained a much better result (log attached). This could explain why your pose net does not even converge, as the backbone is not well trained to detect joints.
cam5_rootnet_2024-10-01-16-45_train.log

I would suggest that you replace your backbone with mine and then train the root net and pose net again, to see if you get similar performance.
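If it helps, swapping in released backbone weights before the root-net/pose-net stages can be done along these lines. This is only a sketch: the `backbone` attribute, the nested `state_dict` key, and the checkpoint layout are assumptions, not guarantees from this repo.

```python
import torch
from torch import nn

def load_pretrained_backbone(model: nn.Module, ckpt_path: str) -> None:
    """Load pretrained backbone weights into a VoxelPose-style model (sketch)."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("state_dict", ckpt)  # unwrap if the checkpoint is nested
    # strict=False tolerates keys belonging to other training stages
    model.backbone.load_state_dict(state, strict=False)
```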

FYI, the repo code is up to date. My heatmaps are attached below. For subset training, you don't need to do anything else except filter out the other 8 videos (which is what I did).
validation_00000200_view_1_hm_pred

LadnerJonas commented 1 month ago

Thank you for your response. In the meantime, I was able to reproduce it with comparable results.

LadnerJonas commented 1 month ago

My initial problem, the reason I wanted to reproduce the estimation results on the Panoptic data in the first place, was caused by these lines: https://github.com/CAMMA-public/SelfPose3d/blob/7ec0ba67ad860640dd26d76529996bac1b4eda0e/lib/models/cuboid_proposal_net_soft.py#L103-L105, which are incompatible with our own dataset/camera configuration.

keqizero commented 1 month ago

> My initial problem, the reason I wanted to reproduce the estimation results on the Panoptic data in the first place, was caused by these lines: https://github.com/CAMMA-public/SelfPose3d/blob/7ec0ba67ad860640dd26d76529996bac1b4eda0e/lib/models/cuboid_proposal_net_soft.py#L103-L105, which are incompatible with our own dataset/camera configuration.

Yes, you are right. These parameters are specific to the Panoptic dataset, similar to the 3D space size.
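For a different dataset such as 4D-OR, one option is to move such dataset-specific constants into the experiment config rather than hardcoding them. A constructor-only sketch of that pattern, assuming VoxelPose-style cfg keys (MULTI_PERSON.SPACE_SIZE etc.); the example values are Panoptic-like placeholders, not the actual values at the linked lines:

```python
from torch import nn

class CuboidProposalNetSoft(nn.Module):
    """Sketch: read the 3D space definition from the config, not from constants."""

    def __init__(self, cfg):
        super().__init__()
        # Dataset-specific 3D space definition, supplied by the YAML config
        self.grid_size = cfg.MULTI_PERSON.SPACE_SIZE        # e.g. [8000.0, 8000.0, 2000.0] mm
        self.grid_center = cfg.MULTI_PERSON.SPACE_CENTER    # e.g. [0.0, -500.0, 800.0] mm
        self.cube_size = cfg.MULTI_PERSON.INITIAL_CUBE_SIZE # voxel grid resolution
```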