Arthur151 / ROMP

Monocular, One-stage, Regression of Multiple 3D People and their 3D positions & trajectories in camera & global coordinates. ROMP[ICCV21], BEV[CVPR22], TRACE[CVPR2023]
https://www.yusun.work/
Apache License 2.0

Reproduce Result #121

Closed: panshaohua closed this issue 2 years ago

panshaohua commented 2 years ago

Hi, thanks for your work. Could you share your training log? I can't reproduce the paper's results. Here are my log file and yaml file; I only changed the batch size to 48 because of my GPU memory, and everything else is left at the default. hrnet_cm64_V1_hrnet.log hrnet_cm64_V1_hrnet_yml.log

Arthur151 commented 2 years ago

V1_hrnet_nopretrain_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1,2,3.log Sorry for the late reply, I just got back to work. Here is the log for 80 epochs. Feel free to discuss any problems with me.

Arthur151 commented 2 years ago

@panshaohua What numbers did you get? I noticed that you didn't evaluate the results. The log above is from training from scratch. The results would be much easier to reproduce if you use the pre-trained backbone that I provided.

panshaohua commented 2 years ago

Thank you for your prompt reply. I found that my loss is smaller than yours. Is something else wrong? V1_hrnet_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1,2,3.log

Arthur151 commented 2 years ago

It seems that your model gets its best result at line 2733 of your log:

['Evaluation'] on local_rank 0

| DS/EM | MPJPE | PA_MPJPE |
| --- | --- | --- |
| pw3d_vibe | 90.69 | 52.68 |

This PA-MPJPE is even better than what I report in the paper. The huge gap between MPJPE and PA-MPJPE reflects the domain gap between the training and evaluation datasets, such as different skeleton definitions, different camera definitions, etc. MPJPE is easily affected by this domain gap.

After that, the performance gets worse and worse as training goes on. This may indicate that the model does not converge properly when trained for more epochs with batch size 48.

My log is for training without any pre-training, in other words, from scratch, so the loss is relatively large.
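For concreteness, here is a minimal numpy sketch of the two metrics (illustrative only, not the repository's evaluation code): MPJPE measures the joint error directly in camera coordinates, while PA-MPJPE first removes global scale, rotation, and translation with a Procrustes alignment, so mismatches in skeleton or camera definition barely affect it.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error in the gt (camera) frame; pred/gt are (J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after a similarity (Procrustes) alignment of pred onto gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g                # center both skeletons
    u, s, vt = np.linalg.svd(p.T @ g)            # optimal rotation via SVD
    if np.linalg.det(vt.T @ u.T) < 0:            # avoid reflections
        vt[-1] *= -1
        s[-1] *= -1
    r = vt.T @ u.T
    scale = s.sum() / (p ** 2).sum()             # optimal isotropic scale
    aligned = scale * p @ r.T + mu_g
    return np.linalg.norm(aligned - gt, axis=-1).mean()

# Toy check: a rotated and shifted prediction has a nonzero MPJPE
# but a PA-MPJPE of ~0, since the alignment removes the global transform.
gt = np.random.rand(24, 3)
c, si = np.cos(np.pi / 6), np.sin(np.pi / 6)
rot = np.array([[c, -si, 0.0], [si, c, 0.0], [0.0, 0.0, 1.0]])
pred = gt @ rot.T + 0.1
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```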

panshaohua commented 2 years ago

Thanks for your nice response. It's very helpful to me.

Arthur151 commented 2 years ago

So, 5 stars for my service, may I? Glad that you like ROMP.

panshaohua commented 2 years ago

Haha, that's kind of you. Nice work!

Arthur151 commented 2 years ago

Come and discuss with me any time if you like ROMP. Live long and prosper~ :)

ZhengdiYu commented 2 years ago

> It seems that your model gets its best result at line 2733 of your log:
>
> ['Evaluation'] on local_rank 0
>
> | DS/EM | MPJPE | PA_MPJPE |
> | --- | --- | --- |
> | pw3d_vibe | 90.69 | 52.68 |
>
> This PA-MPJPE is even better than what I report in the paper. The huge gap between MPJPE and PA-MPJPE reflects the domain gap between the training and evaluation datasets, such as different skeleton definitions, different camera definitions, etc. MPJPE is easily affected by this domain gap.
>
> After that, the performance gets worse and worse as training goes on. This may indicate that the model does not converge properly when trained for more epochs with batch size 48.
>
> My log is for training without any pre-training, in other words, from scratch, so the loss is relatively large.

Hi, I was wondering whether you meant that it was trained even without using 'trained_models/pretrain_hrnet.pkl'? I think 'trained_models/pretrain_hrnet.pkl' is pretrained on 2D pose estimation, rather than being initial weights from ImageNet. Is this correct?

I have trained a model using v1.yml, in other words, using 'trained_models/pretrain_hrnet.pkl', and got results similar to yours. It seems that I can't reproduce the MPJPE in the paper, which is about 85, and it seems neither of your runs reproduced it either.

Arthur151 commented 2 years ago

Neither of our logs used the pre-trained model. Please note that the best results are very likely not achieved by the model from the last epoch.

Yes, the pre-trained model is pretrained on 2D pose estimation and detection datasets.

To get the results in the paper, you may need more epochs of training. My posted log only covers 80 epochs and didn't use the pre-trained model. Please try to train for more epochs; you will find that the model performs much better with more training. Convergence is relatively slow because, unlike 2D heatmap learning that directly supervises the whole map, we only take a few points from the parameter maps for supervision. Influenced by some uncertainty in training, you may need more time.
Besides, please try to reduce the shuffle_crop_ratio_3d in v1.yml and add more middle-scale augmentation if you can.
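As an illustration of what "more middle-scale augmentation" could mean (a hypothetical sketch, not the augmentation code used in ROMP; the scale range is a guess), one can randomly shrink the input so people occupy a medium portion of the frame:

```python
import cv2
import numpy as np

def middle_scale_augment(image, scale_range=(0.6, 0.9), out_size=512):
    """Shrink an out_size x out_size image by a mid-range factor and paste it
    back onto an out_size canvas at a random offset, so people appear at
    medium scale instead of always filling the frame.

    Any 2D joint / center-map annotations must be transformed with the same
    scale and offset (not shown here).
    """
    scale = np.random.uniform(*scale_range)
    h, w = image.shape[:2]
    resized = cv2.resize(image, (int(w * scale), int(h * scale)))
    canvas = np.zeros((out_size, out_size, 3), dtype=image.dtype)
    y0 = np.random.randint(0, out_size - resized.shape[0] + 1)
    x0 = np.random.randint(0, out_size - resized.shape[1] + 1)
    canvas[y0:y0 + resized.shape[0], x0:x0 + resized.shape[1]] = resized
    return canvas
```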

Feel free to reach out to me if you have any questions. I may update this repository later to include more of the training tricks I used to get the best results faster; sorry I didn't include all of them before. There are some differences between the released code and what I have been using. Besides, have you tried fine-tuning on the 3DPW training set?

ZhengdiYu commented 2 years ago

> Neither of our logs used the pre-trained model. Please note that the best results are very likely not achieved by the model from the last epoch.
>
> Yes, the pre-trained model is pretrained on 2D pose estimation and detection datasets.
>
> To get the results in the paper, you may need more epochs of training. My posted log only covers 80 epochs and didn't use the pre-trained model. Please try to train for more epochs; you will find that the model performs much better with more training. Convergence is relatively slow because, unlike 2D heatmap learning that directly supervises the whole map, we only take a few points from the parameter maps for supervision. Influenced by some uncertainty in training, you may need more time. Besides, please try to reduce the shuffle_crop_ratio_3d in v1.yml and add more middle-scale augmentation if you can.
>
> Feel free to reach out to me if you have any questions. I may update this repository later to include more of the training tricks I used to get the best results faster; sorry I didn't include all of them before.

Happy New Year! Thanks for your quick reply! I think I might just copy my question from my new issue and close that one:

Q1. Did you mean that both of you manually deleted the code that automatically loads 'trained_models/pretrain_hrnet.pkl'? It appears in your log files, and the logic of the code is to load it automatically if not args().fine_tune and not args().eval. I saw that the model_path in your log file is not None, which suggests you should have used it as well. Did you mean you didn't use the pre-trained model, or the backbone?

Q2. I think 'trained_models/pretrain_hrnet.pkl' is a backbone pretrained on 2D pose estimation, rather than weights initialized from ImageNet. Is this correct?

Q3. As I understand it, using v1.yml with the pre-trained backbone 'trained_models/pretrain_hrnet.pkl' should reproduce Table 2 and trained_models/ROMP_HRNet32_V1.pkl. Then fine-tuning this model with v1_hrnet_3dpw_ft.yml should reproduce Table 3. Is this the case?

About fine-tuning: yes, I have tried fine-tuning on 3DPW using trained_models/ROMP_HRNet32_V1.pkl and v1_hrnet_3dpw_ft.yml, and it gave me results very similar to your paper. Evaluating trained_models/ROMP_HRNet32_V1.pkl also gave me the same results as Table 2. Both of them reproduce perfectly.

However, I'm now trying to reproduce trained_models/ROMP_HRNet32_V1.pkl itself, which is used for Table 2 and for the subsequent fine-tuning, and which I think should be achievable using v1.yml and 'trained_models/pretrain_hrnet.pkl'. So I'm curious about 'trained_models/pretrain_hrnet.pkl', hence the questions above.

Arthur151 commented 2 years ago

Q1 & Q2: Yes, in my code, it doesn't automatically load / use the pre-train model. You may observe that the loss in my posted log is relatively larger than his. Q3: Yes, they are supposed to achieve that.

Glad to hear that. Based on this information, I am more confident that what's missing is simply more training time. It took me about two weeks to get ROMP_HRNet32_V1 with my previous code on 4 V100s. How long did your run take?

ZhengdiYu commented 2 years ago

Q1 & Q2: Yes, in my code, it doesn't automatically load / use the pre-train model. You may observe that the loss in my posted log is relatively larger than his. Q3: Yes, they are supposed to achieve that.

Glad to hear that. Based on this information, I am more confident that the problem is more training time. I took about two weeks to get ROMP_HRNet32_V1 with my previous code on 4 V100. How long did you take?


Wow, so your log is completely from-scratch training then, without even the 2D-pose pre-trained backbone (not the model, but the backbone) or ImageNet? After 80 epochs it still can't reproduce the paper, so I guess it might take weeks to reproduce Table 2? But I think that with your current code (which automatically loads the backbone pretrained on 2D pose estimation, pretrain_hrnet.pkl), it will still need more or less the same time.

In fact, I found my performance is similar to your log even with your current code and v1.yml, which automatically loads the pre-trained backbone 'trained_models/pretrain_hrnet.pkl' (pretrained on 2D pose estimation, I believe), and I trained for only 2 days, fewer than 20 epochs.

You needed only 9 epochs to reach 91.57 | 52.97, and the following 68 epochs apparently didn't improve the results significantly, since they were never even evaluated.

Using batch size 16, I get similar results (91.87 | 53.49) in the fourth epoch, and a slightly worse one in the first epoch (91.65 | 54.54). Strangely, it's a bit worse when I use batch size 16 with 4 steps of gradient accumulation, which should behave like batch size 64. Anyway, judging from their best results, both runs reach a PA_MPJPE around 53 to 53.5 and an MPJPE around 91 to 92.5 within the first 6 epochs.

It's indeed faster in the first 5 epochs, but after that I see no significant improvement compared to your log, which is without pretrain_hrnet.pkl, i.e., without 2D pose pretraining.

So I think that, in the long run, whether or not the pre-trained backbone is used doesn't really matter for fully reproducing your Table 2 results; it is only several epochs faster. Is this correct? If you didn't use the pre-trained 2D pose estimation backbone at all, then the pre-trained backbone isn't really necessary; it only saves several hours in the first 5 epochs.

Anyway, I think 91/53 is already good enough for some applications. Could I assume that we can actually get a decent model with only around 10 epochs, even without any pre-training at all? If so, pre-training is not as necessary as I thought before, although you mentioned it should be very important. Could you tell me what pre-training brings us, other than being slightly faster in the first several hours?

Thank you in advance! Looking forward to your response!

Arthur151 commented 2 years ago

Yes, I agree with you for the most part. I really appreciate your interest in ROMP. Here are some observations that might be helpful.

1. The evaluation is only performed when we get better results on the validation set, but that doesn't mean the model is not getting better on the test set.
2. I didn't try batch size 16, but ROMP_HRNet32_V1.pkl was trained with batch size 64.
3. Pretraining matters in two respects: better detection and robustness in crowded scenes. 3DPW is a great benchmark and has greatly promoted the development of this field, but it only contains ground truth for 1-2 people per video and lacks crowded scenes. Similarly, most training datasets, including the 2D pose datasets, are relatively weak in this respect. With pretraining on benchmarks like CrowdHuman, we can get a model that is more robust in more challenging scenes. In short, we should expect more than what the current benchmark evaluations can reflect.

ZhengdiYu commented 2 years ago

> Yes, I agree with you for the most part. I really appreciate your interest in ROMP. Here are some observations that might be helpful.
>
> 1. The evaluation is only performed when we get better results on the validation set, but that doesn't mean the model is not getting better on the test set.
> 2. I didn't try batch size 16, but ROMP_HRNet32_V1.pkl was trained with batch size 64.
> 3. Pretraining matters in two respects: better detection and robustness in crowded scenes. 3DPW is a great benchmark and has greatly promoted the development of this field, but it only contains ground truth for 1-2 people per video and lacks crowded scenes. Similarly, most training datasets, including the 2D pose datasets, are relatively weak in this respect. With pretraining on benchmarks like CrowdHuman, we can get a model that is more robust in more challenging scenes. In short, we should expect more than what the current benchmark evaluations can reflect.

Thanks! So, if we only talk about the evaluation on the 3DPW dataset, rather than further applications, pre-training is not really indispensable, right (judging from the logs)?

Arthur151 commented 2 years ago

Yes, you can look at it that way.

ZhengdiYu commented 2 years ago

> Yes, you can look at it that way.

Over the past two days, I have tried commenting out this line of code and training with v1.yml, but the performance is quite different from yours:

https://github.com/Arthur151/ROMP/blob/d463cbccc6a4456270d8d52fcfa1e9e26b6c4f02/romp/lib/models/modelv1.py#L34

1. The only difference is that I set the batch size to 16, and in another run to 16 with gradient accumulation of 4 (which should be similar to 64). However, neither is even close to your log or to what I expected: bs16x4_load.log bs16x4_not_load_backbone.log bs16_load.log bs16_not_load_backbone.log

Once I comment out the code used for loading the pre-trained backbone, the training completely fails with a huge MPJPE... and RuntimeError: svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 4).

Batch size 16 works when I load the pre-trained backbone (pretrain_hrnet.pkl) or fine-tune, because I can get results similar to his log using both 16 and 16x4, and I can get results similar to Table 3 using your pre-trained model ROMP_HRNet32_V1.pkl. What do you think might cause this? It's quite a huge deviation, which I don't think is simply due to batch size.

2. I have looked into previous versions of ROMP, and it seems that from 2021.9.10, when you first released the training code, it already had the code to automatically load the pre-trained backbone. Can you still remember which version you were actually using? https://github.com/Arthur151/ROMP/blob/ee5e2f21f35a1072327a11ecd4a36c0c64d805e1/romp/lib/models/modelv1.py#L34

Arthur151 commented 2 years ago

A lot of work was done during the Spring Festival. You are very hard-working :+1:. About your question: I think you might need to uncomment that line, at least to start from a good initialization. I maintain my own version of ROMP & BEV for training, which is a little different from the released one; I will release the latest version with BEV in the next update.

About training from scratch, there are some essential tricks to take care of. First, to ensure convergence, the model needs to learn from cropped single-person images first, which are much easier to learn from than crowded images. To do this, we set shuffle_crop_ratio_3d=0.9 and shuffle_crop_ratio_2d=0.9, which is the original purpose of these two settings. I think I might have done this before the formal training. Without pretraining, we really need a large batch size to make it converge.
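To make the curriculum concrete, here is a toy sketch of how a crop-ratio switch might work; the exact semantics of shuffle_crop_ratio_3d/2d in the released code are assumed for illustration, not verified:

```python
import random

def sample_training_image(frame, person_boxes, shuffle_crop_ratio=0.9):
    """Assumed semantics: with probability shuffle_crop_ratio, train on an easy
    single-person crop; otherwise use the full (possibly crowded) frame.
    A high ratio early in training means the model first learns from cropped
    single-person images, as described above."""
    if person_boxes and random.random() < shuffle_crop_ratio:
        x1, y1, x2, y2 = random.choice(person_boxes)  # pick one annotated person
        return frame[y1:y2, x1:x2]                    # single-person crop
    return frame                                      # full multi-person image
```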

ZhengdiYu commented 2 years ago

1. I think there is no need to uncomment the init_weights() in hrnet.py, as it is already covered by ROMP(): self.modules() in ROMP() also returns the HRNet components as its children (see the sketch after this list). Also, in your last released version before the training code, there was no init_weights in hrnet.py either.

2. I tried shuffle_crop_ratio_3d=0.9 and shuffle_crop_ratio_2d=0.9 today, although your log shows 0.4 and 0.1 as in v1.yml. However, the results are still far from yours. After epoch 0 and epoch 1, the MPJPE/PA_MPJPE are still around 180-200 and 100-120, but your log had an MPJPE lower than 100 at epoch 0: batch16.log batch16x4.log
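Regarding item 1, a tiny PyTorch check (illustrative class names, not the ROMP classes) of why a separate init call inside hrnet.py is redundant: nn.Module.modules() walks the module tree recursively, so an initialization loop written over the outer model also reaches the backbone's layers.

```python
import torch.nn as nn

class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 32, 3)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = Backbone()   # nested sub-module, like HRNet inside ROMP
        self.head = nn.Conv2d(32, 1, 1)

m = Model()
# modules() is recursive, so the backbone's Conv2d is visited too:
print([type(x).__name__ for x in m.modules()])
# ['Model', 'Backbone', 'Conv2d', 'Conv2d']
```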

It's weird that I can get results similar to his when I load the pre-trained backbone, but deviate hugely from yours when I don't.

Are there any other differences between your version and the released version? Maybe 'cfg.new_training'? Have you ever tried running the released code without self.backbone.load_pretrain_params()?

Arthur151 commented 2 years ago

No, I haven't tried training from scratch with the released code. It would take weeks to go through the pretraining process. You would see it slowly converge to performance similar to the run with the pre-trained backbone.

Besides, I looked into your log, and there are some weird things. The loss of the 16x4 run is not the sum of the det and reg losses; it seems like the loss is averaged? Please check to ensure that gradient accumulation really has an effect similar to using a large batch size. We all know that a large batch size is critical for normal convergence, especially in the pretraining process.

ZhengdiYu commented 2 years ago

> No, I haven't tried training from scratch with the released code. It would take weeks to go through the pretraining process. You would see it slowly converge to performance similar to the run with the pre-trained backbone.
>
> Besides, I looked into your log, and there are some weird things. The loss of the 16x4 run is not the sum of the det and reg losses; it seems like the loss is averaged? Please check to ensure that gradient accumulation really has an effect similar to using a large batch size. We all know that a large batch size is critical for normal convergence, especially in the pretraining process.

Yes, it might take weeks to finish the whole process, but it would only take several hours to confirm whether its performance is similar to your log from your own version (e.g. in the first few epochs).

And regarding the loss, the 16x4 version's loss is divided by 4 as suggested, simply adapted as shown in the (now-missing) screenshot.
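Since the screenshot is not preserved here, a minimal sketch of the kind of gradient accumulation described (loss scaled by the number of accumulation steps, optimizer stepped every 4 mini-batches; the toy model and names are illustrative) might look like:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 4                                   # 16 x 4 ~ effective batch 64

optimizer.zero_grad()
for i in range(100):
    x = torch.randn(16, 10)                       # per-step mini-batch of 16
    loss = model(x).pow(2).mean()                 # stand-in for det + reg loss
    (loss / accum_steps).backward()               # scale so the accumulated
                                                  # gradient matches one batch of 64
    if (i + 1) % accum_steps == 0:
        optimizer.step()                          # update once per 4 mini-batches
        optimizer.zero_grad()
```

One caveat: batch-norm statistics are still computed over 16 samples per forward pass, so accumulation is not strictly equivalent to a true batch of 64, which may partly explain the gap.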

Anyway, I might be able to use larger GPUs in a couple of days, and I will try again with those then, with only two changes: commenting out self.backbone.load_pretrain_params(), and setting the two crop ratios to 0.9.

Arthur151 commented 2 years ago

Sure, thanks for your interest. I may check this later.

Arthur151 commented 2 years ago

@ZhengdiYu, besides, for the NaN loss, you can use torch.isnan() to avoid its influence. About your problem, I have launched a training run to locate it. Please consider giving training from scratch more time; you will see the performance get much better if you give ROMP enough training time. The longer the training, the better the performance. After hundreds of epochs, ROMP is still converging... I think I still haven't reached the best state ROMP can achieve.
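As a generic illustration of the torch.isnan() suggestion (not the repository's exact fix, and assuming the term is summed with other loss terms), the offending loss can simply be dropped when it turns NaN:

```python
import torch

def safe_loss(term: torch.Tensor) -> torch.Tensor:
    """Return the loss term unchanged, or a zero scalar if it contains NaN,
    so a single bad term cannot poison the whole backward pass."""
    if torch.isnan(term).any():
        return torch.zeros((), dtype=term.dtype, device=term.device)
    return term
```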

Arthur151 commented 2 years ago

@ZhengdiYu Hi, I tried it last night. So far, the log is similar to yours. V1_hrnet_nopretrain_check_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1,2,3.log You are right: I might have loaded the pre-trained backbone before. Sorry for the mistake, and thanks a lot for pointing it out. I have also found the cause of the NaN loss: it comes from the pkp2d loss. I will try to fix this in the next update.

ZhengdiYu commented 2 years ago

> @ZhengdiYu Hi, I tried it last night. So far, the log is similar to yours. V1_hrnet_nopretrain_check_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1,2,3.log You are right: I might have loaded the pre-trained backbone before. I have also found the cause of the NaN loss: it comes from the pkp2d loss. I will try to fix this in the next update.

Thanks for the update. I can see that your loss is lower than mine, maybe due to the larger batch size, but the MPJPE is indeed not as expected, although still better than mine. However, yours seems able to converge; it's now approaching a total loss of 300 with a reg loss of 200, which I think is not far from the previous log. Let's wait and see what happens to your run next.

Arthur151 commented 2 years ago

Sorry, I only meant to verify the log and the bug. Currently, I have to move on to other urgent tasks. I will continue the training later when my schedule allows.

ZhengdiYu commented 2 years ago

> Sorry, I only meant to verify the log and the bug. Currently, I have to move on to other urgent tasks. I will continue the training later when my schedule allows.

That's OK! Thanks for your help~ Good luck

Arthur151 commented 2 years ago

Please note that, for backbone pretraining, one can use the pre-trained model of Higher-HRNet-32. It is very helpful for faster convergence.
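As an untested sketch of that suggestion (the checkpoint filename and key layout are assumptions, not the repository's loader), one could initialise an HRNet-32 backbone from a HigherHRNet checkpoint with a partial state-dict load:

```python
import torch

def load_higher_hrnet_backbone(backbone, ckpt_path="pose_higher_hrnet_w32_512.pth"):
    """Copy every tensor whose name and shape match the backbone; skip the rest
    (e.g. heatmap heads that are not needed here)."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("state_dict", ckpt)          # some checkpoints nest the weights
    model_state = backbone.state_dict()
    matched = {k: v for k, v in state.items()
               if k in model_state and v.shape == model_state[k].shape}
    backbone.load_state_dict(matched, strict=False)
    print(f"loaded {len(matched)}/{len(model_state)} backbone tensors")
    return backbone
```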