Jiyao06 / GenPose

[NeurIPS 2023] GenPose: Generative Category-Level Object Pose Estimation via Diffusion Models
https://sites.google.com/view/genpose
MIT License

how to train? #19

Closed fuzhao123232 closed 8 months ago

fuzhao123232 commented 8 months ago

Dear author, thanks for your SOTA work. I ran train_score.sh on a single card (4090) with batch size 192. By epoch 10 the loss had dropped to 0.35–0.45, and from epoch 10 to 800 it just kept bouncing between 0.35 and 0.45, so it seemed to have converged. I then evaluated and compared against the paper: my average 5°2cm metric is almost 10 points lower, and the other metrics are also low. In other words, I seem to be stuck in a local optimum and the loss will not decrease further. How did the author train it? [image] My eval result: [image] Eval with the author's checkpoint: [image]
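For context, the 5°2cm metric counts a prediction as correct when the rotation error is below 5 degrees and the translation error below 2 cm. A minimal sketch of that check (assumed input conventions; the official NOCS-style evaluation additionally handles symmetric categories, which is omitted here):

```python
import numpy as np


def within_5deg_2cm(pred_R, pred_t, gt_R, gt_t):
    """pred_R, gt_R: (3, 3) rotation matrices; pred_t, gt_t: (3,) translations in metres.
    Returns True when the rotation error is < 5 degrees and the translation error < 2 cm."""
    # Geodesic rotation error from the relative rotation pred_R * gt_R^T.
    cos_angle = (np.trace(pred_R @ gt_R.T) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans_err_m = np.linalg.norm(pred_t - gt_t)
    return bool(rot_err_deg < 5.0 and trans_err_m < 0.02)
```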

fuzhao123232 commented 8 months ago

I tried to train the score network on 4 cards (4× A10) with batch size 1924. At epoch 172 the loss was as follows: [image] and at epoch 327: [image] It seems to have converged, but the loss is larger than with a single card (0.35). I really don't know how to train the score network so that it reproduces the author's results.

fuzhao123232 commented 8 months ago

The difference between my code and the author's code is the dataloader: to speed up CPU I/O, I pre-save the point clouds and load them in __getitem__.

[image]

The point-cloud generation script is as follows: [image] [image] [image] [image]

In fact, I have visualized the results to check, and the generated point clouds and the RGB images correspond one-to-one. [image] [image] So I think my data processing is fine, but training is still difficult.
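For reference, a minimal sketch of the caching idea described above; the directory layout, file contents, and fixed-size resampling are assumptions for illustration, not the actual GenPose dataloader:

```python
import glob
import os

import numpy as np
import torch
from torch.utils.data import Dataset


class CachedPointCloudDataset(Dataset):
    """Load point clouds that were generated offline and saved as .npy files,
    instead of back-projecting depth maps inside __getitem__."""

    def __init__(self, cache_dir: str, num_points: int = 1024):
        self.files = sorted(glob.glob(os.path.join(cache_dir, "*.npy")))
        self.num_points = num_points

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Each .npy file is assumed to hold an (N, 3) camera-frame point cloud
        # already cropped to the object instance.
        pts = np.load(self.files[idx]).astype(np.float32)
        # Resample to a fixed number of points so samples can be batched.
        choice = np.random.choice(len(pts), self.num_points,
                                  replace=len(pts) < self.num_points)
        return torch.from_numpy(pts[choice])
```

Generating the clouds once offline keeps __getitem__ down to a single np.load, at the cost of having to regenerate the cache whenever the cropping or sampling changes.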

fuzhao123232 commented 8 months ago

[image] Did the results reported in the paper use a teacher model? And roughly how low did the final training loss get?

Jiyao06 commented 8 months ago

  1. To determine if the training process has converged, you can assess this by visualizing the training curve.
  2. Regarding your question about the higher loss when training with multiple GPUs, this could be due to the increased batch size without appropriately adjusting the learning rate.
  3. We did not use a teacher model during our training process.
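On point 2, a common heuristic is the linear scaling rule: scale the learning rate by the same factor as the effective batch size. A minimal sketch with placeholder numbers, not GenPose's actual config:

```python
import torch

base_batch_size = 192            # batch size the reference learning rate was tuned for
base_lr = 1e-4                   # reference learning rate (placeholder, not the repo default)

effective_batch_size = 4 * 192   # e.g. 4 GPUs with 192 samples each
scaled_lr = base_lr * effective_batch_size / base_batch_size

model = torch.nn.Linear(1024, 128)  # stand-in for the score network
optimizer = torch.optim.Adam(model.parameters(), lr=scaled_lr)
print(f"scaled learning rate: {scaled_lr:.2e}")
```

A short warmup phase is often added alongside this rule when the batch size grows by a large factor.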
sseunghyuns commented 8 months ago

Hello! I wonder if you succeeded in reproducing the results of the paper.

fuzhao123232 commented 8 months ago

[image] These are the results after training for 320 epochs. Maybe by around epoch 1900 the results can be reproduced; I need more time. [image] Hello, friend. There was a problem with my earlier data processing, so I was training on only part of the REAL data and could not reproduce the results. I have now fixed the problem and have already trained for 300+ epochs. The figure shows my evaluation results, and the loss is still dropping, so I think that once I finish the full 1900 epochs I still have a chance of reproducing the author's results.

sseunghyuns commented 8 months ago

> [image] These are the results after training for 320 epochs. Maybe by around epoch 1900 the results can be reproduced; I need more time. [image] Hello, friend. There was a problem with my earlier data processing, so I was training on only part of the REAL data and could not reproduce the results. I have now fixed the problem and have already trained for 300+ epochs. The figure shows my evaluation results, and the loss is still dropping, so I think that once I finish the full 1900 epochs I still have a chance of reproducing the author's results.

Thanks!

fuzhao123232 commented 7 months ago

Eval at epoch 1664: [image] Compared to the paper: [image]

I think training may need more than 2000 epochs. Why is the convergence so slow? It makes me want to cry.

sseunghyuns commented 7 months ago

Did you also train the EnergyNet, or did you use the author's pretrained checkpoint? I'm also suffering from slow convergence... Moreover, I trained for 1032 epochs and my evaluation scores are much lower than yours @fuzhao123232.

image
fuzhao123232 commented 7 months ago

I used the author's pretrained checkpoint for the EnergyNet and my own ScoreNet weights. I think the training result might be related to the random initial weights. Also, you should check whether your dataset is really OK.
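One cheap label sanity check, as a sketch with assumed field names rather than GenPose's actual data format: map the observed camera-frame points back into the canonical object frame with the ground-truth pose and verify they land roughly centered and within the annotated object size.

```python
import numpy as np


def pose_label_looks_ok(points_cam, gt_R, gt_t, gt_size, tol=1.5):
    """points_cam: (N, 3) camera-frame points; gt_R: (3, 3) rotation;
    gt_t: (3,) translation; gt_size: (3,) object extents (all assumed names).
    Returns True when the back-projected points look consistent with the label."""
    points_obj = (points_cam - gt_t) @ gt_R        # apply the inverse rigid transform
    centered = np.all(np.abs(points_obj.mean(axis=0)) < tol * gt_size)
    inside = np.mean(np.all(np.abs(points_obj) < tol * gt_size, axis=1)) > 0.9
    return bool(centered and inside)
```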

sseunghyuns commented 7 months ago

Thank you! I think I have to check the dataloader. If it's fine, could you show me the loss curve recorded in TensorBoard?

sseunghyuns commented 7 months ago

> Thank you! I think I have to check the dataloader. If it's fine, could you show me the loss curve recorded in TensorBoard?

@fuzhao123232 It seems that my dataloader is fine.

Also, in the comments above, you said the loss dropped to 0.45 after 10 epochs. Was it always in that range, both with a single GPU and with multiple GPUs? In my case, after training 100 epochs with a single GPU (or 2 GPUs) the loss was 0.6–0.7. Did you make any changes to the default config settings? (I'm referring to your code for pre-processing and loading the pcd data as npy files; when I visualized the dataloader output, it seemed to be working fine.)
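In case it helps with sharing curves: the scalars can be dumped back out of the TensorBoard event files, for example like this (the log directory and tag name below are guesses, not necessarily what the training script writes):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("./logs/score_network")   # hypothetical log directory
ea.Reload()
print(ea.Tags()["scalars"])                     # list the available scalar tags
for event in ea.Scalars("train/loss"):          # hypothetical tag name
    print(event.step, event.value)
```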

sseunghyuns commented 7 months ago

(24.03.14) There was a problem with the ground-truth label pkl files. I think this will be solved once I fix that issue.