PKU-EPIC / UniDexGrasp

Official code for "UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy" (CVPR 2023)

Joint training in grasp generation #10

Closed lym29 closed 9 months ago

lym29 commented 9 months ago

Hi there,

I hope this message finds you well. I have been working on training the grasp generation stage but am uncertain about some details of the process, specifically the joint training of ipdf and glow with ContactNet.

Firstly, I would like to confirm whether my understanding is correct: am I supposed to train these three networks independently at first and then fine-tune the glow network using the pretrained ipdf and ContactNet? I am a bit confused because in the provided config file, the checkpoints for ipdf and ContactNet are set to the initial ones.

Furthermore, it would be very helpful if you could tell me how many epochs you used for the pretraining and the joint training.

Thank you so much for presenting such a great work 👍

XYZ-99 commented 9 months ago

Thank you for your interest in our work!

Yes, your understanding is correct. They are first trained independently before being finetuned. The paths in the config are more like placeholders.

We don't have a strict number, but you should be fine once the training curves indicate the models are converging.
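
For concreteness, the overall ordering could look roughly like the sketch below. The function and checkpoint names are purely illustrative, not the actual scripts or configs in this repo:

```python
# Illustrative ordering only; names and paths are hypothetical.
def train(model_name, pretrained=()):
    """Stand-in for one training run; returns a checkpoint path."""
    return f"checkpoints/{model_name}.pt"

# Stage 1: pretrain the three networks independently.
ipdf_ckpt = train("ipdf")
contactnet_ckpt = train("contactnet")
glow_ckpt = train("glow")  # only the NLL loss at this stage

# Stage 2: joint training -- finetune glow, loading the pretrained ipdf and
# ContactNet checkpoints (the checkpoint paths in the released config are placeholders).
joint_ckpt = train("glow_joint", pretrained=(ipdf_ckpt, contactnet_ckpt, glow_ckpt))
```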

Sorry for the delay. Please let us know if you have further questions.

lym29 commented 9 months ago

Thank you so much for your clarification. I trained these three networks again after receiving your reply. The loss curve for the GLOW model shows no tendency to converge, just the same as what I got a few days ago.

[screenshot: GLOW training loss curve]

I noticed that the cmap_loss remains 0 during training. I am wondering if the independent training of GLOW doesn't take the contact map as supervision? According to the paper, the translation and pose of the hand should be fed into the ContactNet to generate a contact map, but during independent training the output of ContactNet would be messy, so we can't use it to supervise GLOW and IPDF. Is that right?

So, if we don't use the cmap to supervise the GLOW network, there will be no supervision other than NLL to train the normalizing flow. I'm wondering how to ensure convergence of the network if the distribution over the sample space is so sparse (only the GT data has non-zero probability; everything else is 0). When I test the GLOW model after about 200 epochs, the output is far from the GT.

I apologize if there are any misunderstandings on my part, since I am unfamiliar with normalizing flows. Please kindly point them out. Thanks!

lhrrhl0419 commented 9 months ago

> Thank you so much for your clarification. I trained these three networks again after receiving your reply. The loss curve for the GLOW model shows no tendency to converge, just the same as what I got a few days ago.

I apologize for the delay in addressing this issue, which was caused by disk-related problems. The glow's training loss couldn't converge because of a bug in the training code, which has now been fixed. I've retrained the glow with the new code and the loss converges now.

[glow_training: loss curve of the retrained glow]

> I noticed that the cmap_loss remains 0 during training. I am wondering if the independent training of GLOW doesn't take the contact map as supervision? According to the paper, the translation and pose of the hand should be fed into the ContactNet to generate a contact map, but during independent training the output of ContactNet would be messy, so we can't use it to supervise GLOW and IPDF. Is that right?

The cmap loss is only calculated during joint training now. I think using this loss in the independent training would not severely harm the training process, because of the supervision from the NLL loss, but it also would not have a significant effect, since this additional loss is already optimized during joint training. However, we haven't tried it.
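
For reference, a minimal sketch of how such a contact-map consistency term could be wired up during joint training. The handles `glow`, `contactnet`, and `contact_from_hand` are hypothetical names for this illustration, not the actual modules of this repo:

```python
import torch.nn.functional as F

def cmap_consistency_loss(glow, contactnet, obj_pts, contact_from_hand):
    """Hypothetical joint-training term (illustrative names, not this repo's API):
    encourage the contact map induced by a grasp sampled from the flow to match
    the contact map that ContactNet predicts for the same object points."""
    hand_pose = glow.sample(obj_pts)                      # sampled grasp, differentiable w.r.t. the flow
    cmap_target = contactnet(obj_pts)                     # (B, N) predicted per-point contact map
    cmap_sample = contact_from_hand(hand_pose, obj_pts)   # (B, N) contact map induced by the sample
    return F.mse_loss(cmap_sample, cmap_target)

# During joint training this term would be added on top of the NLL loss, e.g.:
# loss = nll_loss + cmap_weight * cmap_consistency_loss(glow, contactnet, obj_pts, contact_from_hand)
```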

> So, if we don't use the cmap to supervise the GLOW network, there will be no supervision other than NLL to train the normalizing flow. I'm wondering how to ensure convergence of the network if the distribution over the sample space is so sparse (only the GT data has non-zero probability; everything else is 0). When I test the GLOW model after about 200 epochs, the output is far from the GT.

Theoretically, the glow could just memorize all inputs and outputs and predict a Dirac distribution, as you've mentioned. But just like other networks, if the dataset is big enough, it is able to generalize and figure out the underlying distribution of the data, and the distribution of our dataset, which is generated by DexGraspNet, is not sparse. For example, there are a lot of bottles lying on the table in the data, and some of the grasping poses in the dataset grasp the upper part of a bottle while others grasp the middle or the lower part. By minimizing the NLL loss, the glow minimizes the KL divergence between the data distribution and the glow's output distribution, so it will assign high probability to all of those grasping poses. Additionally, since the glow learns to model the distribution of all poses that can grasp the object in the dataset, a large distance between an output sample and the GT is acceptable.
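
To make the connection between NLL and KL explicit (this is the standard identity, nothing specific to our code):

```latex
D_{\mathrm{KL}}\!\left(p_{\mathrm{data}} \,\|\, p_\theta\right)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_{\mathrm{data}}(x) - \log p_\theta(x)\right]
  = -H\!\left(p_{\mathrm{data}}\right)
    + \underbrace{\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[-\log p_\theta(x)\right]}_{\text{NLL loss}}
```

Since the entropy of the data distribution does not depend on the model parameters, minimizing the NLL over the dataset is equivalent to minimizing the KL divergence between the data distribution and the flow's distribution.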

If you have any further questions or encounter any other issues, please feel free to reach out.

lym29 commented 9 months ago

Thanks for the explanation! I understand your method now and have learned a lot from your work :)