andvg3 / LGD

Dataset and Code for CVPR 2024 paper "Language-driven Grasp Detection."
https://airvlab.github.io/grasp-anything/
MIT License

Questions about details of training the LGD model on the Grasp-Anything++ dataset. #4

Closed by lkccu 1 week ago

lkccu commented 1 month ago

Thank you for your great work. However, I am a little confused about the released code and would highly appreciate it if you could help me out. 1. In LGD/train_network_diffusion.py#L247, it appears that the loss computed by GaussianDiffusion:training_loss is superseded by that of net:compute_loss. Does this imply that the ddpm module is not involved in updating the network parameters? (https://github.com/andvg3/LGD/blob/2471385507a883ba9147624ce2b3983e691f5fd9/inference/models/lgdm/network.py#L140)

https://github.com/andvg3/LGD/blob/2471385507a883ba9147624ce2b3983e691f5fd9/diffusion/gaussian_diffusion.py#L808

https://github.com/andvg3/LGD/blob/2471385507a883ba9147624ce2b3983e691f5fd9/train_network_diffusion.py#L237-L247

2. In LGD/diffusion/gaussian_diffusion.py#L867, the guiding_point is computed using net.pos_output, with pos_output being updated as x + net.pos_output. My interpretation is that x represents the noised q_score_map at time t, while net.pos_output might be either the predicted q_0 or the noise added to q_0 at time t. I am confused about which is the right way to interpret the updated net.pos_output: x + net.pos_output.

https://github.com/andvg3/LGD/blob/2471385507a883ba9147624ce2b3983e691f5fd9/diffusion/gaussian_diffusion.py#L866-L868

https://github.com/andvg3/LGD/blob/2471385507a883ba9147624ce2b3983e691f5fd9/inference/models/lgdm/network.py#L114-L119

3. Furthermore, I've observed that t_embedding is not utilized in the computations. In LGD/inference/models/lgdm/network.py#L107, the LGD architecture appears to be similar to grconvnet.

https://github.com/andvg3/LGD/blob/2471385507a883ba9147624ce2b3983e691f5fd9/inference/models/lgdm/network.py#L98-L117

I would be most grateful for any clarification you could provide on these matters. Thank you in advance for your time and consideration. Best regards, lkc.

andvg3 commented 1 month ago

Hi @lkccu ,

Does this imply that the ddpm module is not involved in updating the network parameters?

No. The denoising process is involved in training and evaluation through the call to WrappedModel in LGD/diffusion/respace.py; please refer to L227 of LGD/train_network_diffusion.py. The loss you mentioned is used for evaluation metrics, not as the training loss.
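The control flow described here can be sketched with a toy example (Net, compute_loss, and diffusion_metric_loss below are hypothetical stand-ins for illustration, not the repository's actual classes): one loss is computed purely as a metric, while a separate loss drives the parameter update.

```python
# Hedged sketch: a single scalar "network" trained by one loss while a
# second, diffusion-style loss is only logged. Names are placeholders.

class Net:
    def __init__(self):
        self.w = 0.5  # single trainable parameter for illustration

    def compute_loss(self, x, target):
        pred = self.w * x
        loss = (pred - target) ** 2
        grad = 2 * (pred - target) * x  # d(loss)/d(w)
        return loss, grad

def diffusion_metric_loss(pred, target):
    # Stands in for the diffusion loss used as an evaluation metric:
    # it is computed and could be logged, but never backpropagated.
    return abs(pred - target)

net = Net()
x, target, lr = 1.0, 2.0, 0.1
for _ in range(50):
    metric = diffusion_metric_loss(net.w * x, target)  # metric only
    loss, grad = net.compute_loss(x, target)           # this loss trains
    net.w -= lr * grad

print(round(net.w, 3))  # prints 2.0: only compute_loss moved the weights
```

The point of the sketch is just the separation of roles: gradients flow only from the training loss, so the metric loss has no effect on the learned parameters.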

I am confused which is the right way to interpret updated net.pos_output.

It is our design choice to implement the architecture proposed in our paper using guiding points as attention maps. We add the guiding pos_output predicted from the conditions to the actual pos_output predicted by the denoising process (q_0, as you mentioned). This additive mechanism aligns with our network design.
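The residual reading of the update can be made concrete with a toy numpy sketch (illustrative only; q0, the noise, and the "ideal" residual below are fabricated values, not the model's actual outputs): if pos_output predicts the residual toward the clean map, then x + pos_output moves the noised map back toward q_0.

```python
import numpy as np

# Hedged sketch of the x + pos_output update: x is the noised quality map
# at timestep t; pos_output is interpreted as a predicted residual.
rng = np.random.default_rng(0)
q0 = rng.random((4, 4))             # "clean" grasp quality map (toy data)
noise = rng.normal(0.0, 0.1, (4, 4))
x = q0 + noise                      # noised map at timestep t

pos_output = q0 - x                 # an ideal residual prediction
x_updated = x + pos_output          # the x + net.pos_output update

assert np.allclose(x_updated, q0)   # the update recovers the clean map
```

Under this interpretation the sum is not "q_0 plus something extra" but a correction applied to the current noisy state, which is consistent with the additive guiding mechanism described above.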

The lgd architecture appears to be simliar to grconvnet.

They are similar to some extent. However, our network integrates features from the ALBEF modules and adds guiding features, as explained in my answer to your previous question. The t_embedding is the embedding of the input timestep; together with the text and image features, it helps to predict the guiding_points.
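For readers unfamiliar with timestep embeddings, here is a generic sketch of how one is typically produced (this is the standard sinusoidal embedding common in diffusion models and Transformers, not necessarily the exact formula used in LGD):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Standard sinusoidal timestep embedding (generic sketch).
    Maps an integer timestep t to a dim-dimensional vector that can be
    fused with other conditioning features (e.g. text and image)."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to ~1/10000.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(t=25, dim=16)
print(emb.shape)  # prints (16,)
```

A vector like this, concatenated or fused with the text and image features, is what lets the conditioning branch know which denoising step it is operating at.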

I hope my explanations help. Please do not hesitate to contact me if you have any further questions.

Best regards, An Vuong

lkccu commented 1 month ago

Thank you for your kind and detailed explanation! It is very helpful. Regards, lkc

lkccu commented 1 month ago

Hi @andvg3, I want to train the lgrconvnet3, clipfusion, and proposed LGD models on the Grasp-Anything++ dataset. What are the recommended settings? Best regards, lkc.

andvg3 commented 1 month ago

Hi @lkccu ,

You should first download both the Grasp-Anything and Grasp-Anything++ datasets, then follow the step-by-step instructions in the README file; training should work properly. Please let me know if you have any further questions.

Best,

An Vuong

andvg3 commented 1 week ago

Closed due to inactivity.