Thanks for your question.
There are three main reasons:

1. Modular decoupling: the two submodules are decoupled, so training them separately ensures that modifications to one do not affect the other. Only the modified module needs to be retrained.
2. Resource efficiency: training both submodules simultaneously would significantly increase memory usage and computational cost. Some GPUs, such as a 40GB A800, may not have enough memory unless the batch size is reduced.
3. Different data requirements: the two submodules use different types of partially labeled data. The goal image generation model can use data without action labels but requires language annotations; in contrast, the policy submodule can use data without language annotations but requires action labels (see the sketch below). Training both submodules together would require fully annotated data, limiting the usable dataset.
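To make point 3 concrete, here is a minimal sketch of training on partially labeled data. All names (the `generator`/`policy` objects, their `loss` methods, and the batch keys) are placeholders for illustration, not the actual code in this repository.

```python
# Minimal illustration of point 3: each submodule trains only on the subset of
# data that carries the labels it needs. Names are placeholders, not this repo's API.

def train_goal_image_generator(generator, dataloader, optimizer):
    """Goal image generation: needs language annotations, not action labels."""
    for batch in dataloader:
        if batch.get("language") is None:
            continue  # skip samples without language annotations
        loss = generator.loss(batch["obs_image"], batch["language"], batch["goal_image"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_policy(policy, dataloader, optimizer):
    """Policy: needs action labels; conditions on the ground-truth goal image."""
    for batch in dataloader:
        if batch.get("actions") is None:
            continue  # skip samples without action labels
        loss = policy.loss(batch["obs_image"], batch["goal_image"], batch["actions"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Training both jointly would require batches with language AND action labels,
# i.e. only fully annotated data could be used.
```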
In simulation experiments, the answer is yes. Although we haven't tested it, we expect that training the goal image generation model on CALVIN alone would be sufficient in simulation.
My question 1 might not have been clear. Dividing the training into 3 steps is necessary, and GR-1 has shown that pretraining the Multi-Modal Goal Conditioned Policy can achieve a higher success rate in manipulation. But in step 3, we have the fully annotated CALVIN dataset, so why not train the Goal Image Generation Model and the Multi-Modal Goal Conditioned Policy together in a global fine-tune? And if we want to reduce GPU cost, can the Goal Image Generation Model be frozen?
Are you suggesting using a generated goal image instead of the ground truth goal image to train the policy? If so, training may become impractically slow, since producing each goal image requires running the full iterative diffusion sampling process.
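As a rough illustration of the cost difference (method names such as `sample_noise` and `denoise_step` are hypothetical, not this repository's actual API): the ground-truth goal image is a single dataset lookup, while a generated one needs a full reverse-diffusion chain for every training batch.

```python
# Placeholder sketch: the cost of obtaining a goal image during policy training.
# Method names below are illustrative, not this repository's actual API.

def goal_from_dataset(batch):
    # Ground-truth goal image: a single lookup per batch.
    return batch["goal_image"]

def goal_from_diffusion(generator, batch, num_steps=50):
    # Generated goal image: one full reverse-diffusion chain per batch,
    # i.e. roughly num_steps forward passes through the generator.
    x = generator.sample_noise(batch["obs_image"].shape)
    for t in reversed(range(num_steps)):
        x = generator.denoise_step(x, t, batch["obs_image"], batch["language"])
    return x
```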
Got it. Thanks for your reply!
Hi! Thank you for this nice work! I have two questions about this paper: