Abraham190137 / TactileACT

Incorporating Tactile Signals into the ACT framework for peg insertion tasks
MIT License

question about pretraining method #3

Open a510721 opened 3 weeks ago

a510721 commented 3 weeks ago

Thank you for sharing this great paper.

In the paper, the experimental results in Fig. 6 include a vision-only test, and it is stated that its performance is improved by pretraining ACT.

In the paper, CLIP contrastive learning is applied by pairing a camera image with a tactile image, so I am curious how the vision-only model was pretrained. If possible, could you share the method and code?

Abraham190137 commented 3 weeks ago

Hello, thanks for your interest in the paper!

The vision-only agent was pre-trained using tactile and visual data, the same as the visuo-tactile agent. The code can be found in the CLIP pretraining file.

For all of our experiments, we collected a demonstration dataset with both tactile and visual data. We then pre-trained a vision encoder and a tactile encoder using a contrastive (CLIP) loss, and used these encoders in an imitation learning (IL) framework (either ACT or Diffusion Policy), fine-tuning the encoders during the IL model training.
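
For illustration, here is a minimal sketch of what such a CLIP-style contrastive step between a vision encoder and a tactile encoder can look like in standard PyTorch. The encoder architecture, projection head, and hyperparameters below are placeholders, not the code in the repo's CLIP pretraining file:

```python
# Minimal sketch of CLIP-style visuo-tactile contrastive pretraining.
# Encoder choice, embedding size, and temperature are illustrative only.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def make_encoder(embed_dim: int = 256) -> torch.nn.Module:
    # ResNet-18 backbone with a linear projection head (placeholder choice).
    backbone = resnet18(weights=None)
    backbone.fc = torch.nn.Linear(backbone.fc.in_features, embed_dim)
    return backbone

vision_encoder = make_encoder()
tactile_encoder = make_encoder()
optimizer = torch.optim.Adam(
    list(vision_encoder.parameters()) + list(tactile_encoder.parameters()), lr=1e-4
)
temperature = 0.07

def clip_step(vision_imgs: torch.Tensor, tactile_imgs: torch.Tensor) -> float:
    # Embed both modalities and L2-normalize the embeddings.
    v = F.normalize(vision_encoder(vision_imgs), dim=-1)
    t = F.normalize(tactile_encoder(tactile_imgs), dim=-1)
    # Pairwise similarity logits; matched (image, tactile) pairs lie on the diagonal.
    logits = v @ t.T / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric cross-entropy over both directions, as in CLIP.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After pretraining, both encoders would be saved so the downstream IL model can initialize from them and keep fine-tuning them.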

For the vision-only experiments, we did the same visuo-tactile pretraining step, but used a vision-only IL model. This IL model uses only the pre-trained vision encoder (the tactile encoder is discarded), which yields a vision-only policy that was nonetheless trained with both visual and tactile data. So the vision-only agent is vision-only at inference, but its training (well, pre-training) is visuo-tactile.
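
Continuing the sketch above (again purely illustrative, with made-up checkpoint keys rather than the repo's actual format), the vision-only variant would only ever load the vision half of the pretrained checkpoint:

```python
# Illustrative only: keep the pretrained vision encoder, ignore the tactile one.
import torch

checkpoint = torch.load("clip_pretrained.pth")   # hypothetical checkpoint path and layout
vision_encoder = make_encoder()                  # same architecture as in the sketch above
vision_encoder.load_state_dict(checkpoint["vision_encoder"])
# checkpoint["tactile_encoder"] is never loaded for the vision-only policy.

# The encoder stays trainable so it is fine-tuned during IL training, and is then
# plugged into the IL model (ACT or Diffusion Policy) as its image backbone.
for p in vision_encoder.parameters():
    p.requires_grad = True
```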

Of course, this means that you still need to collect both visual and tactile data to train the vision-only agent. However, by only requiring visual data during inference, we remove the need for costly and delicate tactile sensors during deployment, while preserving most of the benefits gained from using tactile data.

As a side note, you could potentially reduce the tactile data collection by collecting only some of the demos with tactile data (or, if you're doing multi-task learning, collecting tactile observations for only a subset of tasks), and then using just that tactile subset of demos for the pretraining step.
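
A rough sketch of that idea (the demo format and loader here are hypothetical, just to show the filtering):

```python
# Hypothetical demo filtering: only demos that include tactile readings feed the
# contrastive pretraining; every demo is still usable for the vision-only IL stage.
all_demos = load_demo_dicts("demo_dir")  # hypothetical loader returning one dict per demo
pretrain_demos = [d for d in all_demos if d.get("tactile_images") is not None]
il_demos = all_demos
```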