Abraham190137 / TactileACT

Incorporating Tactile Signals into the ACT framework for peg insertion tasks
MIT License

question about pretraining method #3

Open a510721 opened 3 weeks ago

a510721 commented 3 weeks ago

Thank you for sharing this great paper.

In the paper, the experimental results in Fig. 6 include a vision-only test, and it is stated that its performance is improved by pretraining ACT.

In the paper, CLIP contrastive learning is applied by pairing a camera image with a tactile image, so I am curious how the vision-only model was pretrained. If possible, could you share the method and code?

Abraham190137 commented 3 weeks ago

Hello, thanks for your interest in the paper!

The vision-only agent was pre-trained using tactile and visual data, the same as the visuo-tactile agent. The code can be found in the CLIP pretraining file.

For all of our experiments, we collected a demonstration dataset with both tactile and visual data. We then pre-trained a vision encoder and a tactile encoder using a contrastive (CLIP) loss, and used these encoders in an imitation learning (IL) framework (either ACT or Diffusion Policy), fine-tuning the encoders during the IL model training.
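
For illustration, here is a minimal sketch of what such a CLIP-style contrastive step between a vision encoder and a tactile encoder can look like in standard PyTorch. The encoder architecture, projection head, and hyperparameters below are placeholders, not the code in the repo's CLIP pretraining file:

```python
# Minimal sketch of CLIP-style visuo-tactile contrastive pretraining.
# Encoder choice, embedding size, and temperature are illustrative only.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def make_encoder(embed_dim: int = 256) -> torch.nn.Module:
    # ResNet-18 backbone with a linear projection head (placeholder choice).
    backbone = resnet18(weights=None)
    backbone.fc = torch.nn.Linear(backbone.fc.in_features, embed_dim)
    return backbone

vision_encoder = make_encoder()
tactile_encoder = make_encoder()
optimizer = torch.optim.Adam(
    list(vision_encoder.parameters()) + list(tactile_encoder.parameters()), lr=1e-4
)
temperature = 0.07

def clip_step(vision_imgs: torch.Tensor, tactile_imgs: torch.Tensor) -> float:
    # Embed both modalities and L2-normalize the embeddings.
    v = F.normalize(vision_encoder(vision_imgs), dim=-1)
    t = F.normalize(tactile_encoder(tactile_imgs), dim=-1)
    # Pairwise similarity logits; matched (image, tactile) pairs lie on the diagonal.
    logits = v @ t.T / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric cross-entropy over both directions, as in CLIP.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After pretraining, both encoders would be saved so the downstream IL model can initialize from them and keep fine-tuning them.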

For the vision-only experiments, we did the same visuo-tactile pretraining step, but used a vision-only IL model. This IL model uses only the pre-trained vision encoder (the tactile encoder is discarded), which yields a vision-only policy that was nonetheless trained with both visual and tactile data. So the vision-only agent is vision-only at inference, but its training (well, pre-training) is visuo-tactile.
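
Continuing the sketch above (again purely illustrative, with made-up checkpoint keys rather than the repo's actual format), the vision-only variant would only ever load the vision half of the pretrained checkpoint:

```python
# Illustrative only: keep the pretrained vision encoder, ignore the tactile one.
import torch

checkpoint = torch.load("clip_pretrained.pth")   # hypothetical checkpoint path and layout
vision_encoder = make_encoder()                  # same architecture as in the sketch above
vision_encoder.load_state_dict(checkpoint["vision_encoder"])
# checkpoint["tactile_encoder"] is never loaded for the vision-only policy.

# The encoder stays trainable so it is fine-tuned during IL training, and is then
# plugged into the IL model (ACT or Diffusion Policy) as its image backbone.
for p in vision_encoder.parameters():
    p.requires_grad = True
```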

Of course, this means that you still need to collect both visual and tactile data to train the vision-only agent. However, by only requiring visual data during inference, we remove the need for costly and delicate tactile sensors during deployment, while preserving most of the benefits gained from using tactile data.

As a side note, you could potentially reduce the tactile data collection by collecting only some of the demos with tactile data (or, if you're doing multi-task learning, collecting tactile observations for only a subset of tasks), and then using just that tactile subset of demos for the pretraining step.
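
A rough sketch of that idea (the demo format and loader here are hypothetical, just to show the filtering):

```python
# Hypothetical demo filtering: only demos that include tactile readings feed the
# contrastive pretraining; every demo is still usable for the vision-only IL stage.
all_demos = load_demo_dicts("demo_dir")  # hypothetical loader returning one dict per demo
pretrain_demos = [d for d in all_demos if d.get("tactile_images") is not None]
il_demos = all_demos
```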