Please refer to the details of the reference images in Section 4.2, "Improving Photorealism and Accuracy".
"A reference visual image captures the original scene without any robot-object interaction."
"A reference tactile image captures the tactile response when the sensor touches nothing, which can help the system calibrate the tactile input, as different GelSight sensors have different lighting distribution and black dot patterns."
How are the reference visual and tactile images in each sample selected from the collected data? I'm not quite sure about this.