We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.
Hi,
Can you share the fully supervised Kinetics-trained R(2+1)D-18 model and the Kinetics-pretrained STiCA model? I am working on a survey of self-supervised learning that compares different self-supervised methods. I would like to include your STiCA method, along with a comparison against fully supervised learning. Hoping for a positive response.