We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.
Whenever I try to pretrain using the code, the number of valid videos is reported as 0. I also tried the code in the supplementary material for the paper "Multi-modal Self-Supervision from Generalized Data Transformations"; although it had a few errors, the valid video count there was not zero. Is there any difference between the two codebases in how a valid video is checked?
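For context, this is roughly what I would expect a valid-video check to do. The function name, thresholds, and use of `torchvision.io.read_video` here are my own guesses to illustrate the question, not the repo's actual code:

```python
import torchvision.io as tvio


def is_valid_video(path: str, min_frames: int = 16, require_audio: bool = True) -> bool:
    """Hypothetical validity filter: a clip counts as valid only if it
    decodes and has enough frames plus an audio track, since the
    pretraining is audio-visual."""
    try:
        # read_video returns (video [T, H, W, C], audio [channels, samples], info)
        video, audio, _info = tvio.read_video(path, pts_unit="sec")
    except Exception:
        return False  # file is missing or cannot be decoded
    if video.shape[0] < min_frames:
        return False  # too short to cut a training clip from
    if require_audio and audio.numel() == 0:
        return False  # no audio stream, unusable for audio-visual self-supervision
    return True
```

If the two codebases differ in any of these criteria (decodability, minimum length, or requiring an audio stream), that could explain why one reports 0 valid videos and the other does not.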