cfoster0 opened 3 years ago
Pull request #11 added a basic CLIP objective. What remains is to implement the microbatching tricks.
After a long period of dormancy, I've spent some time figuring this component out. The notebook below implements microbatching and parallelism for the contrastive loss in a dummy CLIP setup. I'll work to incorporate this into the codebase in the coming weeks and get us back on track!
https://colab.research.google.com/drive/1SzlPD4ptVtfINIfTt3lGyHwazvZumv9a?usp=sharing
CLIP's objective is a symmetric cross-entropy loss between the representations of the texts and the images (spectrograms, in our case) in a batch. It benefits from very large batch sizes (OpenAI used 32k). There are microbatching tricks we can use to fit such large batches in memory.
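For reference, here's a minimal sketch of that symmetric loss in PyTorch, on a batch small enough to compute in one shot (the function name and temperature value are illustrative, not from the codebase):

```python
import torch
import torch.nn.functional as F

def clip_loss(text_emb, spec_emb, temperature=0.07):
    """Symmetric cross-entropy over the pairwise similarity matrix.

    text_emb, spec_emb: (batch, dim) L2-normalized embeddings, where
    matching text/spectrogram pairs share the same row index.
    """
    logits = text_emb @ spec_emb.t() / temperature      # (batch, batch)
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_t2s = F.cross_entropy(logits, targets)         # text -> spectrogram
    loss_s2t = F.cross_entropy(logits.t(), targets)     # spectrogram -> text
    return (loss_t2s + loss_s2t) / 2
```

Computed naively, this requires holding both encoders' activations for the whole batch at once, which is what the microbatching below avoids.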
You can compute the embeddings for your texts under no_grad and cache them. Then you compute the spectrogram embeddings in microbatches, computing each microbatch's classification loss against the full set of cached text embeddings. As you go along, cache those computed spectrogram embeddings too. Lastly, you do the other side of the loss, holding the cached spectrogram embeddings constant and computing the loss for the text embeddings in microbatches.
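To make that concrete, here's a PyTorch sketch of the three passes (the function name, encoder interfaces, and microbatch size are assumptions for illustration, not the codebase's API). Gradients accumulate across the `.backward()` calls, so you'd run an optimizer step after this returns. Note that under this scheme each encoder only receives gradient through its own direction of the loss, since the other side's embeddings are cached constants.

```python
import torch
import torch.nn.functional as F

def microbatched_clip_step(text_encoder, spec_encoder, texts, specs,
                           micro=256, temperature=0.07):
    """Accumulate gradients for one large CLIP batch, microbatch by
    microbatch. Assumes the encoders return L2-normalized (n, dim)
    embeddings; only per-microbatch activations are held in memory.
    """
    n = texts.shape[0]
    targets = torch.arange(n, device=texts.device)

    # Pass 1: embed all texts under no_grad and cache them.
    with torch.no_grad():
        text_emb = torch.cat([text_encoder(texts[i:i + micro])
                              for i in range(0, n, micro)])

    # Pass 2: embed spectrograms in microbatches with grad enabled.
    # Each microbatch is classified against the full set of cached
    # texts, so the softmax still sees all n candidates. The detached
    # embeddings are cached for pass 3.
    spec_chunks = []
    for i in range(0, n, micro):
        emb = spec_encoder(specs[i:i + micro])
        logits = emb @ text_emb.t() / temperature        # (micro, n)
        loss = F.cross_entropy(logits, targets[i:i + micro])
        # Rescale so the accumulated total matches the mean of the
        # full symmetric loss over all n pairs.
        (loss * emb.shape[0] / (2 * n)).backward()
        spec_chunks.append(emb.detach())
    spec_emb = torch.cat(spec_chunks)

    # Pass 3: the other side of the loss, holding the cached
    # spectrogram embeddings constant and backpropagating through
    # the text encoder in microbatches.
    for i in range(0, n, micro):
        emb = text_encoder(texts[i:i + micro])
        logits = emb @ spec_emb.t() / temperature        # (micro, n)
        loss = F.cross_entropy(logits, targets[i:i + micro])
        (loss * emb.shape[0] / (2 * n)).backward()
```

With this, peak activation memory is set by the microbatch size rather than the full batch size n, at the cost of one extra forward pass over the texts.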
Some example code that implements a partial version of that is here:
https://gist.github.com/crowsonkb/a93904fbb88aff0302aac98dfdb26b5f#file-clip_coco_2-py-L182