Open iejMac opened 2 years ago
Test different CLIP backbones

H/14 gets much better results and isn't much slower end to end, because video decoding dominates the runtime. So we will likely switch to H/14 embeddings (or maybe L/14) for as long as video decoding remains the bottleneck. If we change the architecture of clip-video-encode to alleviate this bottleneck, we should revisit this question.
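
To make the bottleneck argument concrete, here is a rough back-of-the-envelope sketch. It assumes decoding and encoding run one after the other for each frame, and every number in it is a made-up placeholder, not a measurement from clip-video-encode. The function name `end_to_end_slowdown` is hypothetical as well.

```python
def end_to_end_slowdown(decode_ms: float, base_encode_ms: float, cand_encode_ms: float) -> float:
    """Ratio of per-frame time with the candidate backbone to per-frame time with the baseline.

    Assumes a serial pipeline: per-frame time = decode time + encode time.
    """
    base_total = decode_ms + base_encode_ms
    cand_total = decode_ms + cand_encode_ms
    return cand_total / base_total


if __name__ == "__main__":
    # Hypothetical per-frame costs in milliseconds (placeholders, not benchmarks).
    decode_ms = 10.0       # video decoding
    base_encode_ms = 1.0   # smaller backbone
    cand_encode_ms = 3.0   # larger backbone, 3x slower in isolation

    ratio = end_to_end_slowdown(decode_ms, base_encode_ms, cand_encode_ms)
    print(f"encoder alone: {cand_encode_ms / base_encode_ms:.2f}x slower")
    print(f"end to end:    {ratio:.2f}x slower")
```

Under these placeholder numbers, a backbone that is 3x slower in isolation costs far less than 3x end to end, because the decode term dominates both totals. If decoding gets cheaper, the ratio moves back toward the encoder-only slowdown, which is why the backbone choice would need to be revisited.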