Closed rom1504 closed 2 years ago
interesting things to do:
please use this, it is the latest version in use https://github.com/rvencu/crawlingathome-gpu-hcloud/blob/main/clip_filter.py
this is ongoing, already 700M samples extracted
done and released
Multilingual clip can be used as a filter https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1
try it at https://colab.research.google.com/drive/1kOKsf-X0nS2_ael7scPa9raf0pBmxUgW#scrollTo=ScN8GQwN3xXq
First experiments show that clip is still ok at latin languages (french, portuguese, spanish) but very bad at non latin ones, especially non european ones. mclip keeps the same ordering of performance, but it is still ok for the most difficult languages, so maybe we want to decrease the similarity threshold from 0.3 to something like 0.26 for mclip.
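To make the threshold change concrete, here is a minimal sketch of a similarity filter with a per-model cutoff. It assumes the image and text embeddings were already computed elsewhere (by clip or mclip). The names `THRESHOLDS`, `filter_pairs`, and `EXAMPLE_PAIRS`, as well as the toy 2-d vectors, are hypothetical and only for illustration; this is not the code in `clip_filter.py`.

```python
import math

# Cutoffs from the discussion above: 0.3 for clip, 0.26 proposed for mclip.
THRESHOLDS = {"clip": 0.3, "mclip": 0.26}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, model="mclip"):
    """Keep only the pairs whose image/text similarity reaches the model's cutoff."""
    threshold = THRESHOLDS[model]
    return [
        p for p in pairs
        if cosine_similarity(p["image_emb"], p["text_emb"]) >= threshold
    ]


# Toy data (hypothetical): similarities are roughly 0.6, 0.28 and 0.0.
EXAMPLE_PAIRS = [
    {"url": "a.jpg", "image_emb": [1.0, 0.0], "text_emb": [0.6, 0.8]},
    {"url": "b.jpg", "image_emb": [1.0, 0.0], "text_emb": [0.28, 0.96]},
    {"url": "c.jpg", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
]

if __name__ == "__main__":
    for model in ("clip", "mclip"):
        kept = [p["url"] for p in filter_pairs(EXAMPLE_PAIRS, model)]
        print(model, kept)
```

With these toy vectors, the pair at about 0.28 is dropped at the 0.3 cutoff but kept at 0.26, which is the kind of borderline sample the lower mclip threshold would let through.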