LAION-AI / project-menu

Projects at LAION
MIT License

expand to multilingual captions #8

Closed: rom1504 closed 2 years ago

rom1504 commented 2 years ago

Multilingual CLIP (mCLIP) can be used as a filter: https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1

try it at https://colab.research.google.com/drive/1kOKsf-X0nS2_ael7scPa9raf0pBmxUgW#scrollTo=ScN8GQwN3xXq

First experiments show that CLIP is still OK on Latin languages (French, Portuguese, Spanish) but very bad on non-Latin ones, especially non-European languages. mCLIP keeps the same ordering of performance, but it is still OK on the most difficult languages, so maybe we want to lower the filtering threshold from 0.3 to something like 0.26 for mCLIP.
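To make the threshold idea concrete, here is a minimal sketch of similarity-based filtering. It assumes image and caption embeddings have already been computed by some CLIP-style model (for example the multilingual one linked above) and uses toy 2-d vectors instead of real embeddings. The names `THRESHOLDS`, `keep_pair` and `filter_pairs` are hypothetical and are not taken from the `clip_filter.py` scripts linked in this thread; 0.26 is only the tentative value suggested above.

```python
import math

# Assumed per-model cutoffs: 0.3 for CLIP, and the tentative 0.26 for mCLIP
# suggested in the comment above. These names are illustrative only.
THRESHOLDS = {"clip": 0.3, "mclip": 0.26}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def keep_pair(image_emb, text_emb, model="mclip"):
    """Keep an (image, caption) pair if its similarity reaches the cutoff."""
    return cosine_similarity(image_emb, text_emb) >= THRESHOLDS[model]


def filter_pairs(pairs, model="mclip"):
    """pairs: iterable of (url, caption, image_emb, text_emb) tuples."""
    return [
        (url, caption)
        for url, caption, image_emb, text_emb in pairs
        if keep_pair(image_emb, text_emb, model)
    ]


# Toy example: a pair with similarity of about 0.28 passes the mCLIP
# cutoff but would be dropped by the stricter CLIP cutoff.
image_emb = [1.0, 0.0]
text_emb = [0.28, 0.96]
kept_mclip = keep_pair(image_emb, text_emb, model="mclip")
kept_clip = keep_pair(image_emb, text_emb, model="clip")
```

A pair that sits between the two cutoffs is exactly the kind of sample that lowering the threshold for mCLIP would let through.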

rom1504 commented 2 years ago

interesting things to do:

rom1504 commented 2 years ago

probably https://github.com/ARKseal/crawlingathome-worker/blob/master/clip_filter.py

rvencu commented 2 years ago

please use this, it is the latest version in use https://github.com/rvencu/crawlingathome-gpu-hcloud/blob/main/clip_filter.py

rom1504 commented 2 years ago

this is ongoing, already 700M samples extracted

rom1504 commented 2 years ago

done and released