ai-forever / ru-clip

CLIP implementation for Russian language
Apache License 2.0

Training data #15

Open christophschuhmann opened 2 years ago

christophschuhmann commented 2 years ago

I would like to know what data ruCLIP was trained on. We at LAION have around 6B as-yet-unreleased image-text pairs, filtered with CLIP and mCLIP. Many of them are also in Russian. :)

If you'd like access, let me know.

Christoph Schuhmann www.laion.ai

shonenkov commented 2 years ago

@christophschuhmann Hello! Your LAION dataset is incredible. As a researcher, I would be interested in working with the Russian-language part of it.

ruCLIP was trained on datasets from open sources, datasets from the Sberbank ecosystem, and sample datasets translated using neural networks. We collected about 240M pairs, of which only 100M are in "native" Russian. The data turned out to be quite noisy, but there is definitely a signal in it for ruCLIP.

My colleague Andrey Kuznetsov sent you an e-mail at christoph_s@freenet.de. Could you discuss the conditions and rules for using your dataset with him? We would be very grateful for your help.

christophschuhmann commented 2 years ago

Nice to hear from you. I have not received an email at christoph_s@freenet.de yet; maybe it got caught in a spam filter. Could he send it again to christoph.schuhmann@laion.ai?

Waiting to hear from you :)