UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

USE multilingual = 16 languages, sentence-transformers distilled version = 15 languages? #2565

Open ryanheise opened 5 months ago

ryanheise commented 5 months ago

I'm wondering, is this possibly a mistake in the documentation?

distiluse-base-multilingual-cased-v1: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.

The original multilingual Universal Sentence Encoder upon which this is based supports 16 languages, but in the distilled version, Japanese is missing.

distiluse-base-multilingual-cased-v1 seems to work just fine with Japanese, and it appears to handle Japanese better than distiluse-base-multilingual-cased-v2, even though the former doesn't declare support for it while the latter does. :thinking:
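
For reference, this is the kind of quick, informal check I mean (just a sketch; the sentence pair is illustrative, and raw cosine scores from two different models aren't strictly comparable as absolute numbers):

```python
from sentence_transformers import SentenceTransformer, util

# Quick, informal check: score a Japanese sentence against its English
# translation with both models. The sentence pair is just an example.
ja = "今日は良い天気です。"   # "The weather is nice today."
en = "The weather is nice today."

for name in ("distiluse-base-multilingual-cased-v1",
             "distiluse-base-multilingual-cased-v2"):
    model = SentenceTransformer(name)
    embeddings = model.encode([ja, en], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"{name}: cos_sim(ja, en) = {score:.3f}")
```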

tomaarsen commented 5 months ago

Hello!

I suspect that this is indeed a mistake, although I can't be certain: I don't know whether the model is a direct copy of the one from Tensorflow Hub or whether it was further tuned for those 15 languages in particular.

ryanheise commented 5 months ago

I also can't find any information on that. I've read what I can about how model distillation is done, but that unfortunately doesn't definitively shed light on the status of Japanese in this model since we don't know the specifics of how it was trained. It would be nice to have an official answer if possible (@nreimers ?).

Regarding whether it is a direct copy of the one from Tensorflow, the vectors from distiluse-base-multilingual-cased-v1 and distiluse-base-multilingual-cased-v2 do seem to align with each other for the same inputs, but they don't seem to align at all with the vectors from the original multilingual USE. I may be missing something here.
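
To make "align" concrete, here is roughly the kind of comparison I'm doing (a sketch; both distiluse models produce 512-dimensional vectors, so their outputs can at least be compared directly, and the same idea applies to the TF Hub model with its own loading code):

```python
from sentence_transformers import SentenceTransformer, util

# Encode the same sentences with both models and compare the resulting
# vectors directly via cosine similarity.
sentences = [
    "A man is eating food.",
    "Un homme mange de la nourriture.",
]

v1 = SentenceTransformer("distiluse-base-multilingual-cased-v1")
v2 = SentenceTransformer("distiluse-base-multilingual-cased-v2")

emb_v1 = v1.encode(sentences, convert_to_tensor=True)
emb_v2 = v2.encode(sentences, convert_to_tensor=True)

for i, sentence in enumerate(sentences):
    cross_model = util.cos_sim(emb_v1[i], emb_v2[i]).item()
    print(f"cross-model cos_sim for '{sentence}': {cross_model:.3f}")
```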

But the alignment between models is also an interesting question in its own right. I selected distiluse-base-multilingual-cased-v2 for my use case because of the Japanese requirement, so if distiluse-base-multilingual-cased-v1 actually works better for Japanese, it would be a big cost saving to be able to swap one model for the other without reprocessing all of the data that has already been processed. That only works if the vectors produced for English, Spanish and so on under distiluse-base-multilingual-cased-v2 remain meaningful when compared to new vectors from distiluse-base-multilingual-cased-v1.

nreimers commented 5 months ago

I think I didn't include Japanese in the distillation process. But it might still work for Japanese, as the underlying pre-trained model supports Japanese.

I would recommend running your own tests and then selecting the model that works best.

ryanheise commented 5 months ago

Thanks @nreimers for getting back to me. I do have a couple of questions:

  1. Would it be appropriate for me to contribute a donation to help get Japanese properly included? (Or was there some reason that made including Japanese infeasible?)
  2. Or if you would advise I run the distillation process myself, would you be able to share any more details on the specifics you used beyond this example?

In the short term, though, I will do some more testing as you suggested, since, as I pointed out, the existing v1 model does actually seem to outperform v2 on Japanese despite Japanese not being included in its distillation process.
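
For context, my rough understanding of the distillation recipe, based on the multilingual training example in this repo, looks something like the sketch below. The teacher and student choices and the parallel-data file name are placeholders of my own, not what was actually used to train the distiluse models:

```python
from torch import nn
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: any existing multilingual sentence embedding model (placeholder;
# the real distiluse teacher was multilingual USE).
teacher_model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

# Student: a multilingual transformer with mean pooling, plus a dense layer
# so the student's output dimension matches the teacher's.
word_embedding_model = models.Transformer("distilbert-base-multilingual-cased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=teacher_model.get_sentence_embedding_dimension(),
    activation_function=nn.Tanh(),
)
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])

# Parallel sentences (tab-separated source/translation per line); the file
# name here is made up.
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-sentences-en-ja.tsv.gz")

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
# MSE between student and teacher vectors: both the source sentence and its
# translation are pushed toward the teacher's embedding of the source.
train_loss = losses.MSELoss(model=student_model)

student_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
    output_path="output/distiluse-with-japanese",
)
```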