UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.15k stars 2.46k forks

Sentence embedding model with emojis support #1177

Open lualeperez opened 3 years ago

lualeperez commented 3 years ago

Hello all,

I was wondering if you guys have a sentence embedding transformer with emojis support. If not, could you please point me to a platform that does have it?

Many thanks,

Alejandro

nreimers commented 3 years ago

What do you mean by emoji support?

lualeperez commented 3 years ago

Thanks for the reply.

I think I wasn't clear enough. What I mean by emoji support is whether the pretrained model was trained on a dataset that contained emojis, so that they are in the model's vocabulary. And beyond that, whether the pretrained model was fine-tuned for sentence embeddings on a dataset of semantically similar sentences that contained emojis.

I can see that all-distilroberta-v1 was built on a pretrained model (distilroberta-base) trained on a dataset that contained emojis. I did the simple check of model.tokenizer.decode(model.tokenizer.encode(text)), where the text contains emojis, and I get back the same text. For other models like all-MiniLM-L6-v2, the emojis get decoded as the [UNK] token, meaning that the pretrained model was trained on a dataset without emojis.
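The round-trip check described above can be sketched as follows. This is a minimal version assuming only the tokenizer is needed, so it uses transformers' AutoTokenizer to load the tokenizers of the two public checkpoints mentioned here; the example emoji and helper name are illustrative.

```python
from transformers import AutoTokenizer

def preserves_emojis(model_name: str, text: str = "I love pizza 🍕") -> bool:
    """Encode then decode `text` and report whether the emoji survives.

    Byte-level BPE tokenizers (RoBERTa-style) reproduce the emoji;
    WordPiece vocabularies without emoji entries replace it with [UNK].
    """
    tok = AutoTokenizer.from_pretrained(model_name)
    decoded = tok.decode(tok.encode(text, add_special_tokens=False))
    return "🍕" in decoded

# RoBERTa-based tokenizer: emoji round-trips intact
print(preserves_emojis("sentence-transformers/all-distilroberta-v1"))
# BERT WordPiece tokenizer: emoji becomes [UNK]
print(preserves_emojis("sentence-transformers/all-MiniLM-L6-v2"))
```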

Looking at the model card of all-distilroberta-v1 (https://huggingface.co/sentence-transformers/all-distilroberta-v1#training-data), I see that it was fine-tuned on a set of datasets for which it is not clear whether they contain emojis. So I don't know if I can use this model for semantic similarity of sentences that contain them.

Is there a model whose pretrained base was trained on a dataset with emojis, and that was then fine-tuned on a dataset containing emojis?

thanks

nreimers commented 3 years ago

Hi @lualeperez The MiniLM model uses the BERT tokenizer, which was trained on Wikipedia, so it covers few emojis.

DistilRoBERTa was trained on Common Crawl, i.e., a large web corpus with many emojis. I would use that.

Downstream these models have been fine-tuned on data from Reddit, StackExchange, Yahoo Answers and several more, which also contain emojis.

But it is unclear how much semantic meaning an emoji encodes.

lualeperez commented 3 years ago

Thank you very much for the info. I will try it out.