UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How to handle OOV words #414

Open harrypotter90 opened 4 years ago

harrypotter90 commented 4 years ago

Hi, I am using one of the "Sentence similarity" models, e.g. 'distilbert-base-nli-stsb-quora-ranking'. My domain contains quite a few unique words, so I am sure this will come up in my use case. How can I handle OOV words when computing sentence embeddings?

Is there a way to get the vocabulary list that was used to train the model?

Thanks

nreimers commented 4 years ago

Have a look at the BERT paper and the section about word pieces.

BERT (and other transformer networks) don't use words. They have a fixed-size vocabulary consisting of character n-grams (word pieces). So OOV only happens if your word contains characters that are not in that vocabulary. If your word consists only of characters a-z, the worst case is that the word is broken down into individual characters.
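For example, the following sketch (using the Hugging Face transformers tokenizer; bert-base-uncased is chosen purely for illustration) shows how a rare word is split into known word pieces instead of becoming OOV:

```python
from transformers import AutoTokenizer

# WordPiece tokenizer of a BERT-style model (bert-base-uncased is just an example).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare, domain-specific word is not mapped to an unknown token;
# it is split into sub-word pieces that all exist in the fixed vocabulary.
pieces = tokenizer.tokenize("dysreflexia")
print(pieces)                          # several '##'-prefixed sub-word pieces
print(tokenizer.unk_token in pieces)   # False: no piece is OOV
```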

harrypotter90 commented 4 years ago

Thanks for your quick reply, I understand now. But I have realized I am using a different model, based on RoBERTa.

I guess I have to read its documentation to see how it handles OOV.

 "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 250002
}

And here "vocab_size", I thought its some kind of a number of words.

The problem is that if I use, for example, the input "autonomic dysreflexia" and compute its similarity score against these two texts: ["dysreflexia is a syndrome", "latent pre"], then "latent pre" has the highest match with 0.88, which I am not able to make sense of. Also, note that it works great for other examples.

nreimers commented 4 years ago

RoBERTa uses the same approach, but with a larger vocab size (more n-grams), so words are less often broken down into smaller pieces.

For individual words, contextualized word embeddings / sentence transformers do not work that well in my experience, especially as they were not trained for it. They were trained on complete sentences of general English, i.e., for specialized terms the results can be quite odd.

harrypotter90 commented 3 years ago

Hi, for example, could this be the vocabulary for the XLM-R model: https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models ?

This would mean that these are the words this model has seen.

Thoughts?

harrypotter90 commented 3 years ago

I have realized that this is the vocabulary the model has been trained with. For example, "respira@@" is in that list, which covers respiratory, respiration, etc. So words that are not in this list are out-of-vocabulary words.
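If you just want to check how the tokenizer handles a given word, you can inspect it directly (a sketch using the Hugging Face tokenizer for xlm-roberta-base; the vocabulary file linked above uses a different on-disk format, but the idea is the same):

```python
from transformers import AutoTokenizer

# XLM-R uses a SentencePiece vocabulary; xlm-roberta-base is used as an example.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

vocab = tokenizer.get_vocab()           # dict: token string -> id
print(len(vocab))                       # should match vocab_size in the config

word = "dysreflexia"
pieces = tokenizer.tokenize(word)
print(pieces)                           # sub-word pieces the word is split into
print(all(p in vocab for p in pieces))  # every piece is in the vocabulary
```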

freeIsa commented 3 years ago

Just wanted to add that emojis are often relevant OOV tokens; at least this is the case with distiluse-base-multilingual-cased. Right now I am experimenting with xlm-r-distilroberta-base-paraphrase-v1 and it looks like emojis are in the tokenizer vocabulary, is that right?
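One way to verify this is to tokenize a few emojis and check whether they fall back to the unknown token (a sketch using the Hugging Face tokenizer; xlm-roberta-base stands in for the tokenizer behind xlm-r-distilroberta-base-paraphrase-v1):

```python
from transformers import AutoTokenizer

# Tokenizer of the base model; chosen as a stand-in for the paraphrase model.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

for emoji in ["😀", "🚀", "🤷"]:
    ids = tokenizer.encode(emoji, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(ids)
    is_unknown = tokenizer.unk_token_id in ids
    print(emoji, tokens, "-> OOV" if is_unknown else "-> in vocabulary")
```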

nreimers commented 3 years ago

Have a look here, how XLM-R was created: https://arxiv.org/abs/1911.02116

freeIsa commented 3 years ago

Have a look here, how XLM-R was created: https://arxiv.org/abs/1911.02116

Thanks for the pointer, I will look into it. Being based on CommonCrawl dumps rather than just Wikipedia data, the XLM-R vocabulary should also contain emojis, but I will check this in depth.

Samarthagarwal23 commented 3 years ago

The problem is that if I use, for example, the input "autonomic dysreflexia" and compute its similarity score against these two texts: ["dysreflexia is a syndrome", "latent pre"], then "latent pre" has the highest match with 0.88, which I am not able to make sense of. Also, note that it works great for other examples.

@nreimers I had a similar question: for fine-tuning sentence-BERT models, approximately how many data points does it take for the model to make sense of words like "dysreflexia"? In my use case I have similar words with 10-15 examples each, but the similarity results still aren't good.

nreimers commented 3 years ago

@Samarthagarwal23 It depends on how well the underlying model knows these words, i.e., were they in the corpus the masked language model was trained on? If not, the word would need to appear in the training data. Just having similar words in your fine-tuning data would not be sufficient.

Interesting paper on this: https://arxiv.org/abs/2004.10964
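In the spirit of that paper, one option when domain words are missing from the pretraining corpus is to continue masked-language-model pretraining on in-domain text before fine-tuning the sentence embeddings. A minimal sketch with the Hugging Face Trainer (not a recipe from this thread; model name, data, and hyperparameters are placeholders):

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "xlm-roberta-base"  # base model, chosen as an example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# In-domain sentences containing the specialized terminology (placeholder data).
domain_sentences = [
    "Autonomic dysreflexia is a syndrome seen after spinal cord injury.",
    "Patients with dysreflexia may present with sudden hypertension.",
]

# Tiny torch Dataset wrapping the tokenized sentences.
class DomainDataset(torch.utils.data.Dataset):
    def __init__(self, texts):
        self.encodings = tokenizer(texts, truncation=True, max_length=128)
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

# The collator pads batches and randomly masks 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-domain-adapted",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=DomainDataset(domain_sentences),
    data_collator=collator,
)
trainer.train()

# The adapted checkpoint can then serve as the base for sentence-transformers fine-tuning.
trainer.save_model("mlm-domain-adapted")
```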

Samarthagarwal23 commented 3 years ago

Thanks @nreimers. For such cases, do you think combining with a simpler approach might lead to better performance? For example, combining embeddings (SBERT + CountVectorizer / TF-IDF vectors for some proper nouns) for similarity search? Any experience or research you are aware of?

nreimers commented 3 years ago

Yes, combining dense and sparse (lexical) retrieval is helpful: https://arxiv.org/abs/2005.00181 https://arxiv.org/pdf/2004.13969.pdf

Combination can be quite easy:

  1. You retrieve docs with the dense approach & cosine similarity
  2. You retrieve docs with BM25
  3. You compute for every doc a score: final_score = bm25_score_i / max(bm25_score) + lambda * cos_score_i / max(cos_score)
  4. Sort your hits based on final_score

Here lambda is a weight that shifts influence between BM25 and cosine similarity, and max() is the maximal score over all retrieved hits. A small scoring sketch follows below.
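As a rough illustration of this scoring scheme (not code from the library; rank_bm25 and the toy corpus are assumptions made for the example):

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "dysreflexia is a syndrome",
    "latent pre",
    "autonomic dysreflexia occurs after spinal cord injury",
]
query = "autonomic dysreflexia"

# Sparse scores: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.split() for doc in corpus])
bm25_scores = np.array(bm25.get_scores(query.split()))

# Dense scores: cosine similarity of sentence embeddings.
model = SentenceTransformer("distilbert-base-nli-stsb-quora-ranking")
doc_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
cos_scores = util.cos_sim(query_emb, doc_emb)[0].cpu().numpy()

# final_score = bm25_i / max(bm25) + lambda * cos_i / max(cos)
lam = 1.0  # weight between BM25 and cosine similarity
final = bm25_scores / bm25_scores.max() + lam * cos_scores / cos_scores.max()

# Sort hits by the combined score, best first.
for idx in np.argsort(-final):
    print(f"{final[idx]:.3f}  {corpus[idx]}")
```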

Samarthagarwal23 commented 3 years ago

@nreimers Thanks for sharing the papers and the logic.

What do you think about combining dense + sparse (maybe CountVectorizer) into a single embedding for retrieval? One might have to give the sparse embedding more weight in that case, though.

nreimers commented 3 years ago

Not sure this will work, as retrieving (and also storing) dense and sparse vectors is quite different. I would keep them as two separate vectors.

Samarthagarwal23 commented 3 years ago

Yes, combining dense and sparse (lexical) retrieval is helpful: https://arxiv.org/abs/2005.00181 https://arxiv.org/pdf/2004.13969.pdf

Combination can be quite easy:

  1. You retrieve docs with the dense approach & cosine similarity
  2. You retrieve docs with BM25
  3. You compute for every doc a score: final_score = bm25_score_i / max(bm25_score) + lambda * cos_score_i / max(cos_score)
  4. Sort your hits based on final_score

Here lambda is a weight that shifts influence between BM25 and cosine similarity, and max() is the maximal score over all retrieved hits.

Thanks, it worked well in my use case (combining lexical and dense retrieval).

@nreimers any resources on lexical + dense embeddings in multilingual settings? (For example, English, Chinese, Bahasa)

nreimers commented 3 years ago

@Samarthagarwal23 Sadly, I don't know of any.