dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Allow assigning backoff method to fastText embedding models #434

Open · Ishitori opened this issue 5 years ago

Ishitori commented 5 years ago

I get a KeyError exception if I use the cc.zh.300 embedding with load_ngrams=True and provide a token that has no default embedding.

My expectation is that for an unknown token the ngram strategy would kick in and I would get a combination of subword vectors. But maybe due to the length of the token (only 2 characters), or maybe because I specifically use the cc.zh.300 embedding, I get an error instead.

A minimal reproducible example is below:

import mxnet as mx
import gluonnlp
from gluonnlp import Vocab, data

# Load the pre-trained Chinese fastText embedding with subword ngrams enabled.
zh_embedding = gluonnlp.embedding.create('fasttext', source='cc.zh.300', load_ngrams=True)
# Build a vocabulary containing a single two-character token.
cntr = data.count_tokens(['猴姆'])
vocab = Vocab(cntr)
vocab.set_embedding(zh_embedding)  # raises KeyError

The exception is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/gluonnlp/vocab/vocab.py", line 333, in set_embedding
    new_idx_to_vec[1:, col_start:col_end] = embs[self._idx_to_token[1:]]
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/gluonnlp/embedding/token_embedding.py", line 544, in __getitem__
    for token in tokens
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/gluonnlp/embedding/token_embedding.py", line 544, in <listcomp>
    for token in tokens
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/gluonnlp/model/train/embedding.py", line 444, in __getitem__
    raise KeyError

If I change the token, for example by appending an exclamation mark ('猴姆!'), the error disappears. The problem also does not appear if I change the embedding to 'wiki.zh'.

GluonNLP version is '0.4.1'.

@leezu

leezu commented 5 years ago

Your guess is correct: the minimum ngram size in the pre-trained model is 3, so no vectors can be obtained for words with fewer than 3 characters. The fastText model is not particularly helpful for Chinese in that sense, as most words have fewer characters. For Chinese, character ngrams should rather be computed over radicals (but that is not what fastText does). Would you be interested in working on a radical-based model?
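
To make the failure mode concrete, here is a minimal sketch of character-ngram extraction (illustrative only, not GluonNLP's actual code; fastText's '<'/'>' word-boundary markers are left out for simplicity). A two-character token simply produces no candidate ngrams of length 3 or more:

def char_ngrams(token, minn=3, maxn=6):
    # All character ngrams of token with length between minn and maxn.
    return [token[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)]

print(char_ngrams('猴姆'))    # []        -- 2 characters, no ngram of length >= 3
print(char_ngrams('猴姆!'))   # ['猴姆!'] -- 3 characters, exactly one 3-gram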

Ishitori commented 5 years ago

I see. Well, why does it fail then? I would expect to receive an empty vector for that case.

leezu commented 5 years ago

Well, the assumption was that you would be more surprised to find zero vectors without knowing why than to discover the problem through a KeyError.

leezu commented 5 years ago

The KeyError just shows that fastText is not sufficient for this use-case. With a different embedding method, e.g. http://www.cse.ust.hk/~yqsong/papers/2017-EMNLP-ChineseEmbedding.pdf, such an error would not be raised.

Nevertheless, it would be nice to make the behavior configurable. I propose to generalize the current unknown_lookup in TokenEmbedding into a general backoff functionality and make it easily configurable. To meet your expectation, you would then assign some zero_vector_fn as the backoff method of fastText; this is currently not possible.
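
For illustration, a rough user-side sketch of what such a backoff could look like (hypothetical: neither a backoff argument nor zero_vector_fn exists in GluonNLP 0.4.1; the names are placeholders for the proposal above):

import mxnet as mx

def zero_vector_fn(tokens, emb_dim=300):
    # Hypothetical backoff: return a zero vector for each token the
    # embedding model cannot handle.
    return mx.nd.zeros((len(tokens), emb_dim))

def lookup_with_backoff(embedding, tokens, backoff=zero_vector_fn):
    # Try the normal lookup first; fall back to the backoff method for
    # tokens that raise KeyError (e.g. too short for any subword ngram).
    vecs = []
    for token in tokens:
        try:
            vecs.append(embedding[token])
        except KeyError:
            vecs.append(backoff([token])[0])
    return mx.nd.stack(*vecs)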

akshayamadhuri commented 1 year ago

import mxnet as mx
import gluonnlp
from gluonnlp import Vocab, data

embedding_model = 'cc.zh.300'  # or 'wiki.zh'
zh_embedding = gluonnlp.embedding.create('fasttext', source=embedding_model, load_ngrams=True)

# Pad tokens shorter than the minimum ngram length (3) so that at least
# one character ngram can be extracted.
def pad_token(token):
    if len(token) < 3:
        return token + '@@'
    else:
        return token

tokens = ['猴姆']
padded_tokens = [pad_token(token) for token in tokens]
cntr = data.count_tokens(padded_tokens)
vocab = Vocab(cntr)
vocab.set_embedding(zh_embedding)

print(vocab.embedding[padded_tokens[0]])
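
Note that this padding workaround changes which subword ngrams are looked up ('猴姆@@' instead of '猴姆'), so while it avoids the KeyError, the resulting vector is at best an approximation of the original token's meaning.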