dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

vocabulary set_embedding(glove) maps all OOV terms to the same vector if no lookup provided #680

Open khui opened 5 years ago

khui commented 5 years ago

When attaching a pre-trained embedding to a built vocab, namely vocab.set_embedding(glove_em), by design all vocabulary terms that are not present in the pre-trained embedding are mapped to the '<unk>' vector produced by init_unknown_vec (default nd.zeros).

This is a bit awkward, since '<unk>' was already defined when building the vocabulary (e.g., via a minimal term-frequency threshold), and one may expect these OOV terms to at least be mapped to different random vectors rather than all sharing the same vector.

Of course, using FastText with ngrams enabled, or providing an unknown_lookup, can resolve this. However, as the default behavior it is still a bit counter to convention.

import gluonnlp
from mxnet import nd

text_data = "Computing-Tabulating-Recording \n affective-motivational \n teacher-dealers"
counter = gluonnlp.data.count_tokens(text_data.split())
vocab = gluonnlp.Vocab(counter, unknown_token='<unk>', min_freq=1)  # originally self.min_freq
em = gluonnlp.embedding.create('GloVe',
                               unknown_token='<unk>',
                               source='glove.840B.300d',
                               allow_extend=True,
                               init_unknown_vec=nd.random.uniform)
vocab.set_embedding(em)
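
For illustration, one way to see the reported behaviour is to compare the vectors of two OOV terms after set_embedding (a sketch that continues the snippet above and assumes these hyphenated terms are not covered by glove.840B.300d):

# Both OOV terms receive the embedding's single unknown vector,
# so their rows in vocab.embedding are identical.
v1 = vocab.embedding['affective-motivational']
v2 = vocab.embedding['teacher-dealers']
print((v1 == v2).min().asscalar())  # 1.0, i.e. the two vectors are identical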
khui commented 5 years ago

One solution is to write a subscriptable class as the lookup, though that looks like a workaround to me. Noting it here for future reference.

from mxnet import nd
import gluonnlp


class UnkLookup(object):

    def __init__(self, embedding_dim=50):
        self.embedding_dim = embedding_dim
        self.token_vec = {}

    def __getitem__(self, token):
        # Lazily assign each unseen token its own random vector.
        if token not in self.token_vec:
            self.token_vec[token] = nd.random.uniform(shape=(self.embedding_dim,))
        return self.token_vec[token]

    def __contains__(self, token):
        # Hack to avoid using the embedding's own vector for '<unk>', if there is one.
        return True


# embedding_name, embedding_root and embedding_src are placeholders from my setup.
unk_lookup = UnkLookup(embedding_dim=50)
em = gluonnlp.embedding.create(embedding_name,
                               embedding_root=embedding_root,
                               unknown_token=None,
                               source=embedding_src,
                               allow_extend=True,
                               unknown_lookup=unk_lookup)
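
With this lookup attached, set_embedding goes through UnkLookup for every term the pre-trained source does not cover, so each OOV term ends up with its own random vector (a sketch that reuses the vocab from the first snippet and assumes embedding_dim matches the chosen source):

vocab.set_embedding(em)
v1 = vocab.embedding['affective-motivational']
v2 = vocab.embedding['teacher-dealers']
print((v1 == v2).min().asscalar())  # 0.0, i.e. the two OOV terms no longer share a vector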
szha commented 5 years ago

Thanks for reporting it. Let's take this as part of the 0.7.0 roadmap as a usability-improvement item. Do you have a design in mind for the improvement? cc @astonzhang @leezu

astonzhang commented 5 years ago

Thanks. Please see my comments in the code.

# Case 1: '<unk>' in the text data SHOULD be considered as the representation for any unknown token

>>> import gluonnlp
>>> text_data = ['<unk>', 'a', 'b']  # Say, we want any unknown token to map to '<unk>' in text_data
>>> counter = gluonnlp.data.count_tokens(text_data)
>>> my_vocab = gluonnlp.Vocab(counter)  # Use default unknown token '<unk>'
>>> my_vocab.idx_to_token
['<unk>', '<pad>', '<bos>', '<eos>', 'a', 'b']
>>> my_vocab['x']  # any unknown token is mapped to '<unk>'
0

# Case 2: '<unk>' in the text data SHOULD NOT be considered as the representation for any unknown token

# Say, now we want any unknown token to map to a token other than '<unk>' in text_data
>>> my_vocab = gluonnlp.Vocab(counter, unknown_token='<<UnK>>')
>>> my_vocab.idx_to_token
['<<UnK>>', '<pad>', '<bos>', '<eos>', '<unk>', 'a', 'b']
>>> my_vocab['x']  # any unknown token is mapped to '<<UnK>>'
0
>>> my_vocab['<unk>']  # '<unk>' is a common token
4
khui commented 5 years ago

@astonzhang Thanks!

But I agree it is mostly about the logic behind the design when using a pre-trained embedding for an established vocabulary. Namely, one may not want to treat a word from the vocabulary that is not present in the pre-trained embedding as <unk> (from the embedding).

One suggestion is to update Vocab.set_embedding(self, *embeddings). In particular, for the line new_idx_to_vec[1:, col_start:col_end] = embs[self._idx_to_token[1:]], first check whether each term exists in embs before the assignment.

        for embs in embeddings:
            if embs and embs.idx_to_vec is not None:
                col_end = col_start + embs.idx_to_vec.shape[1]
                # Concatenate vectors of the unknown token.
                new_idx_to_vec[0, col_start:col_end] = embs.idx_to_vec[0]
                new_idx_to_vec[1:, col_start:col_end] = embs[self._idx_to_token[1:]]
                col_start = col_end
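
A rough sketch of what that per-token check could look like (hypothetical, not the current implementation; nd.random.uniform stands in for whatever per-token initializer would be chosen, and the membership test relies on TokenEmbedding's __contains__):

        for embs in embeddings:
            if embs and embs.idx_to_vec is not None:
                col_end = col_start + embs.idx_to_vec.shape[1]
                # Keep the embedding's unknown-token vector for index 0.
                new_idx_to_vec[0, col_start:col_end] = embs.idx_to_vec[0]
                for i, token in enumerate(self._idx_to_token[1:], start=1):
                    if token in embs:
                        # Term is covered by the pre-trained embedding.
                        new_idx_to_vec[i, col_start:col_end] = embs[token]
                    else:
                        # OOV term: draw a fresh random vector instead of
                        # reusing the embedding's single unknown vector.
                        new_idx_to_vec[i, col_start:col_end] = nd.random.uniform(
                            shape=(col_end - col_start,))
                col_start = col_end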
leezu commented 5 years ago

Hi @khui, thanks for raising this issue. Actually, init_unknown_vec is not part of the public API of TokenEmbedding but is only used in its constructor. Thus we cannot use init_unknown_vec inside Vocab.set_embedding. Your proposal in https://github.com/dmlc/gluon-nlp/issues/680#issuecomment-486740312 is currently the only supported way.

Do you see any issue with your proposal https://github.com/dmlc/gluon-nlp/issues/680#issuecomment-486740312? To make that easier, we could add a naive unknown lookup class to our API:

from mxnet import nd

class UnkLookup(object):
    def __init__(self, init_vec=nd.random.uniform, embedding_dim=50):
        self.init_vec = init_vec
        self.embedding_dim = embedding_dim
        self.token_vec = {}

    def __getitem__(self, token):
        # Lazily create a separate vector for each unseen token.
        if token not in self.token_vec:
            self.token_vec[token] = self.init_vec(shape=(self.embedding_dim,))
        return self.token_vec[token]

    def __contains__(self, token):
        # We can construct a vector for any token.
        return True

Then your workaround will reduce to two statements (instead of requiring a class definition).
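
For example, with such a class shipped in the API, the user code could shrink to roughly the following (a sketch; UnkLookup here refers to the naive class proposed above, not an existing gluonnlp API):

unk_lookup = UnkLookup(init_vec=nd.random.uniform, embedding_dim=300)
em = gluonnlp.embedding.create('GloVe', source='glove.840B.300d',
                               unknown_token=None, allow_extend=True,
                               unknown_lookup=unk_lookup)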

I agree with you that it is not intuitive that vocab.set_embedding will use the same vector for every unknown token if the provided embedding has no unknown_lookup. Unfortunately any change will require breaking backwards compatibility.

In https://github.com/dmlc/gluon-nlp/pull/750/ I recently began work to refactor the TokenEmbedding class. Currently all changes are backwards compatible, but we may consider breaking the API and removing the current init_unknown_vec functionality in favor of only using unknown_lookup. This would solve your use-case, as it would essentially make your proposed workaround the default behavior.

@szha @astonzhang @eric-haibin-lin I'd suggest removing unknown_token completely from the TokenEmbedding class and replacing its behavior with a UnkLookup that returns the same fixed vector for any token. In that case the init_unknown_vec constructor argument would also be removed, and in essence the else branch of the constructor introduced in https://github.com/dmlc/gluon-nlp/pull/750 could be removed, keeping only (with slight extensions):

https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/embedding/token_embedding.py#L210-L239
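
For illustration, the replacement for the old unknown_token / init_unknown_vec behaviour could be a trivial constant lookup along these lines (a hypothetical sketch, not existing library code):

from mxnet import nd

class ConstantUnkLookup(object):
    """Return the same fixed vector for every token, reproducing the old
    unknown_token / init_unknown_vec behaviour through unknown_lookup."""

    def __init__(self, init_vec=nd.zeros, embedding_dim=50):
        self.vec = init_vec(shape=(embedding_dim,))

    def __getitem__(self, token):
        return self.vec

    def __contains__(self, token):
        return True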

Let's discuss if you have any concerns or other proposals.

szha commented 5 years ago

I think it makes total sense to use a unified way for unknown token look-up. Just keep in mind that we should pay attention to how we deprecate APIs. I started a discussion on this topic in #790.