dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/

Add embedding related methods in numpy version #1263

Closed acphile closed 4 years ago

acphile commented 4 years ago

Description

Create embedding-related methods in 'gluonnlp.embedding':

- embed_loader.list_sources: get valid token embedding names and their pre-trained file names.
- embed_loader.load_embeddings: load a pretrained embedding file to build an embedding matrix for a given Vocab.
- evaluation.CosineSimilarity: a function to compute the cosine similarity.
- evaluation.HyperbolicCosineSimilarity: a function to compute the cosine similarity in hyperbolic space.
- evaluation.ThreeCosAdd: a class for the 3CosAdd analogy function.
- evaluation.ThreeCosMul: a class for the 3CosMul analogy function.
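
A minimal usage sketch of the API listed above. The names follow this description, but the exact signatures and the Vocab constructor are assumptions rather than final code:

import numpy as np
from gluonnlp.data import Vocab                     # assumed location of Vocab
from gluonnlp.embedding import list_sources, load_embeddings

print(list_sources())                               # valid embedding names and file names
vocab = Vocab(['king', 'queen', 'man', 'woman'])    # assumed constructor
matrix = load_embeddings(vocab, 'glove.6B.50d')     # one 50-d vector per vocab token

# What evaluation.CosineSimilarity measures, written out with plain numpy
# (assuming `matrix` is, or is converted to, a regular numpy array):
a, b = matrix[vocab['king']], matrix[vocab['queen']]
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))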

About evaluation

Currently the implementations of embedding.evaluation are not very satisfactory. Suggestions are welcome.

Checklist

Essentials

Changes

Comments

cc @dmlc/gluon-nlp-team

codecov[bot] commented 4 years ago

Codecov Report

Merging #1263 into numpy will increase coverage by 0.23%. The diff coverage is 83.04%.

Impacted file tree graph

@@            Coverage Diff             @@
##            numpy    #1263      +/-   ##
==========================================
+ Coverage   82.44%   82.67%   +0.23%     
==========================================
  Files          38       41       +3     
  Lines        5450     5702     +252     
==========================================
+ Hits         4493     4714     +221     
- Misses        957      988      +31     
Impacted Files | Coverage Δ
--- | ---
src/gluonnlp/embedding/embed_loader.py | 81.52% <81.52%> (ø)
src/gluonnlp/__init__.py | 100.00% <100.00%> (ø)
src/gluonnlp/attention_cell.py | 79.74% <100.00%> (-0.26%) ↓
src/gluonnlp/embedding/__init__.py | 100.00% <100.00%> (ø)
src/gluonnlp/embedding/_constants.py | 100.00% <100.00%> (ø)
src/gluonnlp/op.py | 60.00% <100.00%> (+2.10%) ↑
src/gluonnlp/models/roberta.py | 88.78% <0.00%> (-4.48%) ↓
src/gluonnlp/models/xlmr.py | 86.88% <0.00%> (-1.12%) ↓
src/gluonnlp/layers.py | 86.78% <0.00%> (-0.45%) ↓
src/gluonnlp/models/transformer_xl.py | 82.71% <0.00%> (-0.22%) ↓
... and 16 more
sxjscience commented 4 years ago

Is it possible to get the embedding of words in raw text if it's a HybridBlock? We may want to calculate the embedding from raw text or for out-of-vocabulary words, which is the purpose of FastText.



@acphile commented on this pull request, in src/gluonnlp/embedding/embed_loader.py (https://github.com/dmlc/gluon-nlp/pull/1263#discussion_r456769686):

+    for cls_name, embedding_cls in text_embedding_reg.items():
+        if pretrained_name_or_dir in embedding_cls:
+            source = pretrained_name_or_dir
+            embedding_dir = os.path.join(root_path, cls_name)
+            file_name, file_hash = embedding_cls[source]
+            url = _get_file_url(cls_name, file_name)
+            file_path = os.path.join(embedding_dir, file_name)
+            if not os.path.exists(file_path) or not check_sha1(file_path, file_hash):
+                logging.info('Embedding file {} is not found. Downloading from Gluon Repository. '
+                             'This may take some time.'.format(file_name))
+                download(url, file_path, sha1_hash=file_hash)
+            return file_path
+
+    return None
+
+def load_embeddings(vocab, pretrained_name_or_dir='glove.6B.50d', unknown='',

I think we can create a base class EmbeddingModel(HybridBlock) to serve as the base class of embedding models, and we can attach some evaluation functions to it. For just loading an embedding matrix, we can simply keep the current load_embeddings and let users manually call set_data, or have a class WordEmbedding(EmbeddingModel) and move the functionality of load_embeddings into that class. Complex embedding models like FastText or a character-level CNN can then be implemented on top of EmbeddingModel. These embedding models could be implemented in models/.


acphile commented 4 years ago

> Is it possible to get the embedding of words in raw text if it's a HybridBlock? We may want to calculate the embedding from raw text or for out-of-vocabulary words, which is the purpose of FastText.

For getting the embeddings for unknown words, there are the following situations:

1. words in the vocabulary but not in the embedding file

The default method is to sample from a normal distribution, and users can pass unk_method to define their own way. I ran some simple tests showing that fasttext.cc actually computes these vectors faster than the original Gluon approach in v0.9.x, so you can simply do as follows:

import fasttext
import numpy as np

fast = fasttext.load_model('model.bin')
def ngram(words):
    # subword-based vectors for tokens missing from the embedding file
    return np.array([fast[word] for word in words])
embedding_matrix = load_embeddings(vocab, source, unk_method=ngram)

In this case, we get an embedding matrix for a given vocabulary. I have now also added the option to use the embedding file itself as the vocabulary:

embedding_matrix, vocab = load_embeddings(vocab=None, pretrained_name_or_dir=source)

2. words not in the vocabulary

I think generally we just use the embedding of <unk> for these OOV words. Of course we can still generate some initial embedding vectors with FastText, but since they are not updated during training, I don't think it is very useful.

To further retain the information from these words, in practice we may use some character-level NN. For example, we may use a character-level CNN to compute the embedding of a word, with parameters that are learnable during training. That's why I think we should create a base class EmbeddingModel(HybridBlock) and provide some embedding models (in other words, parts of the neural network) so that it would be easier for users to build their NLP models. EmbeddingModel (and its child classes) can serve as a black box: inside there may be only a simple embedding lookup, or some NN layers. What we want is to input List[word index] or List[List[character index]] (maybe both, or others) and get word representations, which can even be contextualized representations. Then we can define the embedding evaluation functions as class methods in a more general way.
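
To make this concrete, here is a rough sketch of the proposed hierarchy (names and signatures are illustrative only, not a final API):

from mxnet.gluon import HybridBlock, nn

class EmbeddingModel(HybridBlock):
    """Base class: maps integer token indices to embedding vectors."""
    def forward(self, token_ids):
        raise NotImplementedError

class WordEmbedding(EmbeddingModel):
    """Plain lookup-table embedding, e.g. filled from load_embeddings() via set_data."""
    def __init__(self, vocab_size, embed_size, **kwargs):
        super().__init__(**kwargs)
        self.embed = nn.Embedding(vocab_size, embed_size)

    def forward(self, token_ids):
        return self.embed(token_ids)

A FastText or character-level CNN variant would subclass EmbeddingModel in the same way, only changing what happens inside forward, and shared evaluation methods could live on the base class.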

sxjscience commented 4 years ago

The advantage of fasttext is that there is no need to care about OOV words. Thus, you may still need to offer this functionality.

sxjscience commented 4 years ago

@acphile This is the advantage of using subwords. Basically, there will be no (or fewer) OOV words if you are using a subword representation. For example, GPT-2/GPT-3 chose byte-level BPE encoding because there will never be OOV words. Also, you may check Section 2.1 of https://arxiv.org/pdf/1911.03688.pdf to see how different models adopt different strategies for dealing with the OOV problem.

acphile commented 4 years ago

> @acphile This is the advantage of using subwords. Basically, there will be no (or fewer) OOV words if you are using a subword representation. For example, GPT-2/GPT-3 chose byte-level BPE encoding because there will never be OOV words. Also, you may check Section 2.1 of https://arxiv.org/pdf/1911.03688.pdf to see how different models adopt different strategies for dealing with the OOV problem.

I understand that, and in my context "vocabulary" refers not only to a vocabulary of words but also to any lookup dict which records the distinct tokens for a certain kind of input. For example, a trigram vocabulary records the most frequent trigrams occurring in the dataset. Raw text can be transformed into several different types of model input (like List[word], List[List[ngram]], List[BPE]) by a tokenizer and further transformed into List[int] or List[List[int]] by Vocab, so the embedding part only needs to handle integers; that's why I previously suggested using EmbeddingModel(HybridBlock). Of course FastText is very useful, and I think it is better to implement it as class FastText(EmbeddingModel) rather than a standalone class FastText. It is somewhat like https://github.com/dmlc/gluon-nlp/blob/v0.9.x/src/gluonnlp/model/train/embedding.py#L175, but I think we could improve that implementation.
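
A tiny illustration of that pipeline with a toy trigram vocabulary (all names and values here are hypothetical):

def trigrams(word):
    # fastText-style boundary markers before extracting character trigrams
    w = '<' + word + '>'
    return [w[i:i + 3] for i in range(len(w) - 2)]

tri_vocab = {'<he': 0, 'hel': 1, 'ell': 2, 'llo': 3, 'lo>': 4}   # toy lookup dict
unk_id = len(tri_vocab)

sentence = ['hello', 'world']                     # List[word] from the tokenizer
ngram_ids = [[tri_vocab.get(g, unk_id) for g in trigrams(w)] for w in sentence]
# ngram_ids is List[List[int]]: the embedding part only ever sees these integers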

sxjscience commented 4 years ago

@acphile The problem is that it will be inefficient to have the tokenizer output all the ngram combinations. Instead, you ask the tokenizer to output a list of tokens, and each token is then converted to its embedding.

sxjscience commented 4 years ago

@acphile Is it possible to also refer to the implementation in gensim https://radimrehurek.com/gensim/models/fasttext.html#module-gensim.models.fasttext?

acphile commented 4 years ago

> @acphile The problem is that it will be inefficient to have the tokenizer output all the ngram combinations. Instead, you ask the tokenizer to output a list of tokens, and each token is then converted to its embedding.

For each token, gensim still outputs all ngrams to compute the corresponding embeddings: https://github.com/RaRe-Technologies/gensim/blob/c0e0169565116854993b22efef29e3c402ec6c69/gensim/models/fasttext_inner.pyx#L672 And they use hash buckets for converting ngrams to indices: https://github.com/RaRe-Technologies/gensim/blob/c0e0169565116854993b22efef29e3c402ec6c69/gensim/models/fasttext.py#L1289 I think maybe we can add a hash-based lookup to Vocab as a supplement.
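
For reference, a rough sketch of that hash-bucket trick (a simplified FNV-style hash in the spirit of fastText/gensim; the exact byte handling in fastText differs slightly):

def hash_ngram(ngram, num_buckets=2000000):
    # Hash the ngram into a fixed number of buckets; the bucket id indexes an
    # extra embedding table, so no ngram vocabulary needs to be stored at all.
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF   # 32-bit FNV-1a style update
    return h % num_buckets

bucket_ids = [hash_ngram(g) for g in ['<he', 'hel', 'ell', 'llo', 'lo>']]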

sxjscience commented 4 years ago

I think it's better to do this internally so that the user does not need to care about it. Basically, we need a way to map raw text tokens to embedding vectors.


sxjscience commented 4 years ago

Can you add some tests in https://github.com/dmlc/gluon-nlp/tree/numpy/tests?
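
For example, a hypothetical shape test along these lines (the Vocab constructor and unk_method behaviour are assumed from the discussion above, not taken from the PR):

import numpy as np
from gluonnlp.data import Vocab                  # assumed location of Vocab
from gluonnlp.embedding import load_embeddings

def test_load_embeddings_shape():
    vocab = Vocab(['hello', 'world', 'zzzunseen'])        # assumed constructor
    zeros = lambda words: np.zeros((len(words), 50))      # deterministic unknown init
    matrix = load_embeddings(vocab, 'glove.6B.50d', unk_method=zeros)
    assert matrix.shape == (len(vocab), 50)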