acphile closed this pull request 4 years ago.
Merging #1263 into numpy will increase coverage by 0.23%. The diff coverage is 83.04%.
@@             Coverage Diff              @@
##             numpy    #1263      +/-   ##
==========================================
+ Coverage    82.44%   82.67%    +0.23%
==========================================
  Files           38       41        +3
  Lines         5450     5702      +252
==========================================
+ Hits          4493     4714      +221
- Misses         957      988       +31
Impacted Files | Coverage Δ |
---|---|
src/gluonnlp/embedding/embed_loader.py | 81.52% <81.52%> (ø) |
src/gluonnlp/__init__.py | 100.00% <100.00%> (ø) |
src/gluonnlp/attention_cell.py | 79.74% <100.00%> (-0.26%) :arrow_down: |
src/gluonnlp/embedding/__init__.py | 100.00% <100.00%> (ø) |
src/gluonnlp/embedding/_constants.py | 100.00% <100.00%> (ø) |
src/gluonnlp/op.py | 60.00% <100.00%> (+2.10%) :arrow_up: |
src/gluonnlp/models/roberta.py | 88.78% <0.00%> (-4.48%) :arrow_down: |
src/gluonnlp/models/xlmr.py | 86.88% <0.00%> (-1.12%) :arrow_down: |
src/gluonnlp/layers.py | 86.78% <0.00%> (-0.45%) :arrow_down: |
src/gluonnlp/models/transformer_xl.py | 82.71% <0.00%> (-0.22%) :arrow_down: |
... and 16 more |
Is it possible to get the embedding of words in raw text if it’s a HybridBlock? We may want to calculate the embedding from raw text or for out-of-vocabulary words, which is the purpose of FastText.
@acphile commented on this pull request.
In src/gluonnlp/embedding/embed_loader.py (https://github.com/dmlc/gluon-nlp/pull/1263#discussion_r456769686):
-    for cls_name, embedding_cls in text_embedding_reg.items():
-        if pretrained_name_or_dir in embedding_cls:
-            source = pretrained_name_or_dir
-            embedding_dir = os.path.join(root_path, cls_name)
-            file_name, file_hash = embedding_cls[source]
-            url = _get_file_url(cls_name, file_name)
-            file_path = os.path.join(embedding_dir, file_name)
-            if not os.path.exists(file_path) or not check_sha1(file_path, file_hash):
-                logging.info('Embedding file {} is not found. Downloading from Gluon Repository. '
-                             'This may take some time.'.format(file_name))
-                download(url, file_path, sha1_hash=file_hash)
-            return file_path
-    return None
+def load_embeddings(vocab, pretrained_name_or_dir='glove.6B.50d', unknown='<unk>',
I think we can create a base class EmbeddingModel(HybridBlock) to serve as the base class of embedding models, and we can attach some evaluation functions to this class. For just loading an embedding matrix, we can simply keep the current load_embeddings and let users manually call set_data, or have a class WordEmbedding(EmbeddingModel) and move the functionality of load_embeddings into it. Complex embedding models like FastText or a character-level CNN can then be implemented on top of EmbeddingModel. These embedding models may live in models/.
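A minimal sketch of what such a hierarchy could look like, assuming MXNet 2's Gluon API; the class names, method names, and the way the pretrained matrix is injected are illustrative, not a final API:

```python
import mxnet as mx
from mxnet.gluon import HybridBlock, nn


class EmbeddingModel(HybridBlock):
    """Base class: maps token indices to embedding vectors."""

    def forward(self, token_ids):
        raise NotImplementedError


class WordEmbedding(EmbeddingModel):
    """Plain word-level lookup that can be filled from a pretrained matrix."""

    def __init__(self, vocab_size, embed_size, **kwargs):
        super().__init__(**kwargs)
        self.embed = nn.Embedding(vocab_size, embed_size)

    def load_pretrained(self, matrix):
        # matrix: numpy array of shape (vocab_size, embed_size),
        # e.g. the output of load_embeddings(vocab, ...)
        self.embed.weight.set_data(mx.np.array(matrix))

    def forward(self, token_ids):
        return self.embed(token_ids)


# Hypothetical usage:
# matrix = load_embeddings(vocab, 'glove.6B.50d')
# emb = WordEmbedding(*matrix.shape)
# emb.initialize()
# emb.load_pretrained(matrix)
```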
For getting the embeddings of unknown words, there are the following options: the default method is to sample from a normal distribution, and users can use unk_method to define their own way. I made some simple tests showing that fasttext.cc actually computes faster than the original gluon approach in v0.9.x, so you can simply do as follows:
import fasttext
import numpy as np
from gluonnlp.embedding.embed_loader import load_embeddings

fast = fasttext.load_model('model.bin')
def ngram(words):
    return np.array([fast[word] for word in words])
embedding_matrix = load_embeddings(vocab, source, unk_method=ngram)
In this case, we get an embedding matrix for a given vocabulary. Now I have added the feature of using the embedding file itself as the vocabulary:
embedding_matrix, vocab = load_embeddings(vocab=None, pretrained_name_or_dir=source)
I think generally we just use the embedding of <unk> for these OOV words. Of course we can still generate some initial embedding vectors by FastText, but since they are not updated during training, I don't think it is very useful.
To further preserve the information from these words, in practice we may use some character-level NN. For example, we may use a character-level CNN to compute the embedding of a word, with parameters that are learnable during training. That's why I think we should create a base class EmbeddingModel(HybridBlock) and provide some embedding models (in other words, parts of the neural network) so that it would be easier for users to build their NLP models.
EmbeddingModel (and its child classes) can serve as a black box: inside there can be only a simple embedding lookup or some NN layers. What we want is to input List[word index] or List[List[character index]] (maybe both, or others) and get the word representations, which can even be contextualized representations. So we can set the embedding evaluation functions as class methods in a more general way.
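As an illustration of the kind of model that could live behind that black box, here is a rough, hypothetical sketch of a character-level CNN embedding (Gluon API assumed; the CharCNNEmbedding name and layer sizes are made up for the example):

```python
import mxnet as mx
from mxnet.gluon import HybridBlock, nn


class CharCNNEmbedding(HybridBlock):
    """Word embedding from character indices: char lookup -> Conv1D -> max-pool."""

    def __init__(self, num_chars, char_embed_size=16, num_filters=128,
                 kernel_size=3, **kwargs):
        super().__init__(**kwargs)
        self.char_embed = nn.Embedding(num_chars, char_embed_size)
        self.conv = nn.Conv1D(channels=num_filters, kernel_size=kernel_size)
        self.pool = nn.GlobalMaxPool1D()

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_len) integer indices
        x = self.char_embed(char_ids)        # (num_words, max_word_len, char_embed)
        x = mx.np.transpose(x, (0, 2, 1))    # Conv1D expects (batch, channels, width)
        x = self.pool(self.conv(x))          # (num_words, num_filters, 1)
        return mx.np.squeeze(x, axis=-1)     # one learnable vector per word
```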
The advantage of fasttext is that there is no need to care about OOV words. Thus, you may still need to offer this functionality.
@acphile This is the advantage of using subwords. Basically, there will be no or fewer OOV words if you are using a subword representation. For example, GPT-2/GPT-3 chose to use byte-based BPE encoding because there will never be OOV words. Also, you may check Section 2.1 of https://arxiv.org/pdf/1911.03688.pdf to see how different models may adopt different strategies for dealing with the OOV problem.
I understand that, and in my context vocabulary not only refers to the vocabulary of words but also to a lookup dict which records the different tokens for a certain input. For example, a vocabulary of trigrams records the most frequent trigrams that occur in the dataset. Raw text can be transformed into several different types of model inputs (like List[word], List[List[ngram]], List[BPE]) by a tokenizer, and further transformed into List[int] or List[List[int]] by a Vocab. For the embedding part, we only need to deal with integers, which is why I previously suggested using EmbeddingModel(HybridBlock). Of course FastText is very useful, and I think it is better to implement it as class FastText(EmbeddingModel) instead of a standalone class FastText. It is somewhat like https://github.com/dmlc/gluon-nlp/blob/v0.9.x/src/gluonnlp/model/train/embedding.py#L175, but I think we could improve its implementation.
@acphile The problem is that it will be inefficient to have the tokenizer output all the ngram combinations. Instead, you ask the tokenizer to output a list of tokens and each token will be converted to the embedding.
@acphile Is it possible to also refer to the implementation in gensim https://radimrehurek.com/gensim/models/fasttext.html#module-gensim.models.fasttext?
For each token, gensim still outputs all ngrams to compute the corresponding embeddings: https://github.com/RaRe-Technologies/gensim/blob/c0e0169565116854993b22efef29e3c402ec6c69/gensim/models/fasttext_inner.pyx#L672
And they use hash buckets for converting ngrams to indexes: https://github.com/RaRe-Technologies/gensim/blob/c0e0169565116854993b22efef29e3c402ec6c69/gensim/models/fasttext.py#L1289
I think maybe we can add hash lookup to vocab as a supplement.
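For reference, a rough Python illustration of that bucket scheme (a simplified version of the FNV-1a-style hash that fastText/gensim use; the bucket count and n-gram range here are just example values):

```python
def ngram_hash(ngram, num_buckets=2_000_000):
    """Simplified FNV-1a-style hash of an n-gram, reduced to a bucket index."""
    h = 2166136261
    for b in ngram.encode('utf-8'):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h % num_buckets


def ngram_bucket_ids(word, min_n=3, max_n=6, num_buckets=2_000_000):
    """Map the character n-grams of '<word>' to hash-bucket indices."""
    extended = '<' + word + '>'
    ids = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ids.append(ngram_hash(extended[i:i + n], num_buckets))
    return ids


# An OOV word vector is then typically the average of the embeddings stored in
# these buckets (in-vocabulary words also add their own whole-word vector).
print(ngram_bucket_ids('hello'))
```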
I think it’s better to do it internally so the user does not need to care about it. Basically, we need a way to map raw text tokens to embedding vectors.
Can you add some tests in https://github.com/dmlc/gluon-nlp/tree/numpy/tests?
Description
Create embedding-related methods in gluonnlp.embedding (a brief usage sketch follows the list):
- embed_loader.list_sources: Get valid token embedding names and their pre-trained file names.
- embed_loader.load_embeddings: Load a pretrained embedding file to build an embedding matrix for a given Vocab.
- evaluation.CosineSimilarity: a function to compute the cosine similarity.
- evaluation.HyperbolicCosineSimilarity: a function to compute the cosine similarity in hyperbolic space.
- evaluation.ThreeCosAdd: a class for the 3CosAdd analogy.
- evaluation.ThreeCosMul: a class for the 3CosMul analogy.
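A brief, hedged usage sketch of the loading APIs listed above (the import path and argument values follow the descriptions in this PR and may differ in the final version):

```python
from gluonnlp.embedding.embed_loader import list_sources, load_embeddings

print(list_sources())  # names of the available pretrained embedding sources

# Let the embedding file itself define the vocabulary ...
matrix, vocab = load_embeddings(vocab=None, pretrained_name_or_dir='glove.6B.50d')

# ... or build a matrix aligned with an existing Vocab (OOV rows are sampled from
# a normal distribution by default, or filled by a user-supplied unk_method).
matrix = load_embeddings(vocab, 'glove.6B.50d')
```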
About evaluation
Currently the implementations of embedding.evaluation are not very satisfactory. Suggestions are welcome.
Checklist
Essentials
Changes
Comments
cc @dmlc/gluon-nlp-team