Awesome - really look forward to supporting this in Flair!
pytorch-transformers 1.0 was released today: https://github.com/huggingface/pytorch-transformers
A migration summary can be found in the readme here. The main takeaways from the migration process are:
- models are now saved with `save_pretrained(save_directory)`
- models are in evaluation mode by default and must be set to training mode during training
Otherwise, pytorch-transformers 1.0 looks great, and I'm looking forward to using it in Flair as well as standalone.
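A minimal sketch of those two takeaways, using the BERT classes from pytorch-transformers 1.0 (the checkpoint directory is just an example path):

```python
from pytorch_transformers import BertModel

# Models now start in evaluation mode (dropout disabled) by default.
model = BertModel.from_pretrained("bert-base-cased")

model.train()  # switch to training mode before fine-tuning
model.eval()   # and back to evaluation mode for inference

# Models are now serialized with save_pretrained() and a target directory:
model.save_pretrained("./bert-checkpoint")  # example path
```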
Looks great!
Using the `last_hidden_states` works e.g. for Transformer-XL, but badly fails for XLNet and OpenAI GPT-2. As we use a feature-based approach, I'm currently doing some extensive per-layer analysis for all of the architectures. I'll post the results here (I'm mainly using a 0.1 downsampled CoNLL-2003 English corpus for NER).
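For reference, a rough sketch of how the per-layer hidden states can be pulled out of a pytorch-transformers model for such a feature-based setup (shown for XLNet; the sentence is only an example and the exact position of the hidden states in the returned tuple depends on the model configuration):

```python
import torch
from pytorch_transformers import XLNetModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
# output_hidden_states=True makes the model also return all per-layer outputs
model = XLNetModel.from_pretrained("xlnet-large-cased", output_hidden_states=True)
model.eval()

input_ids = torch.tensor([tokenizer.encode("Berlin and Munich are nice cities .")])

with torch.no_grad():
    outputs = model(input_ids)

# With output_hidden_states=True (and attention outputs disabled), the last
# element of the output tuple holds the hidden states: the embedding output
# plus one tensor per Transformer layer.
all_hidden_states = outputs[-1]
layer_1_output = all_hidden_states[1]  # e.g. use layer 1 as token features
```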
I ran some per-layer analysis with the large XLNet model:
Layer | F-Score (0.1 downsampled CoNLL-2003 NER corpus) |
---|---|
1 | 82.91 |
2 | 81.94 |
3 | 80.10 |
4 | 82.62 |
5 | 84.16 |
6 | 79.19 |
7 | 80.76 |
8 | 81.85 |
9 | 82.64 |
10 | 74.29 |
11 | 78.99 |
12 | 79.34 |
13 | 76.22 |
14 | 79.67 |
15 | 77.07 |
16 | 73.49 |
17 | 73.20 |
18 | 74.36 |
19 | 72.32 |
20 | 71.30 |
21 | 74.97 |
22 | 75.04 |
23 | 66.84 |
24 | 03.37 |
XLNetEmbeddings
To use the new XLNet embeddings in flair, just do:

```python
from flair.data import Sentence
from flair.embeddings import XLNetEmbeddings

embeddings = XLNetEmbeddings()

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

# Get embeddings
for token in s.tokens:
    print(token.embedding)
```
`XLNetEmbeddings` has three parameters:
- `model`: specifies the XLNet model; `pytorch-transformers` currently comes with `xlnet-large-cased` and `xlnet-base-cased`
- `layers`: comma-separated string of layers. The default is `1`; to use more layers (which will then be concatenated), just pass e.g. `1,2,3,4`
- `pooling_operation`: defines the pooling operation for subwords. By default, the first and last subword embeddings are concatenated and used. Other pooling operations are also available: `first`, `last` and `mean`

A usage example with non-default parameters is shown below.
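For example, an instantiation that concatenates the first four layers and mean-pools the subword embeddings could look like this (assuming the parameters described above):

```python
from flair.embeddings import XLNetEmbeddings

# base model, concatenation of layers 1-4, mean pooling over subwords
embeddings = XLNetEmbeddings(
    model="xlnet-base-cased",
    layers="1,2,3,4",
    pooling_operation="mean",
)
```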
I also ran some per-layer analysis for the Transformer-XL embeddings:
Layer | F-Score (0.1 downsampled CoNLL-2003 NER corpus) |
---|---|
1 | 80.88 |
2 | 81.68 |
3 | 82.88 |
4 | 80.89 |
5 | 84.74 |
6 | 80.68 |
7 | 82.65 |
8 | 79.53 |
9 | 79.25 |
10 | 79.64 |
11 | 80.07 |
12 | 84.26 |
13 | 81.22 |
14 | 80.59 |
15 | 81.31 |
16 | 78.95 |
17 | 79.85 |
18 | 80.69 |
Experiments with combination of layers:
Layers | F-Score (0.1 downsampled CoNLL-2003 NER corpus) |
---|---|
1,2 | 80.84 |
1,2,3 | 81.99 |
1,2,3,4 | 78.44 |
1,2,3,4,5 | 80.89 |
That's the reason why I chose layers `1,2,3` as the default for the `TransformerXLEmbeddings` for now.
TransformerXLEmbeddings
`TransformerXLEmbeddings` has two parameters:
- `model`: specifies the Transformer-XL model; `pytorch-transformers` currently comes with `transfo-xl-wt103`
- `layers`: comma-separated string of layers. The default is `1,2,3`; to use more layers (which will then be concatenated), just pass e.g. `1,2,3,4`

A minimal usage example is shown below.
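Analogous to the XLNet example above, a minimal usage sketch (with the default model and layers spelled out explicitly):

```python
from flair.data import Sentence
from flair.embeddings import TransformerXLEmbeddings

# default model is transfo-xl-wt103, default layers are 1,2,3
embeddings = TransformerXLEmbeddings(model="transfo-xl-wt103", layers="1,2,3")

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
    print(token.embedding)
```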
@alanakbik I'm planning to run a per-layer analysis for all of the Transformer-based models. However, it is really hard to give a recommendation for default layer(s).
Recently, I found this NAACL paper: Linguistic Knowledge and Transferability of Contextual Representations, which uses a "scalar mix" of all layers. An implementation can be found in the allennlp repo, see it here. Do you have any idea how we could use this technique here 🤔
Would be awesome if we could adopt that 🤗
@stefan-it from the paper it also seems that best approach / layers would vary by task so giving overall recommendations might be really difficult. From a quick look it seems like their implementation could be integrated as part of an embeddings class, i.e. after retrieving the layers put them through this code. Might be interesting to try out!
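For illustration, here is a rough, simplified sketch of what such a scalar mix module computes (softmax-normalized learned layer weights plus a global scale; the actual allennlp implementation additionally supports layer-wise normalization):

```python
import torch
import torch.nn as nn


class ScalarMixSketch(nn.Module):
    """Computes gamma * sum_l softmax(w)_l * h_l over the layer outputs h_l."""

    def __init__(self, num_layers: int):
        super().__init__()
        # one learnable scalar weight per layer, plus a global scale gamma
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):
        # layer_outputs: list/tuple of tensors with identical shapes,
        # e.g. (batch_size, seq_len, hidden_size) per layer
        normed_weights = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(w * t for w, t in zip(normed_weights, layer_outputs))
        return self.gamma * mixed


# e.g. mixing the embedding layer plus 12 Transformer layers:
scalar_mix = ScalarMixSketch(num_layers=13)
# mixed_representation = scalar_mix(all_hidden_states)
```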
Here are some per-layer analysis results for the first GPT model:
Layer | F-Score (0.1 downsampled CoNLL-2003 NER corpus) |
---|---|
1 | 74.90 |
2 | 66.93 |
3 | 64.62 |
4 | 67.62 |
5 | 62.01 |
6 | 58.19 |
7 | 53.13 |
8 | 55.19 |
9 | 52.61 |
10 | 52.02 |
11 | 64.59 |
12 | 70.08 |
I implemented a first prototype of the scalar mix approach. I was able to get an F-Score of 71.01 (over all layers, incl. the word embedding layer)!
OpenAIGPTEmbeddings
The `OpenAIGPTEmbeddings` class comes with three parameters: `model`, `layers` and `pooling_operation`. A short usage sketch is shown below.
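A short usage sketch, analogous to the XLNet example above (the parameter value shown is an assumption based on the description, not a verified default):

```python
from flair.data import Sentence
from flair.embeddings import OpenAIGPTEmbeddings

embeddings = OpenAIGPTEmbeddings(pooling_operation="first_last")

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
    print(token.embedding)
```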
Hi, I'm using the GH-873-pytorch-transformers branch and tried it, but it raised an error:
`AttributeError: 'Token' object has no attribute 'embeddings'`
Not able to import:
`ImportError: cannot import name 'XLNetEmbeddings'`
Any suggestions?
> Not able to import:
> `ImportError: cannot import name 'XLNetEmbeddings'`
> Any suggestions?
You need to change branch to GH-873
Thanks for the quick reply, I'll check.
@nullphantom just use:

```python
for token in s.tokens:
    print(token.embedding)
```

:)
@stefan-it Interesting results with the scalar mix! What is the effect on runtime, e.g. compared to using only one layer?
I ran some per-layer experiments on the GPT-2 and the GPT-2 medium model:
Layer | GPT-2 F-Score (0.1 downsampled CoNLL-2003 NER corpus) | GPT-2 medium F-Score (0.1 downsampled CoNLL-2003 NER corpus) |
---|---|---|
1 | 42.41 | 45.58 |
2 | 10.26 | 48.52 |
3 | 15.20 | 2.17 |
4 | 22.51 | 18.50 |
5 | 0.00 | 16.22 |
6 | 21.71 | 8.03 |
7 | 12.70 | 15.85 |
8 | 14.10 | 17.74 |
9 | 0.00 | 6.70 |
10 | 18.75 | 0.00 |
11 | 0.00 | 3.22 |
12 | 5.62 | 11.18 |
13 | | 17.09 |
14 | | 14.25 |
15 | | 0.00 |
16 | | 7.02 |
17 | | 8.03 |
18 | | 0.00 |
19 | | 0.00 |
20 | | 9.49 |
21 | | 10.65 |
22 | | 8.38 |
23 | | 18.74 |
24 | | 5.15 |
It does not look very promising, so scalar mix could help here!
OpenAIGPT2Embeddings
To play around with the embeddings from the GPT-2 models, just use:

```python
from flair.data import Sentence
from flair.embeddings import OpenAIGPT2Embeddings

embeddings = OpenAIGPT2Embeddings()

s = Sentence("Berlin and Munich")
embeddings.embed(s)

for token in s.tokens:
    print(token.embedding)
```
@alanakbik In my preliminary experiments with the scalar mix implementation, I couldn't see any big performance issues, but I'll measure it once the implementation is ready.
I'm currently focusing on the per-layer analysis for the XLM model :)
@stefan-it cool! Really looking forward to XLM! Strange that GPT-2 is not doing so well.
Here are the results from a per-layer analysis for the English XLM model:
Layer | F-Score (0.1 downsampled CoNLL-2003 NER corpus) |
---|---|
1 | 76.92 |
2 | 75.91 |
3 | 75.61 |
4 | 73.52 |
5 | 73.66 |
6 | 70.75 |
7 | 70.90 |
8 | 63.58 |
9 | 64.04 |
10 | 57.38 |
11 | 54.70 |
12 | 56.96 |
XLMEmbeddings
The following snippet demonstrates the usage of the new `XLMEmbeddings` class:

```python
from flair.data import Sentence
from flair.embeddings import XLMEmbeddings

embeddings = XLMEmbeddings()

s = Sentence("It is very hot in Munich now .")
embeddings.embed(s)

for token in s.tokens:
    print(token.embedding)
```
I'm currently updating the BertEmbeddings
class to the new pytorch-transformers
API.
Btw: using scalar mix does not help when using the OpenAIGPT2Embeddings
😞
Cool thanks for sharing! Really interesting to see how all these approaches fare. So far, XLNet seems to be doing best at least with individual layers.
I adjusted the BertEmbeddings
class to make it compatible with the new pytorch-transformers
API.
In order to avoid any regression bugs, I compared the performance with the old pytorch-pretrained-BERT
library. Here's the per-layer analysis:
Layer | BERT with pytorch-pretrained-BERT F-Score (0.1 downsampled CoNLL-2003 NER corpus) | BERT with pytorch-transformers F-Score (0.1 downsampled CoNLL-2003 NER corpus) |
---|---|---|
1 | 80.15 | 81.35 |
2 | 79.49 | 82.65 |
3 | 84.20 | 83.44 |
4 | 83.71 | 84.58 |
5 | 87.71 | 88.81 |
6 | 86.56 | 87.34 |
7 | 87.61 | 87.13 |
8 | 86.67 | 85.20 |
9 | 88.17 | 87.73 |
10 | 89.20 | 85.66 |
11 | 86.65 | 87.44 |
12 | 85.82 | 87.06 |
@alanakbik Can I file a PR for these new embeddings? Maybe we can define some kind of roadmap for the newly introduced Transformer-based embeddings:
@stefan-it absolutely! This is a major upgrade that lots of people will want to use. With all the new features, it's probably time to do another Flair release (v0.4.3), the question is whether we wait for the features you outline or release in the very near future?
I'm going to work a bit on it until next week. As I just found a great suggestion/improvement in the `pytorch-transformers` repo (see the issue here), I would make the following code changes: it would be better to have a kind of generic base class :)
PR for the "first phase" is coming soon.
I also added support for RoBERTa, see the RoBERTa: A Robustly Optimized BERT Pretraining Approach paper for more information.
RoBERTa is currently not integrated into `pytorch-transformers`, so I wrote an embedding class around the `torch.hub` module. I tested the model; here are some results for the base model:
Layer | RoBERTa F-Score (0.1 downsampled CoNLL-2003 NER corpus) |
---|---|
1 | 75.16 |
2 | 80.29 |
3 | 81.01 |
4 | 80.41 |
5 | 80.52 |
6 | 80.23 |
7 | 81.31 |
8 | 84.25 |
9 | 80.12 |
10 | 78.09 |
11 | 76.50 |
12 | 81.12 |
The variance for a 0.1 downsampled CoNLL corpus is very high. I did some experiments for RoBERTa in order to compare different pooling operations for subwords using scalar mix:
Pooling operation | Run 1 | Run 2 | Run 3 | Run 4 | Avg. |
---|---|---|---|---|---|
`first_last` | 76.12 | 76.56 | 79.12 | 79.17 | 77.74 |
`first` | 78.91 | 81.41 | 76.79 | 80.00 | 79.28 |
I also used the complete CoNLL corpus (one run) with scalar mix:
Pooling operation | F-Score |
---|---|
`first_last` | 86.97 |
`first` | 87.40 |
BERT (base) achieves 92.2 (reported in their paper). Now I'm going to run some experiments with BERT (base) and scalar mix to have a better comparison :)
Update: BERT (base) achieves an F-Score of 91.38 on the full CoNLL corpus with scalar mix.
An update:
I've re-written the complete tokenization logic for all Transformer-based embeddings (except BERT). In the old version, I passed each token of a sentence into the model separately (which is not very efficient and causes major problems with the GPT-2 tokenizer).
The latest version passes the complete sentence into the model. The embeddings for the subwords are then aligned back to each "Flair" token in the sentence (I wrote some unit tests for that...).
I also added the code for scalar mix from the allennlp
repo.
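To make the alignment step more concrete, here is a simplified, hypothetical sketch of the idea (the helper does not mirror the actual Flair implementation and ignores special tokens): count how many subword pieces each token produces, slice the subword embeddings accordingly, and pool each slice back to one vector per token.

```python
import torch


def pool_subword_embeddings(tokens, tokenizer, subword_embeddings, pooling="first"):
    """Align subword embeddings back to the original tokens of a sentence.

    tokens: list of token strings (e.g. the Flair tokens of one sentence)
    subword_embeddings: tensor of shape (num_subwords, hidden_size), in the
        same order as the subwords produced for the whole sentence
    """
    token_embeddings = []
    offset = 0
    for token in tokens:
        # number of subword pieces this token was split into
        num_pieces = len(tokenizer.tokenize(token))
        pieces = subword_embeddings[offset : offset + num_pieces]
        offset += num_pieces

        if pooling == "first":
            token_embeddings.append(pieces[0])
        elif pooling == "last":
            token_embeddings.append(pieces[-1])
        elif pooling == "first_last":
            token_embeddings.append(torch.cat([pieces[0], pieces[-1]]))
        else:  # "mean"
            token_embeddings.append(pieces.mean(dim=0))

    return token_embeddings
```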
Here are some experiments with the new implementation on a downsampled (0.1) CoNLL corpus for NER. The F-Score is measured and averaged over 4 runs; scalar mix is used:
Model | Pooling | # 1 | # 2 | # 3 | # 4 | Avg. |
---|---|---|---|---|---|---|
RoBERTa (base) | `first` | 86.34 | 86.30 | 90.21 | 87.28 | 87.53 |
GPT-1 | `first_last` | 75.21 | 77.53 | 74.90 | 76.33 | 75.99 |
GPT-1 | `first` | 74.31 | 75.42 | 74.01 | 76.56 | 75.08 |
GPT-2 (medium) | `first_last` | 85.18 | 76.86 | 79.93 | 81.02 | 80.75 |
GPT-2 (medium) | `first` | 78.88 | 79.23 | 80.31 | 76.80 | 78.81 |
XLM (en) | `first_last` | 84.65 | 86.50 | 84.63 | 84.97 | 85.19 |
XLM (en) | `first` | 86.66 | 88.28 | 87.55 | 85.82 | 87.08 |
Transformer-XL | - | 81.03 | 80.17 | 78.67 | 81.34 | 80.53 |
XLNet (base) | `first_last` | 85.66 | 88.59 | 85.74 | 87.36 | 86.84 |
XLNet (base) | `first` | 88.81 | 86.65 | 86.01 | 85.72 | 86.80 |
I'm currently running experiments on the whole CoNLL corpus. Here are some results (only one run):
Model | Pooling | Dev | Test |
---|---|---|---|
BERT (base, cased) | `first` | 94.74 | 91.38 |
BERT (base, uncased) | `first` | 94.61 | 91.03 |
BERT (large, cased) | `first` | 95.23 | 91.69 |
BERT (large, uncased) | `first` | 94.78 | 91.49 |
BERT (large, cased, whole-word-masking) | `first` | 94.88 | 91.16 |
BERT (large, uncased, whole-word-masking) | `first` | 94.94 | 91.20 |
RoBERTa (base) | `first` | 95.35 | 91.51 |
RoBERTa (large) | `first` | 95.83 | 92.11 |
RoBERTa (large) | `mean` | 96.31 | 92.31 |
XLNet (base) | `first_last` | 94.56 | 90.73 |
XLNet (large) | `first_last` | 95.47 | 91.49 |
XLNet (large) | `first` | 95.14 | 91.71 |
XLM (en) | `first_last` | 94.31 | 90.68 |
XLM (en) | `first` | 94.00 | 90.73 |
GPT-2 | `first_last` | 91.35 | 87.47 |
GPT-2 (large) | `first_last` | 94.09 | 90.63 |
Note: the feature-based result from the BERT paper is 96.1 (dev) and 92.4 - 92.8 (test) for the base and large models. However, they "include the maximal document context provided by the data". I found an issue in the allennlp repo (here), and a dev score of 95.3 seems to be possible (without using the document context).
But from these preliminary experiments, RoBERTa (/cc @myleott) seems to perform slightly better at the moment :)
@stefan-it were you using the scalar mix in these experiments on the full CoNLL? Were you always using the default layers as set in the constructor of each class?
I used scalar mix over all layers (incl. the word embedding layer, which is located at index 0) on the full CoNLL corpus. E.g. for RoBERTa the initialization would be:
`emb = RoBERTaEmbeddings(model="roberta.base", layers="0,1,2,3,4,5,6,7,8,9,10,11,12", pooling_operation="first", use_scalar_mix=True)`
I'm not sure about the default parameters when not using scalar mix, because there's no literature on that, except for BERT (base), where a concatenation of the last four layers was proposed.
Ah great - ok, I'll run a similar experiment and report numbers. But aside from this, I think we are ready to merge the PR.
BTW, here are some results using RoBERTa with default parameters and scalar mix, i.e. instantiated like this: `RoBERTaEmbeddings(use_scalar_mix=True)`
Results of three runs:
# 1 | # 2 | # 3 |
---|---|---|
92.03 | 92.05 | 91.96 |
Otherwise I used the exact same parameters as here.
Documentation for the new PyTorch-Transformers embeddings is coming very soon :)
I'll close that issue now (PR was merged).
@stefan-it If I do a git pull, will I be able to access these new additions, or do I have to check out GH-873? Thanks
You can just use the latest master branch :) Or install it via:
`pip install --upgrade git+https://github.com/zalandoresearch/flair.git`
:)
Okay, thanks!
@stefan-it Even after running `pip install --upgrade git+https://github.com/zalandoresearch/flair.git`, when I try to import via `from flair.embeddings import XLNetEmbeddings`, I get `ImportError: cannot import name 'XLNetEmbeddings'`.
@DecentMakeover You can try again; I installed the latest version of flair and the problem disappeared. The last commit was 5 days ago.
@dshaprin Okay, I'll check.
Hi,
the upcoming 1.0 version of `pytorch-pretrained-bert` will introduce several API changes, new models and even a name change to `pytorch-transformers`.
After the final 1.0 release, `flair` could support 7 different Transformer-based architectures:
- `BertEmbeddings`
- `OpenAIGPTEmbeddings`
- `OpenAIGPT2Embeddings` 🛡️
- `TransformerXLEmbeddings`
- `XLNetEmbeddings` 🛡️
- `XLMEmbeddings` 🛡️
- `RoBERTaEmbeddings` 🛡️ (currently not covered by `pytorch-transformers`)

🛡️ indicates a new embedding class for `flair`.
It also introduces a universal API for all models, so quite a few changes in `flair` are necessary to support both old and new embedding classes.
This issue tracks the implementation status for all 6 embedding classes 😊