flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

pytorch-pretrained-bert to pytorch-transformers upgrade #873

Closed · stefan-it closed this issue 5 years ago

stefan-it commented 5 years ago

Hi,

the upcoming 1.0 version of pytorch-pretrained-bert will introduce several API changes, new models and even a name change to pytorch-transformers.

After the final 1.0 release, flair could support 7 different Transformer-based architectures:

🛡️ indicates a new embedding class for flair.

It also introduces a universal API for all models, so quite a few changes in flair are necessary to support both old and new embedding classes.

This issue tracks the implementation status for all 6 embedding classes 😊

alanakbik commented 5 years ago

Awesome - really look forward to supporting this in Flair!

aychang95 commented 5 years ago

pytorch-transformers 1.0 was released today: https://github.com/huggingface/pytorch-transformers

A migration summary can be found in the readme here.


Main takeaways from the migration process are:

Otherwise, pytorch-transformers 1.0 looks great, and I'm looking forward to using it in flair as well as standalone.

alanakbik commented 5 years ago

Looks great!

stefan-it commented 5 years ago

Using the last_hidden_states works e.g. for Transformer-XL, but fails badly for XLNet and OpenAI GPT-2. As we use a feature-based approach, I'm currently doing some extensive per-layer analysis for all of the architectures. I'll post the results here (I'm mainly using a 0.1 downsampled CoNLL-2003 English corpus for NER).
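
Under the new API, the per-layer hidden states can be pulled out roughly like this; a minimal sketch only (model name and flags are illustrative, following the pytorch-transformers migration notes that models now return tuples and expose all hidden states on request):

import torch
from pytorch_transformers import BertModel, BertTokenizer

# Sketch: models now return tuples; the full set of hidden states is opt-in.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

input_ids = torch.tensor([tokenizer.encode("Berlin and Munich are nice cities .")])

with torch.no_grad():
  outputs = model(input_ids)
  last_hidden_state = outputs[0]   # (batch, seq_len, hidden_size)
  all_hidden_states = outputs[-1]  # tuple: embedding output + one tensor per layer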

stefan-it commented 5 years ago

XLNet

I ran some per-layer analysis with the large XLNet model:

Layer F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 82.91
2 81.94
3 80.10
4 82.62
5 84.16
6 79.19
7 80.76
8 81.85
9 82.64
10 74.29
11 78.99
12 79.34
13 76.22
14 79.67
15 77.07
16 73.49
17 73.20
18 74.36
19 72.32
20 71.30
21 74.97
22 75.04
23 66.84
24 03.37

XLNetEmbeddings

To use the new XLNet embeddings in flair just do:

from flair.data import Sentence
from flair.embeddings import XLNetEmbeddings

embeddings = XLNetEmbeddings()

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

# Get embeddings
for token in s.tokens:
  print(token.embedding)

XLNetEmbeddings has three parameters:

  • model: just specify the XLNet model. pytorch-transformers currently comes with xlnet-large-cased and xlnet-base-cased
  • layers: comma-separated string of layers. The default is 1; to use more layers (they will be concatenated) just pass: 1,2,3,4
  • pooling_operation: defines the pooling operation for subwords. By default, the first and last subword embeddings are concatenated and used. Other pooling operations are also available: first, last and mean

stefan-it commented 5 years ago

Transformer-XL

I also ran some per-layer analysis for the Transformer-XL embeddings:

Layer F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 80.88
2 81.68
3 82.88
4 80.89
5 84.74
6 80.68
7 82.65
8 79.53
9 79.25
10 79.64
11 80.07
12 84.26
13 81.22
14 80.59
15 81.31
16 78.95
17 79.85
18 80.69

Experiments with combinations of layers:

Layers F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1,2 80.84
1,2,3 81.99
1,2,3,4 78.44
1,2,3,4,5 80.89

That's the reason why I chose layers 1,2,3 as the default for TransformerXLEmbeddings for now.

TransformerXLEmbeddings

TransformerXLEmbeddings has two parameters:
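
A usage sketch, analogous to the XLNetEmbeddings example above (the constructor arguments shown here are assumptions based on the other embedding classes):

from flair.data import Sentence
from flair.embeddings import TransformerXLEmbeddings

# layers="1,2,3" mirrors the default discussed above; the selected layers
# are concatenated per token
embeddings = TransformerXLEmbeddings(layers="1,2,3")

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
  print(token.embedding)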

stefan-it commented 5 years ago

@alanakbik I'm planning to run per-layer analysis for all of the Transformer-based models.

However, it is really hard to give a recommendation for default layer(s).

Recently, I found this NAACL paper: Linguistic Knowledge and Transferability of Contextual Representations, which uses a "scalar mix" of all layers. An implementation can be found in the allennlp repo; see it here. Do you have any idea how we could use this technique here? 🤔

Would be awesome if we can adopt that 🤗

alanakbik commented 5 years ago

@stefan-it from the paper it also seems that best approach / layers would vary by task so giving overall recommendations might be really difficult. From a quick look it seems like their implementation could be integrated as part of an embeddings class, i.e. after retrieving the layers put them through this code. Might be interesting to try out!
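
The core of their scalar mix is quite small; here is a minimal sketch of the idea (a learned, softmax-normalized weighted sum over the layer outputs, modelled on the allennlp implementation, not code from flair):

import torch
import torch.nn as nn

class ScalarMix(nn.Module):
  """Sketch of the scalar mix from Liu et al. (2019): a learned,
  softmax-normalized weighted sum over all layer representations,
  scaled by a learned gamma."""

  def __init__(self, num_layers):
    super().__init__()
    self.weights = nn.Parameter(torch.zeros(num_layers))
    self.gamma = nn.Parameter(torch.ones(1))

  def forward(self, layer_tensors):
    # layer_tensors: list of tensors with identical shape, e.g. (seq_len, hidden_size)
    normed = torch.softmax(self.weights, dim=0)
    return self.gamma * sum(w * t for w, t in zip(normed, layer_tensors))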

stefan-it commented 5 years ago

OpenAI GPT-1

Here is some per-layer analysis for the first GPT model:

Layer F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 74.90
2 66.93
3 64.62
4 67.62
5 62.01
6 58.19
7 53.13
8 55.19
9 52.61
10 52.02
11 64.59
12 70.08

I implemented a first prototype of the scalar mix approach. I was able to get an F-Score of 71.01 (over all layers, incl. the word embedding layer)!

OpenAIGPTEmbeddings

The OpenAIGPTEmbeddings comes with three parameters: model, layers and pooling_operation.
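
Usage follows the same pattern as the XLNet example above; a sketch (the defaults shown here are assumptions):

from flair.data import Sentence
from flair.embeddings import OpenAIGPTEmbeddings

# constructor defaults (model, layers, pooling_operation) assumed to mirror XLNetEmbeddings
embeddings = OpenAIGPTEmbeddings()

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
  print(token.embedding)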

ilham-bintang commented 5 years ago

XLNet

I ran some per-layer analysis with the large XLNet model:

[per-layer F-Score table quoted from the comment above]

XLNetEmbeddings

To use the new XLNet embeddings in flair just do:

from flair.data import Sentence
from flair.embeddings import XLNetEmbeddings

embeddings = XLNetEmbeddings()

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

# Get embeddings
for token in s.tokens:
  print(token.embeddings)

XLNetEmbeddings has three parameters:

  • model: just specify the XLNet model. pytorch-transformers currently comes with xlnet-large-cased and xlnet-base-cased
  • layers: comma-separated string of layers. The default is 1; to use more layers (they will be concatenated) just pass: 1,2,3,4
  • pooling_operation: defines the pooling operation for subwords. By default, the first and last subword embeddings are concatenated and used. Other pooling operations are also available: first, last and mean

Hi, I'm using the GH-873-pytorch-transformers branch and tried this, but it raised an error: AttributeError: 'Token' object has no attribute 'embeddings'

DecentMakeover commented 5 years ago

Not able to import

ImportError: cannot import name 'XLNetEmbeddings'

Any suggestions?

ilham-bintang commented 5 years ago

Not able to import

ImportError: cannot import name 'XLNetEmbeddings'

Any suggestions?

You need to switch to the GH-873 branch.

DecentMakeover commented 5 years ago

Thanks for the quick reply, I'll check.

stefan-it commented 5 years ago

@nullphantom just use:

for token in s.tokens:
  print(token.embedding)

:)

alanakbik commented 5 years ago

@stefan-it interesting results with the scalar mix! How is the effect on runtime, i.e. for instance comparing scalar mix with only one layer?

stefan-it commented 5 years ago

OpenAI GPT-2

I ran some per-layer experiments on the GPT-2 and the GPT-2 medium model:

Layer GPT-2 F-Score (0.1 downsampled CoNLL-2003 NER corpus) GPT-2 medium F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 42.41 45.58
2 10.26 48.52
3 15.20 2.17
4 22.51 18.50
5 0.00 16.22
6 21.71 8.03
7 12.70 15.85
8 14.10 17.74
9 0.00 6.70
10 18.75 0.00
11 0.00 3.22
12 5.62 11.18
13 - 17.09
14 - 14.25
15 - 0.00
16 - 7.02
17 - 8.03
18 - 0.00
19 - 0.00
20 - 9.49
21 - 10.65
22 - 8.38
23 - 18.74
24 - 5.15

It does not look very promising, so scalar mix could help here!

OpenAIGPT2Embeddings

To play around with the embeddings from the GPT-2 models, just use:

from flair.data import Sentence
from flair.embeddings import OpenAIGPT2Embeddings

embeddings = OpenAIGPT2Embeddings()

s = Sentence("Berlin and Munich")
embeddings.embed(s)

for token in s.tokens:
  print(token.embedding)

stefan-it commented 5 years ago

@alanakbik In my preliminary experiments with the scalar mix implementation, I couldn't see any big performance issues, but I'll measure it once the implementation is ready.

I'm currently focussing on per-layer analysis for the XLM model :)

alanakbik commented 5 years ago

@stefan-it cool! Really looking forward to XLM! Strange that GPT-2 is not doing so well.

stefan-it commented 5 years ago

XLM

Here are the results from a per-layer analysis for the English XLM model:

Layer F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 76.92
2 75.91
3 75.61
4 73.52
5 73.66
6 70.75
7 70.90
8 63.58
9 64.04
10 57.38
11 54.70
12 56.96

XLMEmbeddings

The following snippet demonstrates the usage of the new XLMEmbeddings class:

from flair.data import Sentence
from flair.embeddings import XLMEmbeddings

embeddings = XLMEmbeddings()

s = Sentence("It is very hot in Munich now .")
embeddings.embed(s)

for token in s.tokens:
  print(token.embedding)

stefan-it commented 5 years ago

I'm currently updating the BertEmbeddings class to the new pytorch-transformers API.

Btw: using scalar mix does not help when using the OpenAIGPT2Embeddings 😞

alanakbik commented 5 years ago

Cool thanks for sharing! Really interesting to see how all these approaches fare. So far, XLNet seems to be doing best at least with individual layers.

stefan-it commented 5 years ago

I adjusted the BertEmbeddings class to make it compatible with the new pytorch-transformers API.

In order to avoid any regression bugs, I compared the performance with the old pytorch-pretrained-BERT library. Here's the per-layer analysis:

Layer BERT with pytorch-pretrained-BERT F-Score (0.1 downsampled CoNLL-2003 NER corpus) BERT with pytorch-transformers F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 80.15 81.35
2 79.49 82.65
3 84.20 83.44
4 83.71 84.58
5 87.71 88.81
6 86.56 87.34
7 87.61 87.13
8 86.67 85.20
9 88.17 87.73
10 89.20 85.66
11 86.65 87.44
12 85.82 87.06
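
Usage of the class itself is unchanged; as a quick sketch (layers="-1,-2,-3,-4" corresponds to the concatenation of the last four layers proposed in the BERT paper):

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embeddings = BertEmbeddings("bert-base-cased", layers="-1,-2,-3,-4")

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
  print(token.embedding)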

stefan-it commented 5 years ago

@alanakbik Can I file a PR for these new embeddings? Maybe we can define some kind of roadmap for the newly introduced Transformer-based embeddings:

alanakbik commented 5 years ago

@stefan-it absolutely! This is a major upgrade that lots of people will want to use. With all the new features, it's probably time to do another Flair release (v0.4.3); the question is whether we wait for the features you outline or release in the very near future?

stefan-it commented 5 years ago

I'm going to work on it a bit until next week. I just found a great suggestion/improvement in the pytorch-transformers repo (see the issue here), so I would make the following code changes:

So it would be better to have a kind of generic base class :)

stefan-it commented 5 years ago

PR for the "first phase" is coming soon.

I also added support for RoBERTa; see the RoBERTa: A Robustly Optimized BERT Pretraining Approach paper for more information.

RoBERTa is currently not integrated into pytorch-transformers, so I wrote an embedding class around the torch.hub module. I tested the model; here are some results for the base model:

Layer RoBERTa F-Score (0.1 downsampled CoNLL-2003 NER corpus)
1 75.16
2 80.29
3 81.01
4 80.41
5 80.52
6 80.23
7 81.31
8 84.25
9 80.12
10 78.09
11 76.50
12 81.12

stefan-it commented 5 years ago

The variance for a 0.1 downsampled CoNLL corpus is very high. I did some experiments for RoBERTa in order to compare different pooling operations for subwords using scalar mix:

Pooling operation Run 1 Run 2 Run 3 Run 4 Avg.
first_last 76.12 76.56 79.12 79.17 77.74
first 78.91 81.41 76.79 80.00 79.28

I also used the complete CoNLL corpus (one run) with scalar mix:

Pooling operation F-Score
first_last 86.97
first 87.40

BERT (base) achieves 92.2 (reported in their paper). Now I'm going to run some experiments with BERT (base) and scalar mix to have a better comparison :)

Update: BERT (base) achieves an F-Score of 91.38 on the full CoNLL corpus with scalar mix.

stefan-it commented 5 years ago

An update:

I've re-written the complete tokenization logic for all Transformer-based embeddings (except BERT). In the old version, I passed each token of a sentence into the model separately (which is not very efficient and causes major problems with the GPT-2 tokenizer).

The latest version passes the complete sentence into the model. The embeddings for subwords are then aligned back to each "Flair" token in a sentence (I wrote some unit tests for that...).
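
Conceptually, the alignment step ends with one of the pooling operations used throughout this thread; an illustrative sketch (not the actual flair implementation, just the idea):

import torch

def pool_subword_embeddings(subword_vectors, pooling_operation="first_last"):
  # subword_vectors: list of tensors for the subwords of one original token
  if pooling_operation == "first":
    return subword_vectors[0]
  if pooling_operation == "last":
    return subword_vectors[-1]
  if pooling_operation == "first_last":
    return torch.cat([subword_vectors[0], subword_vectors[-1]])
  if pooling_operation == "mean":
    return torch.stack(subword_vectors).mean(dim=0)
  raise ValueError(f"Unknown pooling operation: {pooling_operation}")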

I also added the code for scalar mix from the allennlp repo.

Here are some experiments with the new implementation on a downsampled (0.1) CoNLL corpus for NER. The F-Score is measured and averaged over 4 runs; scalar mix is used:

Model Pooling # 1 # 2 # 3 # 4 Avg.
RoBERTa (base) first 86.34 86.30 90.21 87.28 87.53
GPT-1 first_last 75.21 77.53 74.90 76.33 75.99
GPT-1 first 74.31 75.42 74.01 76.56 75.08
GPT-2 (medium) first_last 85.18 76.86 79.93 81.02 80.75
GPT-2 (medium) first 78.88 79.23 80.31 76.80 78.81
XLM (en) first_last 84.65 86.50 84.63 84.97 85.19
XLM (en) first 86.66 88.28 87.55 85.82 87.08
Transformer-XL - 81.03 80.17 78.67 81.34 80.53
XLNet (base) first_last 85.66 88.59 85.74 87.36 86.84
XLNet (base) first 88.81 86.65 86.01 85.72 86.80

I'm currently running experiments on the whole CoNLL corpus. Here are some results (only one run):

Model Pooling Dev Test
BERT (base, cased) first 94.74 91.38
BERT (base, uncased) first 94.61 91.03
BERT (large, cased) first 95.23 91.69
BERT (large, uncased) first 94.78 91.49
BERT (large, cased, whole-word-masking) first 94.88 91.16
BERT (large, uncased, whole-word-masking) first 94.94 91.20
RoBERTa (base) first 95.35 91.51
RoBERTa (large) first 95.83 92.11
RoBERTa (large) mean 96.31 92.31
XLNet (base) first_last 94.56 90.73
XLNet (large) first_last 95.47 91.49
XLNet (large) first 95.14 91.71
XLM (en) first_last 94.31 90.68
XLM (en) first 94.00 90.73
GPT-2 first_last 91.35 87.47
GPT-2 (large) first_last 94.09 90.63

Note: The feature-based result from the BERT paper is 96.1 (dev) and 92.4 - 92.8 for the base and large models (test). But they "include the maximal document context provided by the data". I found an issue in the allennlp repo (here), and a dev score of 95.3 seems to be possible (without using the document context).

But from these preliminary experiments, RoBERTa (/cc @myleott) seems to perform slightly better at the moment :)

alanakbik commented 5 years ago

@stefan-it were you using the scalar mix in these experiments on the full CoNLL? Were you always using the default layers as set in the constructor of each class?

stefan-it commented 5 years ago

I used scalar mix for all layers (incl. word embedding layer, which is located at index 0) on the full CoNLL. E.g. for RoBERTa the init. would be:

from flair.embeddings import RoBERTaEmbeddings

emb = RoBERTaEmbeddings(model="roberta.base", layers="0,1,2,3,4,5,6,7,8,9,10,11,12", pooling_operation="first", use_scalar_mix=True)

I'm not sure about the default parameters when not using scalar mix, because there's no literature about that, except for BERT (base), where a concatenation of the last four layers was proposed.

alanakbik commented 5 years ago

Ah great - ok, I'll run a similar experiment and report numbers. But aside from this, I think we are ready to merge the PR.

alanakbik commented 5 years ago

BTW here are some results using RoBERTa with default parameters and scalar mix, i.e. instantiated like this:

RoBERTaEmbeddings(use_scalar_mix=True)

Results of three runs:

# 1 # 2 # 3
92.03 92.05 91.96

Otherwise using the exact same parameters as here.

stefan-it commented 5 years ago

Documentation for the new PyTorch-Transformers embeddings is coming very soon :)

I'll close this issue now (the PR was merged).

DecentMakeover commented 5 years ago

@stefan-it if I do git pull, will I be able to access these new additions, or do I have to check out GH-873? Thanks

stefan-it commented 5 years ago

You can just use the latest master branch :) Or install it via:

pip install --upgrade git+https://github.com/zalandoresearch/flair.git

:)

DecentMakeover commented 5 years ago

okay thanks !

DecentMakeover commented 5 years ago

@stefan-it Even after running pip install --upgrade git+https://github.com/zalandoresearch/flair.git,

when I try from flair.embeddings import XLNetEmbeddings

I get ImportError: cannot import name 'XLNetEmbeddings'

dshaprin commented 5 years ago

@DecentMakeover You can try again; I installed the latest version of flair and the problem disappeared. The last commit was 5 days ago.

DecentMakeover commented 5 years ago

@dshaprin Okay, I'll check.