flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Classic Word Embedding baseline in POS far below that reported in the paper #573

Closed: alexandres closed this issue 4 years ago

alexandres commented 5 years ago

Hi,

First thank you for the great work on this library! :)

I'm trying to replicate the "Classic Word Embedding + BiLSTM-CRF" result from http://aclweb.org/anthology/C18-1139 on the PTB POS dataset: 96.94 ± 0.02 accuracy.

I followed the instructions at https://github.com/zalandoresearch/flair/blob/master/resources/docs/EXPERIMENTS.md#penn-treebank-part-of-speech-tagging-english

My code and corpus statistics are available at https://gist.github.com/alexandres/a54506e31d038cce75f31d09c60c9df8
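For anyone following along, the EXPERIMENTS.md recipe amounts to roughly the sketch below. This assumes a newer Flair where flair.datasets.ColumnCorpus is available; the corpus path and column layout are placeholders for however you exported PTB:

from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# placeholder layout: token in column 0, POS tag in column 1
columns = {0: 'text', 1: 'pos'}
corpus = ColumnCorpus('path/to/ptb', columns)

tag_dictionary = corpus.make_tag_dictionary(tag_type='pos')

# 'extvec' loads the Komninos embeddings used in the paper
embeddings = WordEmbeddings('extvec')

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='pos',
                        use_crf=True)

ModelTrainer(tagger, corpus).train('resources/taggers/pos-extvec',
                                   max_epochs=150)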

My corpus statistics exactly match those from https://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf, namely: [screenshot: corpus statistics table]

Unfortunately my POS accuracy is around 94% with the "Classic Word Embedding + BiLSTM-CRF" using the Komninos embeddings.

Any idea what I'm doing wrong?

Note: I notice that the embeddings are not fine-tuned during training. There is no mention of this in the paper. Perhaps this is the cause?
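If it helps to test that hypothesis, newer Flair releases expose a fine_tune flag on WordEmbeddings; whether your installed version accepts it is an assumption to verify:

from flair.embeddings import WordEmbeddings

# assumption: the fine_tune flag only exists in newer Flair releases;
# in the versions discussed here, classic word embeddings stay frozen
embeddings = WordEmbeddings('extvec', fine_tune=True)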

Thanks!

alanakbik commented 5 years ago

Hi @alexandres, that is strange; your code looks good.

Could you try going back to Flair version 0.2 and running the experiment again with the instructions in the 0.2 EXPERIMENTS.md?
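Assuming 0.2.0 is the exact tag published on PyPI, pinning it should just be:

pip install flair==0.2.0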

alexandres commented 5 years ago

Thanks a bunch @alanakbik . That did it!

On the latest release (pip install flair), scores at the end of training:

2019-02-27 23:49:38,087 loading file resources/taggers/pos-extvec/best-model.pt     
2019-02-27 23:50:44,474 MICRO_AVG: acc 0.9358 - f1-score 0.9668                     
2019-02-27 23:50:44,475 MACRO_AVG: acc 0.876 - f1-score 0.9218586956521738          

On v0.2.0, after a single epoch:

0       (11:45:56)      11.641517       0       0.100000        DEV   7082      0.9462540222208731      TEST    7095 0.9452774307001712

So: 0.9358 (micro-avg accuracy) after 150 epochs on the latest release vs. 0.94625 (dev) after a single epoch on v0.2.0.
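For anyone wanting to double-check numbers like these, here is a minimal sketch for reloading the saved checkpoint and tagging a sentence (load_from_file follows the ~0.4 API; later releases renamed it to load):

from flair.data import Sentence
from flair.models import SequenceTagger

# load the best checkpoint written during training (path as in the log above)
tagger = SequenceTagger.load_from_file('resources/taggers/pos-extvec/best-model.pt')

# tag a sample sentence and print the predicted POS tags
sentence = Sentence('Flair is a framework for NLP .')
tagger.predict(sentence)
print(sentence.to_tagged_string())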

Thanks! You saved me a lot of time.

alanakbik commented 5 years ago

Cool, thanks for checking this out!

For us, this means we have to take a closer look at what changed between the versions. Generally, quality should get better with newer versions, not the other way around :) We'll take a look!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.