Janluke0 / PoS-Tagging


Tokenizer comparison #1

Open Janluke0 opened 2 years ago

Janluke0 commented 2 years ago

Tokenizers

| name | avg_sentence_len | max_sentence_len | min_sentence_len | used_tokens | vocab_size |
|------|------------------|------------------|------------------|-------------|------------|
| BPE | 38.7161 | 87 | 4 | 1920 | 2048 |
| WordPiece | 36.2071 | 84 | 3 | 2017 | 2048 |
| BERT_pretrained | 28.2618 | 76 | 1 | 12355 | 31102 |
| DBERT_pretrained | 33.1526 | 79 | 1 | 10843 | 119547 |
| ELECTRA_pretrained | 30.103 | 78 | 1 | 13503 | 31102 |
| ROBERTA_pretrained | 31.6816 | 75 | 1 | 13065 | 250002 |

Models

| model | # params (no emb.) | dropout |
|-------|--------------------|---------|
| LSTM (64) | 95518 | 10% |
| LSTM (128) | 371230 | 10% |
| GRU (128) | 371230 | 10% |
| Self Att. (128,1hd,1lyr) | 202398 | 10% |
| Self Att. 2 (128,2hd,3lyr) | 598942 | 50% |
| Trans (128,1hd,1lyr) | 467230 | 10% |
| Trans (256,4hd,2lyr) | 3695134 | 10% |

IO

x_n is the n-th token of the sentence.

y_n is the n-th POS tag; it differs from the special explicit-pad token only if x_n is the first token of its word (all non-initial sub-tokens are labeled with the pad token). A minimal sketch of this alignment is given below.

This choice requires a pre-tokenization that follows the same criteria used to build the dataset, but it allows different tokenizers to be combined easily.

The only-first-token approach did not change performance significantly in preliminary experiments.
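
The sketch below illustrates the alignment, assuming a HuggingFace-style `tokenize` method and a `PAD_TAG` label name (both assumptions, not names taken from this repo):

```python
PAD_TAG = "<PAD>"  # assumed name for the explicit padding label

def align_tags(words, word_tags, tokenizer):
    """Expand word-level tags to sub-token level: only the first
    sub-token of each word keeps the POS tag, the rest get PAD_TAG."""
    tokens, tags = [], []
    for word, tag in zip(words, word_tags):
        sub_tokens = tokenizer.tokenize(word)   # e.g. a HF-style tokenizer
        tokens.extend(sub_tokens)
        tags.extend([tag] + [PAD_TAG] * (len(sub_tokens) - 1))
    return tokens, tags
```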

LSTM/GRU

LSTM and GRU models share the same bidirectional, 2-layer architecture.

```mermaid
flowchart LR;
    x_0((x_0)) --> e0[Embedding] --> Lin_0[LSTM_IN] --> Lout_0[LSTM_out] --> Clf_0[Linear] --> y_0((y_0));
    x_f((x_f)) --> ef[Embedding] --> Lin_1[LSTM_IN] --> Lout_1[LSTM_out] --> Clf_1[Linear] --> y_f((y_f));
```
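
A minimal PyTorch sketch of the structure in the diagram (the hyperparameter defaults below are illustrative, not the repo's actual values):

```python
import torch.nn as nn

class RNNTagger(nn.Module):
    """Embedding -> bidirectional 2-layer LSTM/GRU -> per-token Linear."""
    def __init__(self, vocab_size, n_tags, emb_dim=128, hidden=128,
                 dropout=0.1, cell=nn.LSTM):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = cell(emb_dim, hidden, num_layers=2, bidirectional=True,
                        batch_first=True, dropout=dropout)
        self.clf = nn.Linear(2 * hidden, n_tags)  # 2x for both directions

    def forward(self, x):                  # x: (batch, seq_len) token ids
        h, _ = self.rnn(self.emb(x))       # h: (batch, seq_len, 2*hidden)
        return self.clf(h)                 # per-token tag logits
```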

Self Attention

Only the TransformerEncoder part is used. A key padding mask is applied to mask out the padding tokens.
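
A sketch of this encoder-only variant using `src_key_padding_mask`, under the assumption that padding positions are identified by a `pad_id` token id (positional encodings omitted for brevity):

```python
import torch.nn as nn

class SelfAttTagger(nn.Module):
    """TransformerEncoder-only tagger; pads are masked out of attention."""
    def __init__(self, vocab_size, n_tags, d_model=128, n_heads=1,
                 n_layers=1, dropout=0.1, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dropout=dropout, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, n_layers)
        self.clf = nn.Linear(d_model, n_tags)

    def forward(self, x):                        # x: (batch, seq_len)
        pad_mask = x == self.pad_id              # True where x is padding
        h = self.enc(self.emb(x), src_key_padding_mask=pad_mask)
        return self.clf(h)
```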

Transformer (Token2Tag)

Full transformer architecture trained directly on the final task. In this case explicit padding was not used, since the model is able to produce an output of a different length.

However, pre-tokenization is still required to link words and tags.
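
One way to read this setup, as a sketch only: it assumes the "full transformer" is a standard encoder-decoder where the tag sequence is decoded autoregressively (positional encodings again omitted for brevity).

```python
import torch.nn as nn

class Token2Tag(nn.Module):
    """Encoder-decoder transformer that decodes the tag sequence directly,
    so the output length is not tied to the padded input length."""
    def __init__(self, vocab_size, n_tags, d_model=128, n_heads=1, n_layers=1):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, d_model)
        self.tgt_emb = nn.Embedding(n_tags, d_model)
        self.transformer = nn.Transformer(d_model, n_heads,
                                          num_encoder_layers=n_layers,
                                          num_decoder_layers=n_layers,
                                          batch_first=True)
        self.clf = nn.Linear(d_model, n_tags)

    def forward(self, src, tgt):
        # causal mask: each decoded tag attends only to the previous ones
        causal = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt),
                             tgt_mask=causal)
        return self.clf(h)
```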

Model performance by tokenizer

Accuracy is computed ignoring the BOS/EOS and padding labels.
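
A sketch of such a masked accuracy, assuming the label ids of BOS/EOS/PAD are collected in `ignore_ids` (an assumed name):

```python
import torch

def masked_accuracy(logits, labels, ignore_ids):
    """Token accuracy over positions whose gold label is not BOS/EOS/PAD."""
    preds = logits.argmax(dim=-1)
    mask = torch.ones_like(labels, dtype=torch.bool)
    for i in ignore_ids:
        mask &= labels != i
    return (preds[mask] == labels[mask]).float().mean().item()
```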

Max accuracy

| name | GRU | LSTM64 | SA | LSTM128 |
|------|-----|--------|----|---------|
| BPE | 92.49 | 91.42 | 83.07 | 92.04 |
| WordPiece | 91.87 | 90.53 | 83.34 | 91.19 |
| BERT_pretrained | 90.15 | 89.32 | 86.2 | 90.42 |
| DBERT_pretrained | 91.07 | 89.60 | 84.42 | 90.96 |
| ELECTRA_pretrained | 90.53 | 89.93 | 85.86 | 90.39 |
| ROBERTA_pretrained | 91.66 | 90.19 | 85.86 | 91.56 |

The maximum number of epochs is 2000. The early-stopping condition is placed on the validation accuracy: training stops when the improvement falls below 10^-7.
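
A sketch of that stopping rule (the single-epoch `patience` is an assumption; the only stated value is the 10^-7 improvement threshold):

```python
class EarlyStopping:
    """Stop when validation accuracy improves by less than min_delta."""
    def __init__(self, min_delta=1e-7, patience=1):
        self.min_delta, self.patience = min_delta, patience
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, val_acc):
        if val_acc - self.best > self.min_delta:
            self.best, self.bad_epochs = val_acc, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training
```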

Epochs

| name | GRU | LSTM64 | SA | LSTM128 |
|------|-----|--------|----|---------|
| BPE | ~900 | ~500 | ~1300 | ~1400 |
| WordPiece | 2000 | ~500 | ~450 | ~1000 |
| BERT_pretrained | ~1400 | ~450 | ~450 | ~1600 |
| DBERT_pretrained | 2000 | ~500 | ~500 | ~1750 |
| ELECTRA_pretrained | ~1750 | ~450 | ~500 | 2000 |
| ROBERTA_pretrained | 2000 | ~700 | 2000 | 2000 |
Janluke0 commented 2 years ago

https://github.com/Janluke0/PoS-Tagging/commit/bbd0d53bd3e975f5ab68784292c2dc10e86633ee This error resulted in the inclusion of the PAD token in the loss computation.

Accuracy was not affected.

A re-run is required.
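
One possible shape of such a fix (a sketch, not necessarily what the linked commit does), using `ignore_index` so that padded positions no longer contribute to the loss; `PAD_ID` is an assumed name:

```python
import torch.nn as nn

PAD_ID = 0  # assumed id of the padding label

def tagging_loss(logits, labels):
    """Cross-entropy over all positions except those labelled PAD."""
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
    # logits: (batch, seq_len, n_tags), labels: (batch, seq_len)
    return criterion(logits.flatten(0, 1), labels.flatten())
```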

Janluke0 commented 2 years ago

Model performance by tokenizer

After the bugfix

Max accuracy

| name | GRU | LSTM64 | SA | LSTM128 | TRANS | SA2 |
|------|-----|--------|----|---------|-------|-----|
| BPE | # | # | 83.38 | 92.82 | 68.67 | 87.90 |
| WordPiece | # | # | 83.63 | 90.62 | 69.66 | 88.04 |
| BERT_pretrained | # | # | 86.34 | 90.35 | 61.22 | 88.61 |
| DBERT_pretrained | # | # | 84.34 | 90.62 | 60.04 | 86.81 |
| ELECTRA_pretrained | # | # | 85.97 | 90.51 | 58.18 | 88.28 |
| ROBERTA_pretrained | # | # | 87.11 | 91.37 | 59.01 | 89.19 |

Epochs

| name | GRU | LSTM64 | SA | LSTM128 | TRANS | SA2 |
|------|-----|--------|----|---------|-------|-----|
| BPE | # | # | 2000 | 2000 | ~1750 | ~1000 |
| WordPiece | # | # | ~450 | ~450 | ~800 | ~1400 |
| BERT_pretrained | # | # | ~500 | ~1750 | ~500 | ~500 |
| DBERT_pretrained | # | # | ~500 | ~1250 | ~500 | ~120 |
| ELECTRA_pretrained | # | # | ~600 | 2000 | ~500 | ~650 |
| ROBERTA_pretrained | # | # | 2000 | 2000 | ~500 | ~1100 |