Janluke0 / PoS-Tagging


Tokenizer comparison #1

Open Janluke0 opened 2 years ago

Janluke0 commented 2 years ago

Tokenizers

| name | avg_sentence_len | max_sentence_len | min_sentence_len | used_tokens | vocab_size |
|------|------------------|------------------|------------------|-------------|------------|
| BPE | 38.7161 | 87 | 4 | 1920 | 2048 |
| WordPiece | 36.2071 | 84 | 3 | 2017 | 2048 |
| BERT_pretrained | 28.2618 | 76 | 1 | 12355 | 31102 |
| DBERT_pretrained | 33.1526 | 79 | 1 | 10843 | 119547 |
| ELECTRA_pretrained | 30.103 | 78 | 1 | 13503 | 31102 |
| ROBERTA_pretrained | 31.6816 | 75 | 1 | 13065 | 250002 |

Models

| model | # params (no emb.) | dropout |
|-------|--------------------|---------|
| LSTM (64) | 95518 | 10% |
| LSTM (128) | 371230 | 10% |
| GRU (128) | 371230 | 10% |
| Self Att. (128,1hd,1lyr) | 202398 | 10% |
| Self Att. 2 (128,2hd,3lyr) | 598942 | 50% |
| Trans (128,1hd,1lyr) | 467230 | 10% |
| Trans (256,4hd,2lyr) | 3695134 | 10% |

IO

x_n is the n-th token of the sentence.

y_n is the n-th POS tag; it differs from the special explicit-pad token only if x_n is the first token of its word (all non-initial sub-tokens are labeled with the pad token). A minimal sketch of this alignment is given below.

This choice requires a pre-tokenization that follows the same criteria used to build the dataset, but it allows different tokenizers to be combined easily.

The only-first-token approach did not change performance significantly in preliminary experiments.
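
The sketch below illustrates the alignment, assuming a HuggingFace-style `tokenize` method and a `PAD_TAG` label name (both assumptions, not names taken from this repo):

```python
PAD_TAG = "<PAD>"  # assumed name for the explicit padding label

def align_tags(words, word_tags, tokenizer):
    """Expand word-level tags to sub-token level: only the first
    sub-token of each word keeps the POS tag, the rest get PAD_TAG."""
    tokens, tags = [], []
    for word, tag in zip(words, word_tags):
        sub_tokens = tokenizer.tokenize(word)   # e.g. a HF-style tokenizer
        tokens.extend(sub_tokens)
        tags.extend([tag] + [PAD_TAG] * (len(sub_tokens) - 1))
    return tokens, tags
```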

LSTM/GRU

LSTM and GRU models share the same bidirectional, 2-layer architecture.

```mermaid
flowchart LR;
    x_0((x_0)) --> e0[Embedding] --> Lin_0[LSTM_IN] --> Lout_0[LSTM_out] --> Clf_0[Linear] --> y_0((y_0));
    x_f((x_f)) --> ef[Embedding] --> Lin_1[LSTM_IN] --> Lout_1[LSTM_out] --> Clf_1[Linear] --> y_f((y_f));
```
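
A minimal PyTorch sketch of the structure in the diagram (the hyperparameter defaults below are illustrative, not the repo's actual values):

```python
import torch.nn as nn

class RNNTagger(nn.Module):
    """Embedding -> bidirectional 2-layer LSTM/GRU -> per-token Linear."""
    def __init__(self, vocab_size, n_tags, emb_dim=128, hidden=128,
                 dropout=0.1, cell=nn.LSTM):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = cell(emb_dim, hidden, num_layers=2, bidirectional=True,
                        batch_first=True, dropout=dropout)
        self.clf = nn.Linear(2 * hidden, n_tags)  # 2x for both directions

    def forward(self, x):                  # x: (batch, seq_len) token ids
        h, _ = self.rnn(self.emb(x))       # h: (batch, seq_len, 2*hidden)
        return self.clf(h)                 # per-token tag logits
```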

Self Attention

Only the TransformerEncoder part is used. A key padding mask is applied to mask out the padding tokens.
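
A sketch of this encoder-only variant using `src_key_padding_mask`, under the assumption that padding positions are identified by a `pad_id` token id (positional encodings omitted for brevity):

```python
import torch.nn as nn

class SelfAttTagger(nn.Module):
    """TransformerEncoder-only tagger; pads are masked out of attention."""
    def __init__(self, vocab_size, n_tags, d_model=128, n_heads=1,
                 n_layers=1, dropout=0.1, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dropout=dropout, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, n_layers)
        self.clf = nn.Linear(d_model, n_tags)

    def forward(self, x):                        # x: (batch, seq_len)
        pad_mask = x == self.pad_id              # True where x is padding
        h = self.enc(self.emb(x), src_key_padding_mask=pad_mask)
        return self.clf(h)
```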

Transformer (Token2Tag)

Full transformer architecture trained directly on the final task. In this case explicit padding was not used, since the model is able to produce an output of a different length.

However, pre-tokenization is still required to link words and tags.
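
One way to read this setup, as a sketch only: it assumes the "full transformer" is a standard encoder-decoder where the tag sequence is decoded autoregressively (positional encodings again omitted for brevity).

```python
import torch.nn as nn

class Token2Tag(nn.Module):
    """Encoder-decoder transformer that decodes the tag sequence directly,
    so the output length is not tied to the padded input length."""
    def __init__(self, vocab_size, n_tags, d_model=128, n_heads=1, n_layers=1):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, d_model)
        self.tgt_emb = nn.Embedding(n_tags, d_model)
        self.transformer = nn.Transformer(d_model, n_heads,
                                          num_encoder_layers=n_layers,
                                          num_decoder_layers=n_layers,
                                          batch_first=True)
        self.clf = nn.Linear(d_model, n_tags)

    def forward(self, src, tgt):
        # causal mask: each decoded tag attends only to the previous ones
        causal = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt),
                             tgt_mask=causal)
        return self.clf(h)
```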

Model performance by tokenizer

Accuracy is computed ignoring the BOS/EOS and padding labels.
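
A sketch of such a masked accuracy, assuming the label ids of BOS/EOS/PAD are collected in `ignore_ids` (an assumed name):

```python
import torch

def masked_accuracy(logits, labels, ignore_ids):
    """Token accuracy over positions whose gold label is not BOS/EOS/PAD."""
    preds = logits.argmax(dim=-1)
    mask = torch.ones_like(labels, dtype=torch.bool)
    for i in ignore_ids:
        mask &= labels != i
    return (preds[mask] == labels[mask]).float().mean().item()
```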

Max accuracy

| name | GRU | LSTM64 | SA | LSTM128 |
|------|-----|--------|----|---------|
| BPE | 92.49 | 91.42 | 83.07 | 92.04 |
| WordPiece | 91.87 | 90.53 | 83.34 | 91.19 |
| BERT_pretrained | 90.15 | 89.32 | 86.2 | 90.42 |
| DBERT_pretrained | 91.07 | 89.60 | 84.42 | 90.96 |
| ELECTRA_pretrained | 90.53 | 89.93 | 85.86 | 90.39 |
| ROBERTA_pretrained | 91.66 | 90.19 | 85.86 | 91.56 |

The maximum number of epochs is 2000. The early-stopping condition is placed on the validation accuracy: training stops when the improvement falls below 10^-7.
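
A sketch of that stopping rule (the single-epoch `patience` is an assumption; the only stated value is the 10^-7 improvement threshold):

```python
class EarlyStopping:
    """Stop when validation accuracy improves by less than min_delta."""
    def __init__(self, min_delta=1e-7, patience=1):
        self.min_delta, self.patience = min_delta, patience
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, val_acc):
        if val_acc - self.best > self.min_delta:
            self.best, self.bad_epochs = val_acc, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training
```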

Epochs

| name | GRU | LSTM64 | SA | LSTM128 |
|------|-----|--------|----|---------|
| BPE | ~900 | ~500 | ~1300 | ~1400 |
| WordPiece | 2000 | ~500 | ~450 | ~1000 |
| BERT_pretrained | ~1400 | ~450 | ~450 | ~1600 |
| DBERT_pretrained | 2000 | ~500 | ~500 | ~1750 |
| ELECTRA_pretrained | ~1750 | ~450 | ~500 | 2000 |
| ROBERTA_pretrained | 2000 | ~700 | 2000 | 2000 |
Janluke0 commented 2 years ago

https://github.com/Janluke0/PoS-Tagging/commit/bbd0d53bd3e975f5ab68784292c2dc10e86633ee This error resulted in the inclusion of the PAD token in the loss computation.

Accuracy was not affected.

A re-run is required.
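
One possible shape of such a fix (a sketch, not necessarily what the linked commit does), using `ignore_index` so that padded positions no longer contribute to the loss; `PAD_ID` is an assumed name:

```python
import torch.nn as nn

PAD_ID = 0  # assumed id of the padding label

def tagging_loss(logits, labels):
    """Cross-entropy over all positions except those labelled PAD."""
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
    # logits: (batch, seq_len, n_tags), labels: (batch, seq_len)
    return criterion(logits.flatten(0, 1), labels.flatten())
```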

Janluke0 commented 2 years ago

Model performance by tokenizer

After the bugfix

Max accuracy

| name | GRU | LSTM64 | SA | LSTM128 | TRANS | SA2 |
|------|-----|--------|----|---------|-------|-----|
| BPE | # | # | 83.38 | 92.82 | 68.67 | 87.90 |
| WordPiece | # | # | 83.63 | 90.62 | 69.66 | 88.04 |
| BERT_pretrained | # | # | 86.34 | 90.35 | 61.22 | 88.61 |
| DBERT_pretrained | # | # | 84.34 | 90.62 | 60.04 | 86.81 |
| ELECTRA_pretrained | # | # | 85.97 | 90.51 | 58.18 | 88.28 |
| ROBERTA_pretrained | # | # | 87.11 | 91.37 | 59.01 | 89.19 |

Epochs

| name | GRU | LSTM64 | SA | LSTM128 | TRANS | SA2 |
|------|-----|--------|----|---------|-------|-----|
| BPE | # | # | 2000 | 2000 | ~1750 | ~1000 |
| WordPiece | # | # | ~450 | ~450 | ~800 | ~1400 |
| BERT_pretrained | # | # | ~500 | ~1750 | ~500 | ~500 |
| DBERT_pretrained | # | # | ~500 | ~1250 | ~500 | ~120 |
| ELECTRA_pretrained | # | # | ~600 | 2000 | ~500 | ~650 |
| ROBERTA_pretrained | # | # | 2000 | 2000 | ~500 | ~1100 |