apmoore1 / target-extraction

Target based extraction
https://apmoore1.github.io/target-extraction/
Apache License 2.0

Bert type Tokenisers #15

Closed apmoore1 closed 4 years ago

apmoore1 commented 4 years ago

The BERT/transformer tokenisers normally require first processing the text through a tokeniser like Spacy and then performing BPE afterwards, from what I believe. Adding this type of tokeniser to the library is required for the sequence labelling tasks so that the sequence labels can be produced. This is not needed for the text classification tasks, as the tokenisation can happen within the AllenNLP dataset reader using their pretrained_transformer_tokenizer. Also worth considering is that the Spacy and transformer tokenizers give you the offsets of the tokens for free, which we use in this library.
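A minimal sketch of the two-stage tokenisation described above: word-level tokenisation first (Spacy in the library), then subword/BPE splitting, while keeping character offsets and a subword-to-word alignment so that per-word sequence labels can still be produced. The whitespace tokeniser and the toy WordPiece-style splitter here are stand-ins for illustration only, not the library's actual code.

```python
def word_tokenise(text):
    """Whitespace tokeniser standing in for Spacy; returns (token, start, end) offsets."""
    tokens, start = [], 0
    for word in text.split():
        start = text.index(word, start)
        tokens.append((word, start, start + len(word)))
        start += len(word)
    return tokens

def toy_subwords(word, max_len=4):
    """Toy WordPiece-style splitter: chunks of up to max_len characters,
    continuation pieces prefixed with '##' as in BERT's tokeniser."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def tokenise_with_alignment(text):
    """Word tokenise, then split each word into subwords, recording which
    word each subword came from so sequence labels stay aligned."""
    words = word_tokenise(text)
    subwords, word_ids = [], []
    for word_idx, (word, _, _) in enumerate(words):
        for piece in toy_subwords(word):
            subwords.append(piece)
            word_ids.append(word_idx)  # map each subword back to its word
    return words, subwords, word_ids

words, subwords, word_ids = tokenise_with_alignment("The laptop keyboard")
# words    -> [('The', 0, 3), ('laptop', 4, 10), ('keyboard', 11, 19)]
# subwords -> ['The', 'lapt', '##op', 'keyb', '##oard']
# word_ids -> [0, 1, 1, 2, 2]
```

The `word_ids` list is the key piece: a label predicted (or provided) for word `i` can be copied to every subword with `word_ids[j] == i`, and vice versa.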

apmoore1 commented 4 years ago

Note: as the BERT/Transformer tokenisers first tokenise using something like Spacy before applying BPE, the tokenisation error will be the same as that of the Spacy tokeniser.

apmoore1 commented 4 years ago

This is no longer a problem, as the AllenNLP framework has incorporated BERT via HuggingFace Transformers into the embedding layer, handling the BPE tokenisation after the Spacy tokenisation. See pretrained_transformer_mismatched_embedder
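For reference, a hedged sketch of how an AllenNLP config might pair the mismatched indexer and embedder. The `"bert-base-uncased"` model name and the surrounding keys are illustrative assumptions, not this library's actual config:

```jsonnet
{
  "dataset_reader": {
    // Reader still tokenises at the word level (e.g. with Spacy);
    // the mismatched indexer applies BPE to each word afterwards.
    "token_indexers": {
      "tokens": {
        "type": "pretrained_transformer_mismatched",
        "model_name": "bert-base-uncased"
      }
    }
  },
  "model": {
    "text_field_embedder": {
      // The mismatched embedder pools the subword embeddings back into
      // one vector per original word, so sequence labels stay aligned.
      "token_embedders": {
        "tokens": {
          "type": "pretrained_transformer_mismatched",
          "model_name": "bert-base-uncased"
        }
      }
    }
  }
}
```

The "mismatched" pairing is what makes per-word sequence labelling work on top of BPE: indexing happens at the subword level, but the model sees one embedding per word.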