HariWu1995 opened this issue 3 years ago
Recently, I encountered a situation where the WordPiece tokenizer works pretty poorly on uppercase text. When I debugged it, I discovered that it tokenizes uppercase text into nonsense tokens, i.e. meaningless pairs of characters. For example:

airport bus service is excellent --> ['airport', 'bus', 'service', 'is', 'excellent']
AIRPORT BUS SERVICE IS EXCELLENT --> ['AI', '##R', '##PO', '##RT', 'B', '##US', 'SE', '##R', '##VI', '##CE', 'IS', 'EX', '##CE', '##LL', '##EN', '##T']

I still want to retain the uppercase because it conveys the sentiment of the writer. So how can I improve the tokenizer?
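For reference, the behavior can be reproduced directly with the Hugging Face transformers tokenizer. A minimal sketch; bert-base-cased is an assumption, but any cased WordPiece vocabulary shows the same effect:

```python
from transformers import AutoTokenizer

# Load a cased WordPiece tokenizer (bert-base-cased is an assumption)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Lowercase words are in the vocabulary and stay whole
print(tokenizer.tokenize("airport bus service is excellent"))
# ['airport', 'bus', 'service', 'is', 'excellent']

# All-caps words are rare in the vocabulary, so the tokenizer falls back
# to short character fragments
print(tokenizer.tokenize("AIRPORT BUS SERVICE IS EXCELLENT"))
# ['AI', '##R', '##PO', '##RT', 'B', '##US', 'SE', '##R', '##VI', '##CE',
#  'IS', 'EX', '##CE', '##LL', '##EN', '##T']
```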
Couldn't the tokenizers include a special token that indicates the beginning of an uppercase sequence of words, maybe '[UP]' (with '[LO]' to close the sequence)? The example could then be tokenized as:

airport bus service is excellent --> ['airport', 'bus', 'service', 'is', 'excellent']
AIRPORT BUS SERVICE IS EXCELLENT --> ['[UP]', 'airport', 'bus', 'service', 'is', 'excellent', '[LO]']

Since both versions reference the same tokens, it would be easy for the model to recognize the correspondence between the same words in different capitalization scenarios, and to learn the role capitalization plays in context.

A special token indicating that only the next word starts in uppercase could be included as well (maybe '[FU]', for "first upper"). That way, information such as the first word of a sentence, proper names, etc. could be captured without mapping those words to separate token ids.

These are just ideas; I have no concrete implementation, but a rough prototype is sketched below.
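One way such markers could be prototyped without retraining the vocabulary is to normalize the text before tokenization: register '[UP]' and '[LO]' as additional special tokens, lowercase any all-caps span, and wrap it in the markers. A minimal sketch, assuming bert-base-cased; the preprocess_caps helper and its regex are hypothetical:

```python
import re

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

# Register the proposed markers so the tokenizer never splits them
tokenizer.add_special_tokens({"additional_special_tokens": ["[UP]", "[LO]"]})
# Grow the embedding matrix to cover the new token ids
model.resize_token_embeddings(len(tokenizer))

# Hypothetical helper: wrap runs of all-caps words in [UP] ... [LO]
# and lowercase them so the base vocabulary can handle the words
CAPS_RUN = re.compile(r"\b(?:[A-Z]{2,}\s*)+")

def preprocess_caps(text: str) -> str:
    return CAPS_RUN.sub(lambda m: f"[UP] {m.group(0).strip().lower()} [LO] ", text).strip()

print(tokenizer.tokenize(preprocess_caps("AIRPORT BUS SERVICE IS EXCELLENT")))
# ['[UP]', 'airport', 'bus', 'service', 'is', 'excellent', '[LO]']
```

Note that the embeddings for the new marker tokens are randomly initialized, so the model would still need fine-tuning or continued pre-training before '[UP]' and '[LO]' carry any signal.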
Hello,
I don't know whether this is still relevant, but I explored something similar in this paper:
Hi,

1) Just because the text is broken down into more word pieces does not mean that the derived embeddings are worse.

2) You could add new tokens to the tokenizer as described here: https://github.com/huggingface/transformers/issues/1413

But then you need to continue pre-training with the masked language model objective so that the embeddings of these new tokens are learned.
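A minimal sketch of that workflow, assuming bert-base-cased; the token list is illustrative, and corpus.txt stands in for whatever in-domain text is available:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Add the uppercase forms you care about as whole tokens (illustrative list)
tokenizer.add_tokens(["AIRPORT", "BUS", "SERVICE", "EXCELLENT"])
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized

# Continue masked-language-model pre-training so the new embeddings are learned
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="corpus.txt",  # hypothetical corpus
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-continued", num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```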