HariWu1995 opened this issue 3 years ago
Recently, I encountered a situation where the WordPiece tokenizer works pretty poorly on uppercase text. When I debugged it, I discovered that it tokenizes uppercase text into nonsense tokens, i.e. meaningless pairs of characters. For example:

airport bus service is excellent --> ['airport', 'bus', 'service', 'is', 'excellent']
AIRPORT BUS SERVICE IS EXCELLENT --> ['AI', '##R', '##PO', '##RT', 'B', '##US', 'SE', '##R', '##VI', '##CE', 'IS', 'EX', '##CE', '##LL', '##EN', '##T']

I still want to retain the uppercase because it conveys the sentiment of the writer. So how can I improve the tokenizer?
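For reference, the behavior can be reproduced directly with the Hugging Face transformers tokenizer. A minimal sketch; bert-base-cased is an assumption, but any cased WordPiece vocabulary shows the same effect:

```python
from transformers import AutoTokenizer

# Load a cased WordPiece tokenizer (bert-base-cased is an assumption)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Lowercase words are in the vocabulary and stay whole
print(tokenizer.tokenize("airport bus service is excellent"))
# ['airport', 'bus', 'service', 'is', 'excellent']

# All-caps words are rare in the vocabulary, so the tokenizer falls back
# to short character fragments
print(tokenizer.tokenize("AIRPORT BUS SERVICE IS EXCELLENT"))
# ['AI', '##R', '##PO', '##RT', 'B', '##US', 'SE', '##R', '##VI', '##CE',
#  'IS', 'EX', '##CE', '##LL', '##EN', '##T']
```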
Couldn't the tokenizers include a special token that indicates the beginning of an uppercase sequence of words, maybe '[UP]' (with '[LO]' to close the sequence)? The example could then be tokenized as:

airport bus service is excellent --> ['airport', 'bus', 'service', 'is', 'excellent']
AIRPORT BUS SERVICE IS EXCELLENT --> ['[UP]', 'airport', 'bus', 'service', 'is', 'excellent', '[LO]']

Since both versions reference the same tokens, it would be easy for the model to recognize the correspondence between the same words in different capitalization scenarios, and to learn the role capitalization plays in context.

A special token indicating that only the next word starts in uppercase could be included as well (maybe '[FU]', for "first upper"). That way, information such as the first word of a sentence, proper names, etc. could be captured without mapping those words to separate token ids.

These are just ideas; I have no concrete implementation, but a rough prototype is sketched below.
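One way such markers could be prototyped without retraining the vocabulary is to normalize the text before tokenization: register '[UP]' and '[LO]' as additional special tokens, lowercase any all-caps span, and wrap it in the markers. A minimal sketch, assuming bert-base-cased; the preprocess_caps helper and its regex are hypothetical:

```python
import re

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

# Register the proposed markers so the tokenizer never splits them
tokenizer.add_special_tokens({"additional_special_tokens": ["[UP]", "[LO]"]})
# Grow the embedding matrix to cover the new token ids
model.resize_token_embeddings(len(tokenizer))

# Hypothetical helper: wrap runs of all-caps words in [UP] ... [LO]
# and lowercase them so the base vocabulary can handle the words
CAPS_RUN = re.compile(r"\b(?:[A-Z]{2,}\s*)+")

def preprocess_caps(text: str) -> str:
    return CAPS_RUN.sub(lambda m: f"[UP] {m.group(0).strip().lower()} [LO] ", text).strip()

print(tokenizer.tokenize(preprocess_caps("AIRPORT BUS SERVICE IS EXCELLENT")))
# ['[UP]', 'airport', 'bus', 'service', 'is', 'excellent', '[LO]']
```

Note that the embeddings for the new marker tokens are randomly initialized, so the model would still need fine-tuning or continued pre-training before '[UP]' and '[LO]' carry any signal.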
Hello,
I don't know whether this is still relevant, but I explored something similar in this paper:
Hi,

1) Just because the text is broken down into more word pieces does not mean that the derived embeddings are worse.

2) You could add new tokens to the tokenizer as described here: https://github.com/huggingface/transformers/issues/1413

But then you need to continue pre-training with the masked language model objective so that the embeddings of these new tokens are learned.
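A minimal sketch of that workflow, assuming bert-base-cased; the token list is illustrative, and corpus.txt stands in for whatever in-domain text is available:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Add the uppercase forms you care about as whole tokens (illustrative list)
tokenizer.add_tokens(["AIRPORT", "BUS", "SERVICE", "EXCELLENT"])
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized

# Continue masked-language-model pre-training so the new embeddings are learned
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="corpus.txt",  # hypothetical corpus
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-continued", num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```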