WorksApplications / SudachiTra

Japanese tokenizer for Transformers
Apache License 2.0

Vocabulary file handling #57


mh-northlander commented 1 year ago

JapaneseWordPieceTokenizer, which we use to build the vocabulary, recognizes '\n' (or ' ') as a token. BertSudachipyTokenizer, however, removes them from the tokenization results. Currently we simply ignore those tokens and the problems they cause (#54).
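
For context, the usual vocab.txt format stores one token per line, so a token that itself contains '\n' silently shifts every subsequent token ID when the file is read back. A minimal sketch of the failure mode (the token strings below are illustrative, not from an actual chiTra vocab):

```python
# Sketch: how a '\n' token corrupts a line-per-token vocab file.
# Tokens here are illustrative only.
tokens = ["[PAD]", "[UNK]", "\n", "日本", "語"]

# Writing one token per line: the '\n' token adds an extra line break.
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tokens))

# Reading it back the usual way yields a longer list, so every ID
# after the '\n' token is shifted.
with open("vocab.txt", encoding="utf-8") as f:
    loaded = f.read().split("\n")

print(loaded)                    # ['[PAD]', '[UNK]', '', '', '日本', '語']
print(len(tokens), len(loaded))  # 5 6
```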

  1. We may need error handling for vocab file corruption.

  2. It may be better to make use of those tokens. In that case we need to prepare a new vocab file format (the current txt format cannot represent '\n'), modify the chiTra tokenizer, and reconsider the corpus cleaning steps that touch those tokens.

  3. If instead we do not use those tokens, we should remove them during vocab building (see the sketch after this list).
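
As a sketch of options 1 and 3 combined, the vocab could be validated at build time and whitespace-only tokens either rejected or filtered out. The function names here are hypothetical, not part of the SudachiTra API:

```python
# Hypothetical helpers for option 1 (detect corruption) and option 3
# (drop whitespace-only tokens during vocab building); not SudachiTra API.
from typing import List


def find_whitespace_tokens(tokens: List[str]) -> List[int]:
    """Return indices of tokens that are empty or whitespace-only,
    i.e. tokens the txt vocab format cannot represent safely."""
    return [i for i, tok in enumerate(tokens) if tok.strip() == ""]


def clean_vocab(tokens: List[str], strict: bool = False) -> List[str]:
    """Remove whitespace-only tokens; with strict=True, raise instead
    so vocab corruption is surfaced as an error (option 1)."""
    bad = find_whitespace_tokens(tokens)
    if bad and strict:
        raise ValueError(f"vocab contains whitespace-only tokens at ids {bad}")
    bad_set = set(bad)
    return [tok for i, tok in enumerate(tokens) if i not in bad_set]


if __name__ == "__main__":
    vocab = ["[PAD]", "[UNK]", "\n", " ", "日本", "語"]
    print(clean_vocab(vocab))  # ['[PAD]', '[UNK]', '日本', '語']
```

Filtering at build time keeps the txt format as-is; only option 2 would require the new file format and the accompanying tokenizer changes.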