SentencePiece also has other features that may replace the pre-post-process.sh scripts. By default it applies NFKC normalization, and this can be customized. The default normalization already covers some of what preprocess.sh does.
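To get a feel for what NFKC does to text, here is a minimal sketch using only Python's standard unicodedata module. It shows plain Unicode NFKC, not SentencePiece itself; SentencePiece's built-in rules are NFKC-based and may differ in details, and the sample strings are chosen only for illustration.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply plain Unicode NFKC normalization (standard library only)."""
    return unicodedata.normalize("NFKC", text)


# Illustrative inputs: full-width letters and digits, a ligature, a circled digit.
samples = ["ＡＢＣ１２３", "ﬁne", "①"]

for s in samples:
    print(repr(s), "->", repr(nfkc(s)))
```

Cleanups of this kind (width folding, ligature expansion) are typical of what a hand-written preprocessing script would otherwise do.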
If more or different normalization is needed, a rule file can be borrowed from https://github.com/google/sentencepiece/tree/master/data, modified, and passed to the spm_train step (via its --normalization_rule_tsv option), so the separate preprocessing can be dropped.
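A rough sketch of how extending a borrowed rule file might look. It assumes the TSV layout used by the files in that data directory as I understand it (source code points in hex separated by spaces, a tab, then target code points in hex). The extra rules, the file names custom.tsv and corpus.txt, and the helper functions are all hypothetical; only the spm_train flags shown are real options.

```python
def to_hex(text: str) -> str:
    """Encode a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):X}" for ch in text)


def rule_line(source: str, target: str) -> str:
    """One TSV rule: source code points, a tab, then target code points."""
    return f"{to_hex(source)}\t{to_hex(target)}"


def build_rules(base_rules: str, extra: dict) -> str:
    """Append extra rules to the text of a borrowed rule file."""
    lines = [ln for ln in base_rules.splitlines() if ln.strip()]
    lines += [rule_line(src, tgt) for src, tgt in extra.items()]
    return "\n".join(lines) + "\n"


def spm_train_args(corpus: str, rules_tsv: str, model_prefix: str, vocab_size: int) -> list:
    """Command line for spm_train that uses the custom rule file."""
    return [
        "spm_train",
        f"--input={corpus}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        f"--normalization_rule_tsv={rules_tsv}",
    ]


# Hypothetical extra rules: straighten curly quotes, expand the ellipsis character.
extra_rules = {"\u201c": '"', "\u201d": '"', "\u2026": "..."}

# In practice base_rules would be the contents of a file copied from the data directory.
rules_text = build_rules(base_rules="", extra=extra_rules)
print(rules_text, end="")
print(" ".join(spm_train_args("corpus.txt", "custom.tsv", "m", 8000)))
```

Writing rules_text to custom.tsv and running the printed command would train a model with the normalization baked in, so the same rules apply at encode time without an external script.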