hitcslj opened 4 weeks ago
I used SentencePiece on purely Chinese (character-level) text to generate the BPE candidate tokens, and then ran the VOLT script with the following command:
```bash
python3 ../ot_run.py --source_file spmoutput/source.file \
    --token_candidate_file spm.vocab \
    --vocab_file spmoutput/vocab \
    --max_number 10000 --interval 1000 --loop_in_ot 500 \
    --tokenizer sentencepiece --size_file spmoutput/size
```
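For context, the candidate tokens and the segmented source file used above came from an ordinary SentencePiece BPE run, roughly along the lines of this sketch (the file names and the candidate size of 30000 are just placeholders for what I actually used):

```python
import sentencepiece as spm

# Candidate-token generation: an ordinary SentencePiece BPE training run.
# "training_data.txt" and the candidate vocab size of 30000 are placeholders;
# the candidate vocabulary just needs to be at least as large as --max_number.
spm.SentencePieceTrainer.train(
    input="training_data.txt",
    model_prefix="spm",
    vocab_size=30000,
    character_coverage=1.0,
    model_type="bpe",
)

# Segment the raw source text with the candidate model so that ot_run.py
# can read segmented text from spmoutput/source.file.
sp = spm.SentencePieceProcessor(model_file="spm.model")
with open("source_file", encoding="utf-8") as fin, \
        open("spmoutput/source.file", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```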
The resulting vocabulary is simply a reduced version of the candidate vocabulary, following the relationship `vocab = vocab[:optimal_size]`.
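Concretely, if I read the output correctly, the relationship is just a truncation of the candidate list, as in this sketch (file names are the ones from my command above, and I am assuming the size file holds the integer later read by `best_size=$(cat spmoutput/size)`):

```python
# Sketch of the relationship I observe: the VOLT vocabulary is simply the
# first optimal_size entries of the candidate vocabulary.
with open("spm.vocab", encoding="utf-8") as f:
    candidates = [line.split("\t")[0] for line in f]  # "token<TAB>score" per line

with open("spmoutput/size", encoding="utf-8") as f:
    optimal_size = int(f.read().strip())

volt_vocab = candidates[:optimal_size]  # i.e. vocab = vocab[:optimal_size]
```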
Is the purpose of the first two steps merely to determine the `best_size`, or is there more to it? Additionally, does the third step involve retraining a vocabulary from scratch?
For reference, here is the third (SentencePiece-style) step from the README:

```bash
#sentencepiece style
#for the sentencepiece toolkit, here we only keep the optimal size
best_size=$(cat spmoutput/size)
#training_data contains the source data and, optionally, the target data
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe
python3 spm/spm_encoder.py --model spm.model --inputs source_file --outputs spmout/source.file --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_file --outputs spmout/target.file --output_format piece #optional if your task does not contain target texts
```
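If I read that script correctly, the third step retrains a SentencePiece BPE model from scratch and only reuses the size found by VOLT; roughly the Python-API equivalent of the `spm_train.py` call above (paths are taken from the snippet, so treat them as placeholders):

```python
import sentencepiece as spm

# My reading of the third step: train a brand-new BPE model on the raw
# training data; the only thing carried over from ot_run.py is best_size.
with open("spmoutput/size", encoding="utf-8") as f:
    best_size = int(f.read().strip())

spm.SentencePieceTrainer.train(
    input="training_data",      # same raw source (+ optional target) data
    model_prefix="spm",
    vocab_size=best_size,       # <- the optimal size found by VOLT
    character_coverage=1.0,
    model_type="bpe",
)
```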
Is the repo essentially implemented just to find the optimal size, as in the snippet `vocab = vocab[:optimal_size]`?