hitcslj opened 4 weeks ago
I used SentencePiece on purely Chinese (character-level) text to generate the BPE candidate tokens, and then ran the VOLT script with the following command:
```bash
python3 ../ot_run.py --source_file spmoutput/source.file \
    --token_candidate_file spm.vocab \
    --vocab_file spmoutput/vocab \
    --max_number 10000 --interval 1000 --loop_in_ot 500 \
    --tokenizer sentencepiece --size_file spmoutput/size
```
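For context, the candidate tokens and the segmented source file used above came from an ordinary SentencePiece BPE run, roughly along the lines of this sketch (the file names and the candidate size of 30000 are just placeholders for what I actually used):

```python
import sentencepiece as spm

# Candidate-token generation: an ordinary SentencePiece BPE training run.
# "training_data.txt" and the candidate vocab size of 30000 are placeholders;
# the candidate vocabulary just needs to be at least as large as --max_number.
spm.SentencePieceTrainer.train(
    input="training_data.txt",
    model_prefix="spm",
    vocab_size=30000,
    character_coverage=1.0,
    model_type="bpe",
)

# Segment the raw source text with the candidate model so that ot_run.py
# can read segmented text from spmoutput/source.file.
sp = spm.SentencePieceProcessor(model_file="spm.model")
with open("source_file", encoding="utf-8") as fin, \
        open("spmoutput/source.file", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```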
The resulting vocabulary is simply a reduced version of the candidate vocabulary, following the relationship `vocab = vocab[:optimal_size]`.
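Concretely, if I read the output correctly, the relationship is just a truncation of the candidate list, as in this sketch (file names are the ones from my command above, and I am assuming the size file holds the integer later read by `best_size=$(cat spmoutput/size)`):

```python
# Sketch of the relationship I observe: the VOLT vocabulary is simply the
# first optimal_size entries of the candidate vocabulary.
with open("spm.vocab", encoding="utf-8") as f:
    candidates = [line.split("\t")[0] for line in f]  # "token<TAB>score" per line

with open("spmoutput/size", encoding="utf-8") as f:
    optimal_size = int(f.read().strip())

volt_vocab = candidates[:optimal_size]  # i.e. vocab = vocab[:optimal_size]
```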
Is the purpose of the first two steps merely to determine the `best_size`, or is there more to it? Additionally, does the third step involve retraining a vocabulary from scratch?
For reference, here is the third (SentencePiece-style) step from the README:

```bash
#sentencepiece style
#for the sentencepiece toolkit, here we only keep the optimal size
best_size=$(cat spmoutput/size)
#training_data contains the source data and, optionally, the target data
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe
python3 spm/spm_encoder.py --model spm.model --inputs source_file --outputs spmout/source.file --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_file --outputs spmout/target.file --output_format piece #optional if your task does not contain target texts
```
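If I read that script correctly, the third step retrains a SentencePiece BPE model from scratch and only reuses the size found by VOLT; roughly the Python-API equivalent of the `spm_train.py` call above (paths are taken from the snippet, so treat them as placeholders):

```python
import sentencepiece as spm

# My reading of the third step: train a brand-new BPE model on the raw
# training data; the only thing carried over from ot_run.py is best_size.
with open("spmoutput/size", encoding="utf-8") as f:
    best_size = int(f.read().strip())

spm.SentencePieceTrainer.train(
    input="training_data",      # same raw source (+ optional target) data
    model_prefix="spm",
    vocab_size=best_size,       # <- the optimal size found by VOLT
    character_coverage=1.0,
    model_type="bpe",
)
```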
Is the repo essentially implemented just to find the optimal size, as in the snippet `vocab = vocab[:optimal_size]`?