facebookresearch / SimulEval

SimulEval: A General Evaluation Toolkit for Simultaneous Translation
Creative Commons Attribution Share Alike 4.0 International
102 stars 36 forks

Postprocessor Improvement #75

Open lh5844 opened 1 year ago

lh5844 commented 1 year ago

This PR addresses an issue where the SentencePiece detokenizer does not merge the pieces of a word that is split across two text segments. For example, if the first segment ends in "with" and the second begins with "out", the output treats them as two separate words. We want them joined as "without", which means correcting the prediction list, the delays list, and the elapsed list so that the latency measurements stay accurate.
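To illustrate the idea, here is a minimal sketch of the merge logic in Python. It is not the code in this PR and does not use SimulEval's API: the function name `merge_segments` and the `(pieces, delay, elapsed)` segment layout are hypothetical. It also assumes that a merged word should take the delay and elapsed time of the segment that completes it, and that pieces carry SentencePiece's "▁" word-start marker.

```python
from typing import List, Tuple

WORD_START = "\u2581"  # SentencePiece's word-boundary marker "▁"


def merge_segments(
    segments: List[Tuple[List[str], float, float]],
) -> Tuple[List[str], List[float], List[float]]:
    """Merge SentencePiece pieces into words across segment boundaries.

    Each segment is a hypothetical (pieces, delay, elapsed) tuple.
    Returns parallel lists: words, per-word delays, per-word elapsed times.
    """
    prediction_list: List[str] = []
    delays: List[float] = []
    elapsed: List[float] = []

    for pieces, delay, elapsed_time in segments:
        for piece in pieces:
            if piece.startswith(WORD_START) or not prediction_list:
                # The piece starts a new word: add one entry to every list.
                prediction_list.append(piece.lstrip(WORD_START))
                delays.append(delay)
                elapsed.append(elapsed_time)
            else:
                # The piece continues the previous word (e.g. "with" + "out").
                # Assumption: the word is only complete once this piece
                # arrives, so it takes this segment's delay and elapsed time.
                prediction_list[-1] += piece
                delays[-1] = delay
                elapsed[-1] = elapsed_time

    return prediction_list, delays, elapsed


segments = [
    (["\u2581I", "\u2581went", "\u2581with"], 1.0, 1.2),
    (["out", "\u2581you"], 2.0, 2.3),
]
words, delays, elapsed = merge_segments(segments)
print(words)    # ['I', 'went', 'without', 'you']
print(delays)   # [1.0, 1.0, 2.0, 2.0]
print(elapsed)  # [1.2, 1.2, 2.3, 2.3]
```

Keeping the three lists the same length after a merge is the point of the sketch: a latency scorer that pairs each word with a delay would otherwise count "with" and "out" as two words.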


To run spm_detokenizer_agent.py, use this command from the SimulEval directory:

simuleval \
    --user-dir examples \
    --agent-class examples.quick_start.spm_detokenizer_agent.DummyPipeline \
    --source examples/quick_start/spm_source.txt \
    --target examples/quick_start/spm_target.txt \
    --output tmp_output \
    --segment-k 3 \
    --sentencepiece-model examples/quick_start/tokenizer.model \
    --detokenize-only

This is the expected output for