Postprocessor Improvement

This fixes a case where the SentencePiece detokenizer did not merge pieces of a single word that were split across text segments. For example, if the first of two segments ended in "with" and the second began with "out", the output treated them as two separate words. The pieces should instead be joined as "without", which means correcting the prediction list, the delays list, and the elapsed list so that the latency metrics stay accurate.
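The merge described above can be sketched roughly as follows. This is an illustration only, not the code in spm_detokenizer_agent.py: the function name `merge_segment` and its arguments are hypothetical, and it assumes that a merged word takes the delay and elapsed time of the segment that completed it, which is consistent with the delays shown in the expected output.

```python
WORD_START = "\u2581"  # SentencePiece marker that begins a new word


def merge_segment(pieces, delay, elapsed_time, prediction, delays, elapsed):
    """Fold one segment of SentencePiece pieces into the running word lists.

    Hypothetical helper: all three lists are updated in place so that they
    always hold exactly one entry per detokenized word.
    """
    for piece in pieces:
        if piece.startswith(WORD_START) or not prediction:
            # A marked piece (or the very first piece) starts a new word.
            prediction.append(piece.lstrip(WORD_START))
            delays.append(delay)
            elapsed.append(elapsed_time)
        else:
            # An unmarked piece continues the previous word, even if that
            # word was started in an earlier segment ("with" + "out").
            prediction[-1] += piece
            # Assumption: the word only counts as emitted once it is complete.
            delays[-1] = delay
            elapsed[-1] = elapsed_time


prediction, delays, elapsed = [], [], []
segments = [
    (["\u2581Let", "'", "s"], 3),
    (["\u2581do", "\u2581it", "\u2581with"], 6),
    (["out", "\u2581hesitation", "."], 9),
]
for pieces, delay in segments:
    merge_segment(pieces, delay, 0, prediction, delays, elapsed)

print(" ".join(prediction))
print(delays, elapsed)
```

With three pieces per segment, as in the example source sentence, the lists stay the same length as the word count, so "without" occupies a single slot rather than two.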

To run spm_detokenizer_agent.py, use this command from the SimulEval directory:

simuleval \
    --user-dir examples \
    --agent-class examples.quick_start.spm_detokenizer_agent.DummyPipeline \
    --source examples/quick_start/spm_source.txt \
    --target examples/quick_start/spm_target.txt \
    --output tmp_output \
    --segment-k 3 \
    --sentencepiece-model examples/quick_start/tokenizer.model \
    --detokenize-only

This is the expected output for each file:

instances.log

{"index": 0, "prediction": "Let's do it without hesitation.", "delays": [3, 6, 6, 9, 9], "elapsed": [0, 0, 0, 0, 0], "prediction_length": 5, "reference": "Let's do it without hesitation.\n", "source": "\u2581Let ' s \u2581do \u2581it \u2581with out \u2581hesitation .", "source_length": 9}

metrics.tsv

LAAL    AL      AP      DAL
3.3     3.3     0.733   3.96
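
As a sanity check, the AP and AL values above can be recomputed by hand from the delays in instances.log. The sketch below assumes the usual definitions of Average Proportion and Average Lagging; the function names are illustrative and this is not SimulEval's own scorer code.

```python
def average_proportion(delays, source_length):
    """AP: total delay divided by (source length * number of target words)."""
    return sum(delays) / (source_length * len(delays))


def average_lagging(delays, source_length, target_length):
    """AL: mean lag behind an ideal policy, summed up to the first
    target word that was emitted after the full source had been read."""
    if delays[0] > source_length:
        return float(delays[0])
    gamma = target_length / source_length  # target words per source token
    total = 0.0
    tau = 0
    for i, delay in enumerate(delays):
        total += delay - i / gamma
        tau = i + 1
        if delay >= source_length:
            break
    return total / tau


delays = [3, 6, 6, 9, 9]  # from instances.log
source_length = 9         # "source_length" in instances.log
target_length = 5         # words in the reference

ap = average_proportion(delays, source_length)
al = average_lagging(delays, source_length, target_length)
print(round(ap, 3), round(al, 3))
```

Under these assumptions AP is 33 / 45 and AL is 13.2 / 4, which round to the 0.733 and 3.3 reported in metrics.tsv. If "with" and "out" were counted as two words, the delays list would have six entries and both values would change.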

scores.tsv

BLEU    LAAL    AL      AP      DAL     ATD
100.0   3.3     3.3     0.733   3.96    5.0

facebookresearch / SimulEval

Postprocessor Improvement #75