Pre- and post-processing text in Simuleval

I am playing with the MMA-hard model to replicate WMT15 DE-EN experiments reported in the paper and my question is about preprocessing and postprocessing data. The paper says that:

For each dataset, we apply tokenization with the Moses (Koehn et al., 2007) tokenizer and preserve casing. We apply byte pair encoding (BPE) (Sennrich et al., 2016) jointly on the source and target to construct a shared vocabulary with 32K symbols

Following what is said above, I applied moses scripts to tokenize raw files and applied BPE to the tokenized files. Then, tokenized and BPE applied train, val and test files were binarized using following fairseq preprocess command:

fairseq-preprocess --source-lang de --target-lang en \
    --trainpref ~/wmt15_de_en_32k/train --validpref ~/wmt15_de_en_32k/valid --testpref ~/wmt15_de_en_32k/test \
    --destdir ~/wmt15_de_en_32k/data-bin/ \
    --workers 20

Afer that, I trained a MMA-hard model using the binarized data. Now, I would like to evaluate (w.r.t. Latency and Bleu) a checkpoint using SimulEval. My first question is about the file format: Which format should I provide the test files as --source and --target to simuleval command? There are three options as far as I can see:

Using Raw files.
Using tokenized files
Using tokenized and bpe applied files.

I am following EN-JA waitk model's agent file to understand what should be done. However, the difference between the experiment I'd like to replicate and EN-JA experiment is that in EN-JA sentencepiece model is used for tokenization whereas in my case moses is used and also bpe is applied.

So, I tried following:

I provided path of TOKENIZED files as --source and --target to simuleval. Also, I've implemented segment_to_units and build_word_splitter functions as follows but I couldn't figure out how I should implement units_to_segment.

I tried to test this implementation as follows:

$ head -n 1 ~/wmt15_de_en_32k/tmp/test.de
Die Premierminister Indiens und Japans trafen sich in Tokio .
$ head -n 1 ~/wmt15_de_en_32k/tmp/test.en
India and Japan prime ministers meet in Tokyo

simuleval --agent mma-dummy/mmaAgent.py --source ~/wmt15_de_en_32k/tmp/test.de  \
--target  ~/wmt15_de_en_32k/tmp/test.en  --data-bin ~/wmt15_de_en_32k/data-bin/  \
--model-path ~/checkpoints/checkpoint_best.pt --bpe_code ~/wmt15_de_en_32k/code

So, my questions are:

Is it correct to provide tokenized but not bpe applied test files as --source and --target to simuleval?
Do implementations of segment_to_units and build_word_splitter functions seem correct?
Could you please explain how units_to_segment and update_states_write should be implemented?

Edit: When I evaluate the best checkpoint on a subset of test-set using the above code I got the following output:

2021-09-19 22:10:08 | WARNING | sacrebleu | That's 100 lines that end in a tokenized period ('.') 2021-09-19 22:10:08 | WARNING | sacrebleu | It looks like you forgot to detokenize your test data, which may hurt your score. 2021-09-19 22:10:08 | WARNING | sacrebleu | If you insist your data is detokenized, or don't care, you can suppress this message with '--force'. 2021-09-19 22:10:08 | INFO | simuleval.cli | Evaluation results: { "Quality": { "BLEU": 6.068334932433579 }, "Latency": { "AL": 7.8185020314753055, "AP": 0.833324143320322, "DAL": 11.775593814849854 } }

Hey @xutaima I kindly remind you that I'm looking for your help on this issue. The bleu score I got on a subset of wmt15 de-en test set is significantly worse than what is reported on the paper.

@xutaima could you please let ppl know if you are interested in making this code base work (and if so, when)? I'd like to indicate that we need your help because both MMA and simuleval have been mainly developed by you and people are having reeealy difficult time to make them run ;)

@kurtisxx While I sincerely appreciate your strong interests in our work and kind reminders, I don't think you really have to remind me in this frequency. I would do my best to reply as soon as possible.

First of all, I do apologize the code and document for MMA, especially on text-to-text part is out of date. These years we are shifting our focus more on speech-to-text simultaneous translation. Also the idea of SimulEval came after the MMA paper, that we would like to have a generic framework of evaluating simultaneous translation models, where in MMA paper the evaluation is ad hoc. We haven't managed to finish all the updates but been working on it.

As to your questions

Is it correct to provide tokenized but not bpe applied test files as --source and --target to simuleval? SimulEval assumes both --source and --target are raw files. The reason is that tokenizer can be different in different work. We suggest the latency evaluated on detokenzied text. (However notice that the latency in MMA paper is reported based on bpe tokens, so the same as other previous works, which is not the best way to evalaute latency)
Do implementations of segment_to_units and build_word_splitter functions seem correct? The tokenization and subword splitting is supposed to be done in agent segment_to_units function. I see you already have the BPE (word splitter) so probably also considering apply the tokenizer here.
Could you please explain how units_to_segment and update_states_write should be implemented? units_to_segment on the other hand is to merge subwords and detokenize . units_to_segment function assumes a detokenized word is returned (not only de-bpe-d). update_states_write is a hook function after a prediction is sent to server, it could be useful when there is queue to store the subwords and clear the queue after merging the subwords.
As to the performance A subset of a test set can be biased. I could suggest using full set score. Meanwhile the performance can be affected by a lot of factors, such as data preparing, training hyperparamters. I was wondering if you could provide the training log so I could look deeper into it.

Thank you for your reply, @xutaima.

As to the performance A subset of a test set can be biased. I could suggest using full set score. Meanwhile the performance can be affected by a lot of factors, such as data preparing, training hyperparamters. I was wondering if you could provide the training log so I could look deeper into it.

Sure, here is my training log:

INFO | fairseq_cli.train | Stopping training due to num_updates: 50000 >= max_update: 50000
INFO | fairseq_cli.train | begin validation on "valid" subset
INFO | valid | epoch 002 | valid on 'valid' subset | loss 4.989 | nll_loss 3.355 | ppl 10.23 | wps 35099.2 | wpb 2940.6 | bsz 98.1 | num_updates 50000 | best_loss 4.989
INFO | fairseq.checkpoint_utils | Preparing to save checkpoint for epoch 2 @ 50000 updates
INFO | fairseq.trainer | Saving checkpoint to checkpoints/checkpoint_best.pt
INFO | fairseq.trainer | Finished saving checkpoint to checkpoints/checkpoint_best.pt
INFO | fairseq.checkpoint_utils | Saved checkpoint checkpoints/checkpoint_best.pt (epoch 2 @ 50000 updates, score 4.989) (writing took 15.849505523998232 seconds)
INFO | fairseq_cli.train | end of epoch 2 (average epoch stats below)                                                              
INFO | train | epoch 002 | loss 5.263 | nll_loss 3.765 | ppl 13.6 | wps 7662.2 | ups 2.43 | wpb 3150.1 | bsz 104.9 | num_updates 50000 | lr 0.000141421 | gnorm 1.568 | train_wall 4984 | gb_free 11.7 | wall 20474
INFO | fairseq_cli.train | done training in 20472.2 seconds

I have run the training command given for MMA-Hard model in [README (https://github.com/pytorch/fairseq/blob/master/examples/simultaneous_translation/docs/ende-mma.md).

When I evaluate the best_checkpoint of this training on the full test set, I got the following scores:


    "Quality": {
        "BLEU": 6.223767659491562
    },
    "Latency": {
        "AL": 8.13375314221893,
        "AP": 0.838396645474401,
        "DAL": 12.109430727072716
    }

What is more interesting is its performance on a subset from training set. After evaluating on the test set, I've also created a subset with 200 examples from training set on which I evaluated the model:

Warning: these hypothesis don't have EOS in predictions
21
{
    "Quality": {
        "BLEU": 8.420072093969006
    },
    "Latency": {
        "AL": 8.171652655601502,
        "AP": 0.7786643338203431,
        "DAL": 13.193016805648803
    }
}

My comments to your answers above:

The tokenization and subword splitting is supposed to be done in agent segment_to_units function. I see you already have the BPE (word splitter) so probably also considering apply the tokenizer here.

I didn't quite understand how tokenization can be done in segment_to_units function. I was thinking that segment is only a word/token not the full sentence? Do you mean running a tokenizer like moses on a word level? Does that make sense? Also, this en-ja code here says that segment_to_words # Split a full word (segment) into subwords (units)?

units_to_segment on the other hand is to merge subwords and detokenize . units_to_segment function assumes a detokenized word is returned (not only de-bpe-d). update_states_write is a hook function after a prediction is sent to server, it could be useful when there is queue to store the subwords and clear the queue after merging the subwords.

As you can see above, I am merging subwords and clearing queue after merging subwords in the units_to_segment function not in update_states_write. Is this okay?

What's you hardware setup? Can you share the full log including the configurations? There's only 2 epochs which doesn't look normal to me.

As to tokenizer, yes it makes sense on word level. In real time translation there is no full source sentence. In en-ja example, we use sentencepiecemodel which serves both word splitter and tokenizer, so they could be done in one pass.

units_to_segment looks good to me on merging subwords, but again detokenization should also be considered. It would be helpful if you could share full inference log to make sure the server received the correct prediction.

Thanks for you reply, @xutaima! I appreciate that! I used a single v100 GPU for training and my operating system is Linux. I think the training stops in the middle of second epoch because --max-iter is set to 50000 as instructed in the readme file. I used following command to start training:

fairseq-train ./data-bin/wmt15_de_en     --simul-type hard_aligned  \
--mass-preservation     --criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-var  0.1     --max-update 50000     --arch transformer_monotonic_iwslt_de_en  \
--optimizer adam --adam-betas '(0.9, 0.98)'     --lr-scheduler 'inverse_sqrt'     --warmup-init-lr 1e-7  \
--warmup-updates 4000     --lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001    \
--dropout 0.3     --label-smoothing 0.1    --max-tokens 3584

Aren't those correct parameters to Train MMA-Hard model on de-en data?

Unfortunately, I can't find my training log at the moment but I started another run just to double check everything. It must be finished in a couple of hours. I'll posted the full training log, inference time log and scores here in a couple of hours once it finishes.

Here is the link for the log of test set evaluation I mentioned above:

instances.log

Training has not finished yet, so I need to wait a bit more to share output but here is the training parameters:

2021-09-20 19:23:51 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
2021-09-20 19:23:53 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 
'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 
'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False,
 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0,
 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries':
 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384,
 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False,
 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 
'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 1,
 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None,
 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none',
 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'gradient_as_bucket_view': False,
 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD',
 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 
'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None,
 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': False,
 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 
'use_sharded_state': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False,
 'max_tokens': 3584, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 
'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None,
 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0,
 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 3584, 'batch_size_valid': None,
 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None,
 'max_epoch': 0, 'max_update': 75000, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 
'stop_min_lr': 1e-09, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 
'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 
'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1,
 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False,
 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 
'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1,
 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1}, 'bmuf':{'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False,
 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0,
 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 
'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0,
 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 
'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None,
 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10,
 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 
'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None,
 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False,
 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'},
 'model': Namespace(_name='transformer_monotonic_iwslt_de_en', activation_dropout=0.0, activation_fn='relu',
 adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None,
 adaptive_softmax_dropout=0.0, adaptive_softmax_factor=4, all_gather_list_size=16384, amp=False, amp_batch_retries=2,
 amp_init_scale=128, amp_scale_window=None, arch='transformer_monotonic_iwslt_de_en', attention_dropout=0.0,
 attention_eps=1e-06, azureml_logging=False, base_layers=0, base_shuffle=1, base_sublayers=1, batch_size=None,
 batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False,
 bucket_cap_mb=25, char_inputs=False, checkpoint_activations=False, checkpoint_shard_count=1, checkpoint_suffix='',
 clip_norm=0.0, combine_valid_subsets=None, cpu=False, cpu_offload=False,
 criterion='latency_augmented_label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0,
 data='/home/new_data_bin/wmt15.de-en/', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp',
 ddp_comm_hook='none', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None,
 decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6,
 decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512,
 device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None,
 distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1,
 dropout=0.3, ema_decay=0.9999, ema_fp32=False, ema_seed_model=None, ema_start_update=0, ema_update_freq=1,
 empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None,
 encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None,
 encoder_learned_pos=False, encoder_normalize_before=False, encoder_unidirectional=False, energy_bias=False,
 energy_bias_init=-2.0, eos=2, eval_bleu=False, eval_bleu_args='{}', eval_bleu_detok='space', eval_bleu_detok_args='{}',
 eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, export=False,
 fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False,
 fixed_validation_seed=None, fp16=False, fp16_adam_stats=False, fp16_init_scale=128, fp16_no_flatten_grads=False,
 fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, gen_subset='test',
 gradient_as_bucket_view=False, heartbeat_timeout=-1, ignore_prefix_size=0, ignore_unused_valid_subsets=False,
 keep_best_checkpoints=-1, keep_interval_updates=-1, keep_interval_updates_pattern=-1, keep_last_epochs=-1,
 label_smoothing=0.1, latency_avg_type='differentiable_average_lagging', latency_avg_weight=0.0,
 latency_gather_method='weighted_average', latency_update_after=0, latency_var_type='variance_delay',
 latency_var_weight=0.1, layernorm_embedding=False, left_pad_source=False, left_pad_target=False,
 load_alignments=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format=None,
 log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', mass_preservation=True, max_epoch=0,
 max_source_positions=1024, max_target_positions=1024, max_tokens=3584, max_tokens_valid=3584,
 max_update=75000, max_valid_steps=None, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False,
 memory_efficient_fp16=False, min_loss_scale=0.0001, min_params_to_wrap=100000000, model_parallel_size=1,
 no_cross_attention=False, no_decoder_final_norm=False, no_epoch_checkpoints=False, no_last_checkpoints=False,
 no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False,
 no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, noise_mean=0.0,
 noise_type='flat', noise_var=1.0, nprocs_per_node=1, num_batch_buckets=0, num_shards=1, num_workers=1,
 offload_activations=False, on_cpu_convert_precision=False, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1,
 pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None,
 pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None,
 pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', profile=False,
 quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None,
 relu_dropout=0.0, report_accuracy=False, required_batch_size_multiple=8, required_seq_len_multiple=1,
 reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False,
 restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, scoring='bleu', seed=1,
 sentence_avg=False, shard_id=0, share_all_embeddings=False, share_decoder_input_output_embed=False,
 simul_type='hard_aligned', skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD',
 slowmo_momentum=None, source_lang=None, stop_min_lr=1e-09, stop_time_hours=0, store_ema=False,
 suppress_crashes=False, target_lang=None, task='translation', tensorboard_logdir=None, threshold_loss_scale=None,
 tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tpu=False, train_subset='train',
 truncate_source=False, unk=3, update_freq=[1], upsample_primary=-1, use_bmuf=False, use_old_adam=False,
 use_plasma_view=False, use_sharded_state=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=4000,
 weight_decay=0.0001, write_checkpoints_asynchronously=False, zero_sharding='none'), 'task': {'_name': 'translation',
 'data': '/home/new_data_bin/wmt15.de-en/', 'source_lang': None, 'target_lang': None, 'load_alignments': False,
 'left_pad_source': False, 'left_pad_target': False, 'max_source_positions': 1024, 'max_target_positions': 1024,
 'upsample_primary': -1, 'truncate_source': False, 'num_batch_buckets': 0, 'train_subset': 'train', 'dataset_impl': None,
 'required_seq_len_multiple': 1, 'eval_bleu': False, 'eval_bleu_args': '{}', 'eval_bleu_detok': 'space', 'eval_bleu_detok_args':
 '{}', 'eval_tokenized_bleu': False, 'eval_bleu_remove_bpe': None, 'eval_bleu_print_samples': False}, 'criterion': {'_name':
 'latency_augmented_label_smoothed_cross_entropy', 'label_smoothing': 0.1, 'report_accuracy': False, 'ignore_prefix_size':
 0, 'sentence_avg': False, 'latency_avg_weight': 0.0, 'latency_var_weight': 0.1, 'latency_avg_type':
 'differentiable_average_lagging', 'latency_var_type': 'variance_delay', 'latency_gather_method': 'weighted_average',
 'latency_update_after': 0}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-08, 'weight_decay':
 0.0001, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt',
 'warmup_updates': 4000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe':
 None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0,
 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}, 'simul_type': 'hard_aligned'}

2021-09-20 19:23:53 | INFO | fairseq.tasks.translation | [de] dictionary: 34880 types
2021-09-20 19:23:53 | INFO | fairseq.tasks.translation | [en] dictionary: 33496 types
2021-09-20 19:23:55 | INFO | fairseq_cli.train | TransformerModelSimulTrans(
  (encoder): TransformerMonotonicEncoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(34880, 512, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerMonotonicEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerMonotonicEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerMonotonicEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerMonotonicEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerMonotonicEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerMonotonicEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (decoder): TransformerMonotonicDecoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(33496, 512, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerMonotonicDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MonotonicAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerMonotonicDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MonotonicAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerMonotonicDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MonotonicAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (3): TransformerMonotonicDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MonotonicAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerMonotonicDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MonotonicAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (5): TransformerMonotonicDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MonotonicAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
    )
    (output_projection): Linear(in_features=512, out_features=33496, bias=False)
  )
)
2021-09-20 19:23:55 | INFO | fairseq_cli.train | task: TranslationTask
2021-09-20 19:23:55 | INFO | fairseq_cli.train | model: TransformerModelSimulTrans
2021-09-20 19:23:55 | INFO | fairseq_cli.train | criterion: LatencyAugmentedLabelSmoothedCrossEntropyCriterion
2021-09-20 19:23:55 | INFO | fairseq_cli.train | num. shared model params: 96,296,960 (num. trained: 96,296,960)
2021-09-20 19:23:55 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
2021-09-20 19:23:55 | INFO | fairseq.data.data_utils | loaded 40,006 examples from: /home/new_data_bin/wmt15.de-en/valid.de-en.de
2021-09-20 19:23:55 | INFO | fairseq.data.data_utils | loaded 40,006 examples from: /home/new_data_bin/wmt15.de-en/valid.de-en.en
2021-09-20 19:23:55 | INFO | fairseq.tasks.translation | /home/new_data_bin/wmt15.de-en/ valid de-en 40006 examples
2021-09-20 19:23:59 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-09-20 19:23:59 | INFO | fairseq.utils | rank   0: capabilities =  7.0  ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB                    
2021-09-20 19:23:59 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-09-20 19:23:59 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-09-20 19:23:59 | INFO | fairseq_cli.train | max tokens per device = 3584 and max sentences per device = None
2021-09-20 19:23:59 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/checkpoint_last.pt
2021-09-20 19:23:59 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/checkpoint_last.pt
2021-09-20 19:23:59 | INFO | fairseq.trainer | loading train data for epoch 1
2021-09-20 19:23:59 | INFO | fairseq.data.data_utils | loaded 3,957,344 examples from: /home/new_data_bin/wmt15.de-en/train.de-en.de
2021-09-20 19:23:59 | INFO | fairseq.data.data_utils | loaded 3,957,344 examples from: /home/new_data_bin/wmt15.de-en/train.de-en.en
2021-09-20 19:23:59 | INFO | fairseq.tasks.translation | /home/new_data_bin/wmt15.de-en/ train de-en 3957344 examples
2021-09-20 19:24:01 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 or --amp
epoch 001:   0%|                                                                                                                | 0/37614 [00:00<?, ?it/s]2021-09-20 19:24:01 | INFO | fairseq.trainer | begin training epoch 1
2021-09-20 19:24:01 | INFO | fairseq_cli.train | Start iterating over samples
/home/fairseq/fairseq/utils.py:373: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  "amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
epoch 001:  40%|▍| 15139/37614 [1:42:57<2:30:42,  2.49it/s, loss=5.783, nll_loss=4.379, latency=29.218, delays_var=58.819, latency_loss=0, ppl=20.81, wps=

Hi thanks for sharing the command. Sorry we do have some typos in the documentation. I would suggest you run the following command

fairseq-train ./data-bin/wmt15_de_en     
    --simul-type hard_aligned  \
    --mass-preservation    \
    --criterion latency_augmented_label_smoothed_cross_entropy \
    --latency-weight-var  0.1     \
    --max-update 50000    \
    --arch transformer_monotonic_vaswani_wmt_en_de_big  \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)'   \
    --lr-scheduler 'inverse_sqrt'     \
    --warmup-init-lr 1e-7  \
    --warmup-updates 4000 \
    --lr 5e-4 \
    --stop-min-lr 1e-9  \
    --clip-norm 0.0 \
    --weight-decay 0.0001  \
    --dropout 0.3 \
    --label-smoothing 0.1 \
    --max-tokens 3584 \
    --update-freq 64 # add this option since our model was trained on 64 gpus

There are two modifications. 1. a bigger architecture transformer_monotonic_vaswani_wmt_en_de_big 2. --update-freq 64 to simulate larger batch_size with 64 gpus if you only have one gpu. It probably will take more than a couple of hours (as far as I can recall more than 24hrs on 64 gpus) to finish the training.

As to the inference log, the merging subwords looks good to me, but probably you need detokenization. For instance, in "Prime Minister India and Japan are in Tokyo .", units_to_segment need return "Tokyo." instead of "Tokyo" and "." seperately.

Thanks for your prompt reply! I also have access to a node with 4 GPUS. In that case, should I set update-freq to 16?

Thanks for your prompt reply! I also have access to a node with 4 GPUS. In that case, should I set update-freq to 16?

Yes

I have one more question , @xutaima . I got this message:

NOTE: your device may support faster training with --fp16 or --amp"

Did you use --fp16 or --amp flags to support faster training? I mean, does MMA/fairseq code support using these flags or would you suggest me to not to use?

Not very sure about --amp flag. I would suggest not using --fp16 flag because the cumsum function in MMA is very easy to overflow. We try to come up with some work around but haven't fully tested yet. If you need offline fairseq model I would suggest --fp16 flag.

@kurtisxx It seems that your model is still converging.

I don't have a model on my hand at the moment and we used to have an ad hoc method for the inference rather than simuleval. Sure I can retrain the model on my side and debug what bugs are there.

@xutaima do you remember what was model's validation set perplexity after training completes (Also, how many epochs did it take)?

Sure I can retrain the model on my side and debug what bugs are there.

Thank would be great! Thank you! I'd also really appreciate if you share the training/inference code that you're going to use as well.

@xutaima In addition to training a model on your side, could you please also look at this issue: https://github.com/pytorch/fairseq/pull/3894 . As mentioned in the PR, MMA code in fairseq's master branch doesn't work right now (There are run time errors). I am wondering whether there are more bugs other than this.. Also, I've just seen that on Table-4 in the paper [link] the max token is set to 3584 × 8 × 8 × 2. However, both in the readme and in the command that you suggest me above it is set to 3584. So, do I need to change it?

@xutaima , How did the training go in your side? Could you also please let me know what you think about my previous questions (Let me copy them here just to make things easier):

could you please also look at this issue: pytorch/fairseq#3894 . MMA code in fairseq's master branch doesn't work right now (There are run time errors). I am wondering whether there are more bugs other than this (For instance: https://github.com/pytorch/fairseq/issues/3414)
Also, I've just seen that on Table-4 in the paper [link] the max token is set to 3584 × 8 × 8 × 2. However, both in the readme and in the command that you suggest me above it is set to 3584. So, do I need to change it?
Do you remember what was model's validation set perplexity after training completes (Also, how many epochs did it take)?
Currently, tgt_indices inside the policy function is created as follows (you can find the complete policy function above):

        # encode previous predicted target tokens
        tgt_indices = self.to_device(
            torch.LongTensor(
                [self.model.decoder.dictionary.eos()]
                + [
                    self.dict['tgt'].index(x)
                    for x in states.units.target.value
                    if x is not None
                ]
            ).unsqueeze(0)
        )

        x, outputs = self.model.decoder.forward(
            prev_output_tokens=tgt_indices,
            encoder_out=states.encoder_states,
            incremental_state=states.incremental_states,
        )

Basically, I've copied this part from EN-JA inference time code, as you suggested. However, I didn't understand why eos token from decoder's dictionary is prefixed to the tgt_indices list?

For MMA-Hard model, monotonic_attention_process_infer function takes incremental_state as one of its argument but it is not used anywhere inside this function. Is this a bug or intended?

I've already spent a good amount of time and also money on this. I'm really looking forward to issues being solved as quick as possible and we can replicate the results reported in the paper..

@xutaima, Kindly remind you that I've been waiting an answer from you for 4 days. If you're not interested in fixing bugs in MMA code and also not interested to provide people a code which they can actually use and replicate your results, please let me know. In that case, I can get in touch with co-authors and/or other people in fairseq and ask if there is a way to make this project run.

@kurtisxx thank you for your patience. We're working towards a deadline so we don't have much bandwidth at the moment. In general, this kind of OSS project is provided as is with no guarantee of support. However, we will try to address issues on a best effort basis. Thank you for your understanding.

@kurtisxx thank you for your patience. We're working towards a deadline so we don't have much bandwidth at the moment. In general, this kind of OSS project is provided as is with no guarantee of support. However, we will try to address issues on a best effort basis. Thank you for your understanding.

@jmp84 Thank you for your prompt reply and I wish you good luck with your deadlines. Nevertheless, I'd just want to make something clear. I am not asking you to update the MMA repository to make it compatible with current fairseq. I am okay with any working combination of fairseq & MMA. Unfortunately, I doubt there is such a combination, as I think MMA code was not working even when it was first released because of some functions (e.g. Link) that are called but not defined.

I don't think that it aligns well with footnote-1 in the paper which says "The code is available at https://github.com/pytorch/fairseq/tree/master/examples/simultaneous_translation"

@kurtisxx， I am also a user of FairseqMMA and I would like to ask if you have found a feasible way to reproduce its results

I am playing with the MMA-hard model to replicate WMT15 DE-EN experiments reported in the paper and my question is about preprocessing and postprocessing data. The paper says that:

For each dataset, we apply tokenization with the Moses (Koehn et al., 2007) tokenizer and preserve casing. We apply byte pair encoding (BPE) (Sennrich et al., 2016) jointly on the source and target to construct a shared vocabulary with 32K symbols

Following what is said above, I applied moses scripts to tokenize raw files and applied BPE to the tokenized files. Then, tokenized and BPE applied train, val and test files were binarized using following fairseq preprocess command:
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref ~/wmt15_de_en_32k/train --validpref ~/wmt15_de_en_32k/valid --testpref ~/wmt15_de_en_32k/test \
    --destdir ~/wmt15_de_en_32k/data-bin/ \
    --workers 20
Afer that, I trained a MMA-hard model using the binarized data. Now, I would like to evaluate (w.r.t. Latency and Bleu) a checkpoint using SimulEval. My first question is about the file format: Which format should I provide the test files as --source and --target to simuleval command? There are three options as far as I can see:

Using Raw files.

Using tokenized files

Using tokenized and bpe applied files.

I am following EN-JA waitk model's agent file to understand what should be done. However, the difference between the experiment I'd like to replicate and EN-JA experiment is that in EN-JA sentencepiece model is used for tokenization whereas in my case moses is used and also bpe is applied.

So, I tried following:

I provided path of TOKENIZED files as --source and --target to simuleval. Also, I've implemented segment_to_units and build_word_splitter functions as follows but I couldn't figure out how I should implement units_to_segment.

I tried to test this implementation as follows:
$ head -n 1 ~/wmt15_de_en_32k/tmp/test.de
Die Premierminister Indiens und Japans trafen sich in Tokio .
$ head -n 1 ~/wmt15_de_en_32k/tmp/test.en
India and Japan prime ministers meet in Tokyo

simuleval --agent mma-dummy/mmaAgent.py --source ~/wmt15_de_en_32k/tmp/test.de  \
--target  ~/wmt15_de_en_32k/tmp/test.en  --data-bin ~/wmt15_de_en_32k/data-bin/  \
--model-path ~/checkpoints/checkpoint_best.pt --bpe_code ~/wmt15_de_en_32k/code
So, my questions are:

Is it correct to provide tokenized but not bpe applied test files as --source and --target to simuleval?

Do implementations of segment_to_units and build_word_splitter functions seem correct?

Could you please explain how units_to_segment and update_states_write should be implemented?

Edit: When I evaluate the best checkpoint on a subset of test-set using the above code I got the following output:

2021-09-19 22:10:08 | WARNING | sacrebleu | That's 100 lines that end in a tokenized period ('.') 2021-09-19 22:10:08 | WARNING | sacrebleu | It looks like you forgot to detokenize your test data, which may hurt your score. 2021-09-19 22:10:08 | WARNING | sacrebleu | If you insist your data is detokenized, or don't care, you can suppress this message with '--force'. 2021-09-19 22:10:08 | INFO | simuleval.cli | Evaluation results: { "Quality": { "BLEU": 6.068334932433579 }, "Latency": { "AL": 7.8185020314753055, "AP": 0.833324143320322, "DAL": 11.775593814849854 } }

Hello @kurtisxx , can you share the mma-dummy/mmaAgent.py file? Thank you very much

I don't have a model on my hand at the moment and we used to have an ad hoc method for the inference rather than simuleval. Sure I can retrain the model on my side and debug what bugs are there.

What is the ad hoc method ?

facebookresearch / SimulEval

Pre- and post-processing text in Simuleval #18