The problem I had with m4t_evaluate is that it expects different inputs from m4t_finetune and from what m4t_prepare_dataset generates: m4t_evaluate expects a TSV file plus directories, while the other two work with a manifest JSON file.
I have changed m4t_evaluate to accept the same input as the other two, i.e. a manifest JSON file.
Summary
The --data_file input can now be either a TSV path or a manifest JSON path.
--output_path is now a required argument, since the CLI fails without it.
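As a rough illustration of the dispatch idea (not the actual implementation in this PR), loading samples from either a TSV file or a JSON-lines manifest could look like the sketch below. The field names (source/audio_local_path, target/text) and the TSV column names are placeholders I made up, not necessarily the schema m4t_prepare_dataset writes.

```python
# Minimal sketch, assuming a JSON-lines manifest and a headered TSV.
# Key/column names below are illustrative placeholders only.
import csv
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_samples(data_file: Path) -> Iterator[Dict[str, str]]:
    if data_file.suffix == ".json":
        # Manifest case: one JSON object per line.
        with data_file.open() as f:
            for line in f:
                record = json.loads(line)
                yield {
                    "audio": record["source"]["audio_local_path"],  # assumed key
                    "text": record["target"]["text"],               # assumed key
                }
    else:
        # Legacy TSV case: tab-separated columns with a header row.
        with data_file.open() as f:
            for row in csv.DictReader(f, delimiter="\t"):
                yield {"audio": row["audio"], "text": row["raw_target_text"]}  # assumed columns
```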
Here is a sample run with just 3 lines in a manifest JSON file:
▶ m4t_evaluate --data_file build/fleurs/test_manifest_1.json --task ASR --tgt_lang eng --output_path build/fleurs
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached checkpoint of seamlessM4T_v2_large. Set `force` to `True` to download again.
2024-03-27 03:11:10,791 INFO -- seamless_communication.cli.m4t.evaluate.evaluate: text_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2024-03-27 03:11:10,797 INFO -- seamless_communication.cli.m4t.evaluate.evaluate: unit_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2024-03-27 03:11:10,797 INFO -- seamless_communication.cli.m4t.evaluate.evaluate: unit_generation_ngram_filtering=False
2024-03-27 03:11:10,801 INFO -- seamless_communication.cli.m4t.evaluate.evaluate: Running inference on device=device(type='cpu') with dtype=torch.float32, ctx.batch_size=4.
3it [00:26, 8.73s/it]
2024-03-27 03:11:37,123 INFO -- seamless_communication.cli.m4t.evaluate.evaluate: Processed 3 samples
2024-03-27 03:11:37,156 INFO -- seamless_communication.cli.eval_utils.compute_metrics: ASR : {
"name": "WER",
"score": 0.16666666666666666,
"signature": "wer is 0.16666666666666666"
}
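For context, the reported WER is word-level edit distance divided by the number of reference words, so a score of 0.1666… corresponds to, e.g., one error per six reference words. A made-up illustration using the jiwer package (not necessarily what seamless_communication.cli.eval_utils.compute_metrics calls internally):

```python
# Hedged illustration of how a WER of ~0.1667 can arise.
import jiwer

reference = "the quick brown fox jumps high"    # 6 reference words (made-up example)
hypothesis = "the quick brown fox jumped high"  # 1 substitution

print(jiwer.wer(reference, hypothesis))  # -> 0.16666666666666666
```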