facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

[MMS] ASR inference with multiple audio files logs the outputs in the wrong order #5152

Closed didadida-r closed 1 year ago

didadida-r commented 1 year ago

🐛 Bug

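When MMS ASR inference is run on multiple audio files in one call, the transcripts in the output log appear in the wrong order: each "Output" line is printed under an "Input" path that it does not belong to.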

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run the command:

         python -u examples/mms/asr/infer/mms_infer.py \
             --model pwd/../fairseq_resource/mms1b_all.pt \
             --lang "eng" \
             --audio audio1.wav audio2.wav audio3.wav ... audio10.wav

  2. See the error. The reference text file lists the utterances in this order:

    1089-134686-0000 HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOUR FATTENED SAUCE
    1089-134686-0001 STUFF IT INTO YOU HIS BELLY COUNSELLED HIM
    1089-134686-0002 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
    1089-134686-0003 HELLO BERTIE ANY GOOD IN YOUR MIND
    1089-134686-0004 NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND
    1089-134686-0005 THE MUSIC CAME NEARER AND HE RECALLED THE WORDS THE WORDS OF SHELLEY'S FRAGMENT UPON THE MOON WANDERING COMPANIONLESS PALE FOR WEARINESS
    1089-134686-0006 THE DULL LIGHT FELL MORE FAINTLY UPON THE PAGE WHEREON ANOTHER EQUATION BEGAN TO UNFOLD ITSELF SLOWLY AND TO SPREAD ABROAD ITS WIDENING TAIL
    1089-134686-0007 A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL
    1089-134686-0008 THE CHAOS IN WHICH HIS ARDOUR EXTINGUISHED ITSELF WAS A COLD INDIFFERENT KNOWLEDGE OF HIMSELF
    1089-134686-0009 AT MOST BY AN ALMS GIVEN TO A BEGGAR WHOSE BLESSING HE FLED FROM HE MIGHT HOPE WEARILY TO WIN FOR HIMSELF SOME MEASURE OF ACTUAL GRACE

The MMS inference log, where the outputs are paired with the wrong inputs:

/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0000.wav
/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0001.wav
/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0002.wav
/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0003.wav
/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0004.wav
/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0005.wav
/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0006.wav
/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0007.wav
/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0008.wav
/Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0009.wav
>>> loading model & running inference ...
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0000.wav
Output: at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0001.wav
Output: the dull light fell more faintly upon the page whereon another equation began to unfold itself slowly and to spread abroad its widening tail
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0002.wav
Output: he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flowrfattened sauce
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0003.wav
Output: the music came nearer and he recalled the words the words of shelley's fragment upon the moon wandering companionless pale for weariness
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0004.wav
Output: the chaos in which his ardour extinguished itself was a cold indifferent knowledge of himself
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0005.wav
Output: after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothles
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0006.wav
Output: number ten fresh nelly is waiting on you good-night husband
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0007.wav
Output: a cold lucid indifference reigned in his soul
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0008.wav
Output: stuff it into you his belly counselled him
===============
Input: /Dataset/speech/english/test/libri_test_other/wav/wav/1/wav/1/1089-134686-0009.wav
Output: allo berti any good in your mind
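To make the mismatch concrete, here is a minimal sketch that matches each logged output to its closest reference transcript using `difflib` from the Python standard library. The utterance IDs and texts are copied from the report above; the helper name `closest_reference` is only illustrative and is not part of fairseq.

```python
import difflib

# Reference transcripts from the report, keyed by utterance ID.
references = {
    "1089-134686-0000": "HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOUR FATTENED SAUCE",
    "1089-134686-0001": "STUFF IT INTO YOU HIS BELLY COUNSELLED HIM",
    "1089-134686-0002": "AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS",
    "1089-134686-0003": "HELLO BERTIE ANY GOOD IN YOUR MIND",
    "1089-134686-0004": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND",
    "1089-134686-0005": "THE MUSIC CAME NEARER AND HE RECALLED THE WORDS THE WORDS OF SHELLEY'S FRAGMENT UPON THE MOON WANDERING COMPANIONLESS PALE FOR WEARINESS",
    "1089-134686-0006": "THE DULL LIGHT FELL MORE FAINTLY UPON THE PAGE WHEREON ANOTHER EQUATION BEGAN TO UNFOLD ITSELF SLOWLY AND TO SPREAD ABROAD ITS WIDENING TAIL",
    "1089-134686-0007": "A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL",
    "1089-134686-0008": "THE CHAOS IN WHICH HIS ARDOUR EXTINGUISHED ITSELF WAS A COLD INDIFFERENT KNOWLEDGE OF HIMSELF",
    "1089-134686-0009": "AT MOST BY AN ALMS GIVEN TO A BEGGAR WHOSE BLESSING HE FLED FROM HE MIGHT HOPE WEARILY TO WIN FOR HIMSELF SOME MEASURE OF ACTUAL GRACE",
}

# (input ID shown in the log, output printed under it), copied from the log.
logged = [
    ("1089-134686-0000", "at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace"),
    ("1089-134686-0001", "the dull light fell more faintly upon the page whereon another equation began to unfold itself slowly and to spread abroad its widening tail"),
    ("1089-134686-0002", "he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flowrfattened sauce"),
    ("1089-134686-0003", "the music came nearer and he recalled the words the words of shelley's fragment upon the moon wandering companionless pale for weariness"),
    ("1089-134686-0004", "the chaos in which his ardour extinguished itself was a cold indifferent knowledge of himself"),
    ("1089-134686-0005", "after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothles"),
    ("1089-134686-0006", "number ten fresh nelly is waiting on you good-night husband"),
    ("1089-134686-0007", "a cold lucid indifference reigned in his soul"),
    ("1089-134686-0008", "stuff it into you his belly counselled him"),
    ("1089-134686-0009", "allo berti any good in your mind"),
]


def closest_reference(hypothesis):
    """Return the ID of the reference transcript most similar to the hypothesis."""
    def similarity(utt_id):
        reference = references[utt_id].lower()
        return difflib.SequenceMatcher(None, hypothesis, reference).ratio()
    return max(references, key=similarity)


# Map each ID shown in the log to the ID its output text actually matches.
recovered = {shown_id: closest_reference(text) for shown_id, text in logged}
mismatched = [shown for shown, true in recovered.items() if shown != true]

for shown_id, true_id in recovered.items():
    flag = "ok" if shown_id == true_id else "MISMATCH"
    print(f"{shown_id} -> {true_id}  {flag}")
print(f"{len(mismatched)} of {len(logged)} outputs are attached to the wrong input")
```

On the log above, only 1089-134686-0007 is printed under its own input path; every other output belongs to a different file.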

vineelpratap commented 1 year ago

Hi, this is fixed in https://github.com/facebookresearch/fairseq/pull/5149.
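For readers who hit the same class of bug elsewhere, the sketch below shows one generic way to keep results aligned with inputs: tag every result with the index of its input and put the results back in input order before printing. It is not the code from the linked PR, and it assumes (unconfirmed by this thread) that the decoder processes files in some order other than the one given on the command line. The file names, durations, and the `restore_input_order` helper are all hypothetical.

```python
def restore_input_order(results, num_inputs):
    """Place (original_index, text) pairs back into input order.

    `results` may arrive in any order; each pair carries the index of the
    input it was produced from.
    """
    ordered = [None] * num_inputs
    for original_index, text in results:
        ordered[original_index] = text
    return ordered


# Hypothetical inputs and made-up durations, for illustration only.
inputs = ["a.wav", "b.wav", "c.wav"]
durations = [2.0, 9.5, 4.1]

# Simulate a decoder that handles the longest file first (an assumption).
processing_order = sorted(range(len(inputs)), key=lambda i: -durations[i])
results = [(i, f"transcript of {inputs[i]}") for i in processing_order]

# Pairing inputs with results by position would mislabel them here;
# restoring by the carried index keeps each transcript with its file.
ordered = restore_input_order(results, len(inputs))
for path, text in zip(inputs, ordered):
    print(f"Input: {path}\nOutput: {text}")
```

Until a checkout includes the fix, running the script on one audio file per invocation also sidesteps the pairing problem, at the cost of reloading the model each time.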