facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

wav2vec 2.0 - provide example dataset line in data prep scripts #2819

Closed kwasnydam closed 2 years ago

kwasnydam commented 3 years ago

🚀 Feature Request

Provide an example line formatted the way the training/finetuning scripts expect it. For example:

file_a
file_a_line_format

file_b
file_b_line_format

Motivation

Right now you have to read through the whole data preparation scripts to work out the expected format, and they only work for LibriSpeech anyway: https://github.com/pytorch/fairseq/tree/master/examples/wav2vec. I still don't understand which files are necessary to run the fine-tuning and what each file looks like.

Pitch

Adding a few lines of documentation to each preprocessing script, or to the README, shouldn't be too hard, I imagine.

wahyubram82 commented 3 years ago

Yes, it's true: it's hard to reproduce a similar process with another dataset, because there are no clues at all.

I have already spent two weeks learning how to do it and still cannot figure out the whole process...

What I know so far:

  1. Create a new dataset: just many wav files, 16 kHz, 1 channel (mono), 2-15 s long. I downloaded audio from YouTube (I used ytmp3 to get clips in my language, Indonesian), then used Python to convert it and chunk it on silence with a set threshold (a rough sketch of this step is at the end of this comment). I also added Mozilla Common Voice (converted to 16 kHz, 1-channel wav), but that is just an 8-hour speech dataset and it is still not clear how many hours are needed for a good result. For someone doing this as a hobby, building a 58k-hour dataset stops being fun; one week of hard work only produced a 68-hour dataset.
  2. Use wav2vec_manifest.py to create the tsv manifest files that point to where the wav files are.
  3. Train it to create a pre-trained model.
  4. Fine-tuning step: here the problems start. For fine-tuning I got the clue to create the extra label files from libri_labels.py, and up to there we still have a tutorial. But training/decoding with a KenLM 4-gram language model is still unclear:

From the KenLM module we can still get a clue about how to create the binary LM file, but for fine-tuning/decoding we must also use several files:

  1. A letter dictionary: how do I make dict.ltr.txt? LibriSpeech doesn't cover my language (Indonesian).
  2. A lexicon file: how do I make it? It looks similar to a Kaldi lexicon, I think; did you use Kaldi to create the lexicon file?
  3. There is an --lm-weights option, but it is based on LibriSpeech; if I use my own dataset, how do I know the LM weights? This is where I get stuck. A step-by-step tutorial on how to reproduce the required files and the data format, like @kwasnydam said, is needed...
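
For reference, the convert-and-chunk step from point 1 can be sketched roughly like this, e.g. with pydub (the file names, silence threshold and minimum silence length are placeholders, not tuned values; pydub needs ffmpeg to read mp3):

import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

os.makedirs("chunks", exist_ok=True)

audio = AudioSegment.from_file("downloaded.mp3")        # placeholder input file
audio = audio.set_frame_rate(16000).set_channels(1)     # 16 kHz, mono

chunks = split_on_silence(audio, min_silence_len=500, silence_thresh=-40, keep_silence=200)
for i, chunk in enumerate(chunks):
    if 2000 <= len(chunk) <= 15000:                      # keep 2-15 s pieces; len() is in ms
        chunk.export(f"chunks/utt_{i:05d}.wav", format="wav")
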
amant555 commented 3 years ago

Hey, you can refer to this page for dict.ltr.txt, the lexicon and KenLM. Note: on that page dict.ltr.txt is referred to as token.txt.
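
Roughly speaking, the lexicon just maps every word to its letter spelling ending with |. A minimal sketch of generating one from a plain word list (words.txt and lexicon.lst are placeholders, one word per line; check the wav2letter data preparation page for the exact format your decoder expects):

with open("words.txt", encoding="utf-8") as fin, open("lexicon.lst", "w", encoding="utf-8") as fout:
    for line in fin:
        word = line.strip().lower()
        if not word:
            continue
        spelling = " ".join(word) + " |"   # e.g. "hello" -> "h e l l o |"
        fout.write(f"{word}\t{spelling}\n")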

And yes, you need a tsv file for both train and valid, and you need to create .ltr and .wrd files in the format used by the model to start fine-tuning. You can refer to libri_labels.py for that.
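
Roughly, what that label preparation boils down to (a sketch, not the exact libri_labels.py code; paths are placeholders): each transcription goes into the .wrd file as-is, and the matching .ltr line is the same text with spaces replaced by | and every character separated by a space, ending with |.

with open("train.wrd", encoding="utf-8") as wrd, open("train.ltr", "w", encoding="utf-8") as ltr:
    for line in wrd:
        text = line.strip()
        # "do i knock" -> "d o | i | k n o c k |"
        ltr.write(" ".join(text.replace(" ", "|")) + " |\n")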

kwasnydam commented 3 years ago

I mean, that's exactly what I'm talking about. Instead of having users read through the whole libri_labels.py script, which is written so that the necessary files are generated automatically for that specific dataset, just provide these few lines in the wav2vec 2.0 README or scatter them across the relevant scripts (correct me if I got the format wrong, because not everything has been working correctly for me yet):

To run the audio pretraining, prepare the `train.tsv` and `valid.tsv` files, containing lines:
<path/to/the/audio/file/in/supported/format>\t<audio_len_in_samples>
You can use the wav2vec_manifest.py script to get tsv files for an arbitrary dataset. 
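For example (if I understand the script correctly), an invocation and the resulting layout would look roughly like this; the paths and the validation fraction are placeholders, the first line of each tsv is the root directory, and every following line is a relative path plus the length in samples, separated by a tab:

python examples/wav2vec/wav2vec_manifest.py /data/my_corpus --dest /data/manifest --ext wav --valid-percent 0.05

and the generated train.tsv then contains, for example:

/data/my_corpus
speaker1/utt001.wav	52480
speaker1/utt002.wav	118400
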
To run the CTC fine-tuning, prepare the following files: dict.ltr.txt (check example here: <link>), {train,valid}.wrd, {train,valid}.ltr
The .wrd files should contain the transcriptions corresponding to the files listed in {train,valid}.tsv, in the same order.
The .ltr files should contain the lines from the .wrd files with each grapheme separated by a space, words separated by '|', and a '|' at the end of each sentence.

example .wrd file

i didn't know shepherds knew how to read said a girl's voice behind him
do i knock or something
it's extremely suspicious that there's no information about brains that didn't come from a brain
that was his work

example .ltr file
i | d i d n ' t | k n o w | s h e p h e r d s | k n e w | h o w | t o | r e a d | s a i d | a | g i r l ' s | v o i c e | b e h i n d | h i m |
d o | i | k n o c k | o r | s o m e t h i n g |
i t ' s | e x t r e m e l y | s u s p i c i o u s | t h a t | t h e r e ' s | n o | i n f o r m a t i o n | a b o u t | b r a i n s | t h a t | d i d n ' t | c o m e | f r o m | a | b r a i n |
t h a t | w a s | h i s | w o r k |

To decode with LM, check: https://github.com/facebookresearch/wav2letter/wiki/Data-Preparation
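
For the KenLM part, the usual steps to build the 4-gram model are roughly (assuming KenLM is compiled; corpus.txt is a placeholder text corpus with one normalized sentence per line):

lmplz -o 4 < corpus.txt > 4gram.arpa
build_binary 4gram.arpa 4gram.bin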

wahyubram82 commented 3 years ago

> Hey, you can refer to this page for dict.ltr.txt, the lexicon and KenLM. Note: on that page dict.ltr.txt is referred to as token.txt.
>
> And yes, you need a tsv file for both train and valid, and you need to create .ltr and .wrd files in the format used by the model to start fine-tuning. You can refer to libri_labels.py for that.

Thanks bro, I never knew that before.

By the way, there is still a looong way to go. Should I ask here if I run into problems in further steps? Won't that fill up the issue section and make it hard to find similar problems?

So we could minimize the questions by making a step-by-step tutorial for this.

For example (asking again already): the dict.ltr.txt file contains numbers, and nothing here explains what they are or how to get them. Are they just the character counts from the .wrd or .ltr file, or the sum of the character counts from both files, or something else entirely? It's a small question, but it takes time to investigate what it actually is...

| 94802
E 51860
T 38431
A 33152
O 31495
N 28855
I 28794
H 27187
S 26071
R 23546
D 18289
L 16308
U 12400
M 10685
W 10317
C 9844
F 9062
G 8924
Y 8226
P 6890
B 6339
V 3936
K 3456
' 1023
X 636
J 598
Q 437
Z 213
guiruli08650129 commented 3 years ago

Hi, this issue #2514 may help you :)
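
In short, the numbers there appear to be just occurrence counts of each letter token in the training labels, and as discussed in #2514 the exact values don't really matter. A rough sketch of producing such a file from a .ltr file (paths are placeholders):

from collections import Counter

counts = Counter()
with open("train.ltr", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())             # tokens are single letters and "|"

with open("dict.ltr.txt", "w", encoding="utf-8") as out:
    for token, count in counts.most_common():   # most frequent first
        out.write(f"{token} {count}\n")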

kwasnydam commented 3 years ago

By the way, I kinda managed to put everything in place and I have working inference and training pipelines; maybe I will put up a pull request with the relevant docs to help future users.

One new thing that bothers me is that dict.ltr.txt is also looked up in the dataset directory at inference time. That cost me an hour of debugging, because my inference directory contained a dict file compatible with a previous version of the model, not the one I was testing (the order of the labels was wrong). So, first of all, during inference (examples/speech_recognition/infer.py) the code should look for the labels file in the --path directory, not in the dataset directory, and dict.ltr.txt should be copied to the model save directory right at the beginning of training, since it is inherently part of the model and the model won't work correctly unless a properly ordered label file with all the labels is supplied. Secondly, if, as mentioned in https://github.com/pytorch/fairseq/issues/2514, the count does not really matter, then I think it would be more intuitive to provide the dictionary in ASCII/UTF-8 order; that way it stays the same across datasets instead of changing whenever the counts differ.
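
To make the ordering suggestion concrete, a tiny sketch (this only makes sense when creating the dictionary fresh before training, since reordering the labels of an already-trained model breaks it; file names are placeholders):

with open("dict.ltr.txt", encoding="utf-8") as f:
    entries = [line.split() for line in f if line.strip()]

with open("dict.ltr.sorted.txt", "w", encoding="utf-8") as out:
    for token, count in sorted(entries, key=lambda e: e[0]):  # code-point order
        out.write(f"{token} {count}\n")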

hongthana commented 3 years ago

> By the way, I kinda managed to put everything in place and I have working inference and training pipelines; maybe I will put up a pull request with the relevant docs to help future users. [...]

As a new researcher in this field, it would be great if you could put everything in one place. I've been struggling with wav2vec for a couple of weeks now.

Thank you in advance.

kwasnydam commented 3 years ago

> As a new researcher in this field, it would be great if you could put everything in one place. I've been struggling with wav2vec for a couple of weeks now. Thank you in advance.

Hey, for starters I think the person in this answer got a lot of things right, especially when it comes to proper dataset formatting, so I think you could try starting here: https://github.com/pytorch/fairseq/issues/2493#issuecomment-719915281

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!