Closed kwasnydam closed 2 years ago
Yes, that's true. It's hard to reproduce the process with another dataset because there are no clues at all.
I've already spent two weeks learning how to do it and still can't figure out the whole process.
What I know so far: the KenLM module at least gives some hints on how to create the binary LM file, but for fine-tuning we need several files, like:
Hey, you can refer to this page for `dict.ltr.txt`, the lexicon, and KenLM. Note: on that page `dict.ltr.txt` is referred to as `token.txt`.
And yes, you need a `.tsv` file for both train and valid, and you have to create the `.ltr` and `.wrd` files in the format the model expects before starting fine-tuning. You can refer to `libri_labels.py` for that.
I mean, that's what I am talking about: instead of having users read through the whole `libri_labels.py` script, which is written so that automatic generation of the necessary files works for that specific dataset, just provide these few lines in the README for wav2vec 2.0, or scatter them across the relevant scripts (correct me if I got the format wrong, because not everything has been working correctly for me yet).
To run the audio pretraining, prepare the `train.tsv` and `valid.tsv` files, containing lines:
```
<path/to/the/audio/file/in/supported/format>\t<audio_len_in_samples>
```
You can use the `wav2vec_manifest.py` script to generate the tsv files for an arbitrary dataset.
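For illustration, here is a minimal stdlib-only sketch of such a manifest step, following the `<path>\t<samples>` line format quoted above. Note this is an assumption-laden simplification: the real `wav2vec_manifest.py` also writes the dataset root directory as the first line and reads audio via soundfile, so treat this only as a format illustration.

```python
import os
import wave

def make_manifest(audio_dir, out_tsv, ext=".wav"):
    """Write one '<path>\t<num_samples>' line per audio file found under audio_dir."""
    with open(out_tsv, "w") as out:
        for root, _, files in os.walk(audio_dir):
            for name in sorted(files):
                if not name.endswith(ext):
                    continue
                path = os.path.join(root, name)
                # number of frames == number of samples per channel for PCM wav
                with wave.open(path, "rb") as w:
                    n_samples = w.getnframes()
                out.write(f"{path}\t{n_samples}\n")
```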
To run the CTC fine-tuning, prepare the following files: `dict.ltr.txt` (check the example here: <link>), `{train,valid}.wrd`, and `{train,valid}.ltr`.
The `.wrd` file should contain the transcriptions corresponding to the files listed in `{train,valid}.tsv`, in the same order.
The `.ltr` file should contain the lines from the `.wrd` file where each grapheme is separated by a space, words are separated by `|`, and there is a `|` at the end of each sentence.
example `.wrd` file:
```
i didn't know shepherds knew how to read said a girl's voice behind him
do i knock or something
it's extremely suspicious that there's no information about brains that didn't come from a brain
that was his work
```
example `.ltr` file:
```
i | d i d n ' t | k n o w | s h e p h e r d s | k n e w | h o w | t o | r e a d | s a i d | a | g i r l ' s | v o i c e | b e h i n d | h i m |
d o | i | k n o c k | o r | s o m e t h i n g |
i t ' s | e x t r e m e l y | s u s p i c i o u s | t h a t | t h e r e ' s | n o | i n f o r m a t i o n | a b o u t | b r a i n s | t h a t | d i d n ' t | c o m e | f r o m | a | b r a i n |
t h a t | w a s | h i s | w o r k |
```
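The `.wrd` to `.ltr` conversion described above can be sketched in a few lines (this is a standalone illustration of the format, not the actual `libri_labels.py` code):

```python
def wrd_to_ltr(line):
    # split each word into space-separated graphemes; '|' marks word
    # boundaries, with one trailing '|' at the end of the sentence
    words = line.strip().split()
    return " | ".join(" ".join(w) for w in words) + " |"
```

Applied to the second example line, `wrd_to_ltr("do i knock or something")` produces `d o | i | k n o c k | o r | s o m e t h i n g |`, matching the `.ltr` example above.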
To decode with LM, check: https://github.com/facebookresearch/wav2letter/wiki/Data-Preparation
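LM decoding also needs a lexicon file mapping each word to its grapheme spelling (the exact format is described at the wav2letter link above). A rough sketch of building one from a `.wrd` file, assuming the common `word<TAB>g r a p h e m e s |` layout with the trailing `|` matching the `.ltr` word separator:

```python
def build_lexicon(wrd_path, lex_path):
    """Collect every word from a .wrd file and emit 'word<TAB>g r a p h e m e s |' lines."""
    words = set()
    with open(wrd_path) as f:
        for line in f:
            words.update(line.split())
    with open(lex_path, "w") as f:
        for w in sorted(words):
            f.write(f"{w}\t{' '.join(w)} |\n")
```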
Thanks, bro... I never knew that before.
By the way, there's still a looong way to go. Should I ask here if I run into problems in further steps? Or would that clutter the issue section and make similar problems harder to find?
We could minimize the questions by writing a step-by-step tutorial.
For example (I'm already asking again after a short time...): the `dict.ltr.txt` file contains numbers, and nothing explains them. What are those numbers and how do I get them? Are they the character counts from the `.wrd` file, or from the `.ltr` file, or the sum of the character counts from both files? Or are they not character counts at all, and mean something else?
It's a small question, but it takes time to investigate what it actually is...
```
| 94802
E 51860
T 38431
A 33152
O 31495
N 28855
I 28794
H 27187
S 26071
R 23546
D 18289
L 16308
U 12400
M 10685
W 10317
C 9844
F 9062
G 8924
Y 8226
P 6890
B 6339
V 3936
K 3456
' 1023
X 636
J 598
Q 437
Z 213
```
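For what it's worth, those numbers appear to be per-token frequency counts over the training `.ltr` file (counting graphemes and the `|` separator), and per the discussion in #2514 the exact values don't seem to matter at load time. Under that assumption, a sketch of generating such a file:

```python
from collections import Counter

def build_dict(ltr_path, dict_path):
    # count every whitespace-separated token (graphemes and '|') in the .ltr file
    counts = Counter()
    with open(ltr_path) as f:
        for line in f:
            counts.update(line.split())
    # write tokens most-frequent first, matching the layout shown above
    with open(dict_path, "w") as f:
        for tok, n in counts.most_common():
            f.write(f"{tok} {n}\n")
```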
Hi, this issue #2514 may help you :)
By the way, I kinda managed to put everything in place and I have working inference and training pipelines; maybe I will open a pull request with the relevant docs to help future users.
One new thing that bothers me is that `dict.ltr.txt` is also looked for in the dataset directory at inference time, which cost me an hour of debugging because my inference directory contained a dict file compatible with a previous version of the model, not the one I was testing (the order of the labels was wrong). So first of all, during inference (`examples/speech_recognition/infer.py`), the code should search for the labels file in the `--path` directory, not in the dataset directory, and `dict.ltr.txt` should also be copied to the model save directory right at the beginning of training, since it is inherently a part of the model and the model won't work correctly unless a properly ordered label file with all the labels is supplied. The second thing: if, as mentioned in https://github.com/pytorch/fairseq/issues/2514, the counts do not really matter, then I think it would be more intuitive to provide the dictionary in ASCII/UTF-8 ordering. That way it stays the same across different datasets instead of changing order whenever the counts differ.
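The "copy the dict next to the checkpoints" part of that proposal would be a near one-liner. A sketch, with hypothetical `data_dir`/`save_dir` arguments (this is not existing fairseq code):

```python
import os
import shutil

def snapshot_labels(data_dir, save_dir, labels="ltr"):
    # keep the label dictionary with the checkpoints so inference never
    # picks up a stale dict.ltr.txt from some other dataset directory
    os.makedirs(save_dir, exist_ok=True)
    src = os.path.join(data_dir, f"dict.{labels}.txt")
    return shutil.copy(src, save_dir)
```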
As a new researcher in this field, it would be great if you could put everything in one place. I've been struggling with wav2vec for a couple of weeks now.
Thank you in advance.
Hey, for starters I think the guy in this answer got a lot of things right, especially when it comes to proper dataset formatting, so you could try starting here: https://github.com/pytorch/fairseq/issues/2493#issuecomment-719915281
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
🚀 Feature Request
Provide an example line formatted in the way expected by the training/fine-tuning scripts. Like:
Motivation
Right now you have to reason through the whole data preparation scripts to grasp what the expected format is, and they only work for LibriSpeech anyway: https://github.com/pytorch/fairseq/tree/master/examples/wav2vec. I mean, I still don't get what files are necessary to run the fine-tuning and what each file looks like.
Pitch
Adding about three lines of documentation to each preprocessing script, or to the README, doesn't seem too hard, I imagine.