fakufaku opened this issue 3 years ago
@fakufaku Sorry for my late response. I wasn't checking issues in this repository frequently.
It seems that the branch wsj1_mix_spatialized contains the code to do so in the recipe egs/wsj1_mix_spatialized/asr1.
Yes, this branch is intended for reproduction of the results in that paper.
After struggling a bit, I have managed to create the necessary data, but now I am running into some problems when running stage 2 of run.sh. In particular, the file data/tr_spatialized_anechoic_multich/data.json is not found. For the equivalent reverb folder it was generated during stage 1 I believe.
I assume the following directories should be generated after Stage 1:
$ ls data/
cv tr_spatialized_reverb_multich
cv_spatialized_anechoic_multich tt
cv_spatialized_reverb_multich tt_spatialized_anechoic_multich
tr tt_spatialized_reverb_multich
tr_spatialized_anechoic_multich wsj
But is the anechoic data also used during training?
Yes. Both anechoic and reverberant versions of the training data are used to improve the performance.
Also, is there a more recent version of the recipe somewhere?
For this paper, I don't have an updated recipe. But if you are interested, I have a new branch numerical_stability
for the follow-up paper "End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend".
You can find the corresponding recipe in https://github.com/Emrys365/espnet/tree/numerical_stability/egs/wsj1_mix_spatialized/asr1.
That branch is not documented yet. But the data is the same as in the current branch.
After some checks, I did miss a few lines of code in Stage 2 for preparing the required data. You can quickly fix the issue by replacing egs/wsj1_mix_spatialized/asr1/run.sh#L164 with:
for setname in tr_spatialized_reverb_multich tr_spatialized_anechoic_multich cv_spatialized_reverb_multich cv_spatialized_anechoic_multich tt_spatialized_reverb_multich tt_spatialized_anechoic_multich; do
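As a sanity check after re-running stage 1 with that fix, one can verify that every spatialized subset directory was actually produced before moving on to stage 2. A minimal stdlib sketch (the subset names come from the loop above; the helper name is mine, not part of the recipe):

```python
from pathlib import Path

# Subset directories expected under data/ after stage 1
# (same six names as in the fixed for-loop above).
EXPECTED_SETS = [
    f"{split}_spatialized_{cond}_multich"
    for split in ("tr", "cv", "tt")
    for cond in ("reverb", "anechoic")
]

def missing_subsets(data_root):
    """Return the expected subset directories absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SETS if not (root / name).is_dir()]

# Usage: missing_subsets("data") should return [] once stage 1 succeeded.
```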
Hi @Emrys365 thank you very much for the reply! I had paused this a little bit, will try to get back to it now 😄
I have managed to produce the data and train a model. The graphs from training are as follows.
[Accuracy curve]
[Loss curve]
However, at test time, I obtain values around 60% and 80% for the CER and WER so something seems wrong.
I should mention that I have also merged the branch with the latest version of ESPnet.
Do the training accuracy and loss look similar to what you have obtained in your experiments ?
Thanks a lot for your help!
Oh, that looks like it's not working.
Could you tell me some more details about your experiments?
BTW, I am also not sure whether the performance will be similar when my code is ported to the latest ESPnet. The ASR performance can sometimes differ across versions even with the same config (I don't expect the difference to be too large, though). So if you want to reproduce the results, I recommend using a version close to the one I used.
Thanks for the quick reply! Which part hints that it is not working? Is it only the poor CER/WER, or can you tell from the graphs too? What are the expected values to reach for accuracy (main/validation) and loss?
The reason I have upgraded to the latest ESPnet is that I want to try some custom frontends based on more recent PyTorch versions (1.8+) with native complex type support.
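An aside on what native complex support buys here: with a built-in complex type (Python's `complex` below, standing in for the `torch.complex64` tensors available since PyTorch 1.8), frequency-domain beamforming reduces to plain conjugate-multiply-sum arithmetic, with no hand-rolled real/imaginary bookkeeping. A toy delay-and-sum sketch, purely illustrative and not the ESPnet frontend:

```python
import cmath

def delay_and_sum(channel_specs, steering):
    """Toy frequency-domain delay-and-sum beamformer for one TF bin:
    apply conjugate steering weights per channel, then average.
    channel_specs and steering are lists of complex STFT values,
    one entry per microphone."""
    assert len(channel_specs) == len(steering)
    out = sum(w.conjugate() * x for w, x in zip(steering, channel_specs))
    return out / len(channel_specs)

# A source arriving in phase after compensation adds coherently:
delay = cmath.exp(1j * 0.7)  # arbitrary phase shift on the second mic
x = [1 + 0j, delay]          # same source, delayed on the second mic
w = [1 + 0j, delay]          # steering vector matching the delay
print(abs(delay_and_sum(x, w)))  # coherent sum: magnitude close to 1.0
```

With a mismatched steering vector (e.g. `[1, 1]`), the channels add partially out of phase and the output magnitude drops below 1 — the same arithmetic PyTorch now performs directly on complex tensors.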
Which part hints that it is not working? Is it only the poor CER/WER, or can you tell from the graphs too?
According to the curves, there is a large gap between the training and validation curves, which means the model is likely overfitting; this may be due to numerical stability issues or other problems.
What are the expected values to reach for accuracy (main/validation) and loss?
For the WPE+MVDR model, I would expect validation/main/acc to be over 90%, while main/loss and validation/main/loss should be lower than 50.
The reason I have upgraded to the latest ESPnet is that I want to try some custom frontends based on more recent PyTorch versions (1.8+) with native complex type support.
I see. Then I would recommend trying the numerically more stable implementation in the numerical_stability branch, as I mentioned above.
I see, thank you very much, this is precious information 😄
Re-reading the paper, I notice that the transformer model requires pre-training. I don't think this is included in the recipe (or at least I did not find which part does it). Could this be the reason for the poor performance?
Also, I did not find how to use the RNN model. In this branch, the conf is using e2e_asr_mix_transformer, which seems to be the transformer model. Is there an equivalent for the RNN?
I'm thinking of switching to the new numerical_stability branch if you think it may make things much easier.
Actually, I think I found the RNN model in espnet.nets.pytorch_backend.e2e_asr_mix:E2E. It doesn't look like conf/tuning/train_rnn.yaml was really the corresponding conf file, so I modified train_multispkr512_trans.yaml to work with the RNN model. I am now training a model and will let you know how that works.
Re-reading the paper, I notice that the transformer model requires pre-training. I don't think this is included in the recipe (or at least I did not find which part does it). Could this be the reason for the poor performance?
Actually, it is the WPD-based model that requires pre-training in the paper "End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming". The config file conf/tuning/train_multispkr512_trans.yaml is using MVDR beamforming, which does not require pre-training in my experiments.
And I didn't include a stage for pre-training the ASR backend in the recipe. In case you want to do that, you could train an ASR model on the original WSJ corpus (16k) using the egs/wsj/asr1 recipe, or download one from the links given in egs/wsj/asr1/RESULTS.md.
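For background on the MVDR beamforming mentioned above: the textbook MVDR solution computes weights w = Rn⁻¹d / (dᴴRn⁻¹d), where Rn is the noise covariance and d the steering vector, and its defining property is the distortionless constraint wᴴd = 1. A toy two-microphone sketch in plain Python (illustrative only; the recipe's actual frontend implementation lives in ESPnet):

```python
def mvdr_weights(Rn, d):
    """MVDR weights for 2 mics: w = Rn^-1 d / (d^H Rn^-1 d).
    Rn: 2x2 Hermitian noise covariance (nested lists of complex),
    d:  length-2 complex steering vector."""
    (a, b), (c, e) = Rn
    det = a * e - b * c
    inv = [[e / det, -b / det], [-c / det, a / det]]  # 2x2 matrix inverse
    rinv_d = [inv[0][0] * d[0] + inv[0][1] * d[1],
              inv[1][0] * d[0] + inv[1][1] * d[1]]    # Rn^-1 d
    denom = d[0].conjugate() * rinv_d[0] + d[1].conjugate() * rinv_d[1]
    return [x / denom for x in rinv_d]

# The distortionless constraint w^H d = 1 holds for any valid Rn,
# i.e. the beamformer passes the target direction unchanged while
# minimizing noise power from other directions.
```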
Also, I did not find how to use the RNN model. In this branch, the conf is using e2e_asr_mix_transformer, which seems to be the transformer model. Is there an equivalent for the RNN?
Here, the suffix transformer only indicates that we are using a transformer-based ASR backend; the frontend is always RNN-based.
And I didn't upload a config file for an RNN-based backend, because in our preliminary experiments [1] [2] we found that the transformer-based ASR backend significantly outperforms the RNN-based one.
If you do want to train with an RNN-based ASR backend, you could try the following configuration:
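The configuration the author attached did not survive in this log. Purely as an illustrative stand-in — the field names follow common ESPnet1 RNN recipes, and every value below is a guess rather than the author's actual setting — such a config might look like:

```yaml
# Hypothetical sketch only; not the config posted in this thread.
etype: vggblstmp   # encoder: VGG blocks + BLSTM with projection
elayers: 3
eunits: 1024
eprojs: 1024
dtype: lstm        # RNN decoder
dlayers: 2
dunits: 300
atype: location    # location-aware attention
adim: 320
mtlalpha: 0.2      # CTC weight in the CTC/attention multi-task loss
opt: adadelta
epochs: 30
batchsize: 16
```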
I'm thinking of switching to the new numerical_stability branch if you think it may make things much easier.
Yes, I do recommend it. The new branch has been verified on various datasets to improve the numerical stability during training.
Ouch! use_WPD_frontend is set to True in run.sh 😅 So I will set it to False and try again 😄