Emrys365 / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

Problem when running stage 2 of `egs/wsj1_mix_spatialized/asr1` #10

Open fakufaku opened 3 years ago

fakufaku commented 3 years ago

Describe the issue

Hi! I would like to reproduce the experiment from the paper "End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming". It seems that the branch wsj1_mix_spatialized contains the code to do so in the recipe egs/wsj1_mix_spatialized/asr1. After struggling a bit, I have managed to create the necessary data, but now I am running into problems with stage 2 of run.sh. In particular, the file data/tr_spatialized_anechoic_multich/data.json is not found; the equivalent file for the reverb folder was generated during stage 1, I believe. Is the anechoic data also used during training? Also, is there a more recent version of the recipe somewhere?

Thanks for your help!

Below, the error.

```
> ./run.sh --stage 2 --stop_stage 2
dictionary: data/lang_1char/tr_units.txt
stage 2: Dictionary and Json Data Preparation
make a non-linguistic symbol list
<*IN*>
<*MR.*>
<NOISE>
make a dictionary
make json files
local/data2json.sh --cmd run.pl --nj 30 --num-spkrs 2 --category multichannel --preprocess-conf conf/preprocess.yaml --filetype sound.hdf5 --feat data/tr_spatialized_reverb_multich/feats.scp --nlsyms data/lang_1char/non_lang_syms.txt --out data/tr_spatialized_reverb_multich/data.json data/tr_spatialized_reverb_multich data/lang_1char/tr_units.txt
/mnt/audio-research-datasets/robin/MIMO-IVA-ASR/espnet/egs/wsj1_mix_spatialized/asr1/../../../utils/feat_to_shape.sh --cmd run.pl --nj 30 --filetype sound.hdf5 --preprocess-conf conf/preprocess.yaml --verbose 0 data/tr_spatialized_reverb_multich/feats.scp data/tr_spatialized_reverb_multich/tmp-1pijL/input/shape.scp
local/data2json.sh --cmd run.pl --nj 30 --num-spkrs 2 --category multichannel --preprocess-conf conf/preprocess.yaml --filetype sound.hdf5 --feat data/cv_spatialized_reverb_multich/feats.scp --nlsyms data/lang_1char/non_lang_syms.txt --out data/cv_spatialized_reverb_multich/data.json data/cv_spatialized_reverb_multich data/lang_1char/tr_units.txt
/mnt/audio-research-datasets/robin/MIMO-IVA-ASR/espnet/egs/wsj1_mix_spatialized/asr1/../../../utils/feat_to_shape.sh --cmd run.pl --nj 30 --filetype sound.hdf5 --preprocess-conf conf/preprocess.yaml --verbose 0 data/cv_spatialized_reverb_multich/feats.scp data/cv_spatialized_reverb_multich/tmp-TLmY4/input/shape.scp
local/data2json.sh --cmd run.pl --nj 30 --num-spkrs 2 --category multichannel --preprocess-conf conf/preprocess.yaml --filetype sound.hdf5 --feat data/tt_spatialized_reverb_multich/feats.scp --nlsyms data/lang_1char/non_lang_syms.txt --out data/tt_spatialized_reverb_multich/data.json data/tt_spatialized_reverb_multich data/lang_1char/tr_units.txt
/mnt/audio-research-datasets/robin/MIMO-IVA-ASR/espnet/egs/wsj1_mix_spatialized/asr1/../../../utils/feat_to_shape.sh --cmd run.pl --nj 30 --filetype sound.hdf5 --preprocess-conf conf/preprocess.yaml --verbose 0 data/tt_spatialized_reverb_multich/feats.scp data/tt_spatialized_reverb_multich/tmp-EeLVv/input/shape.scp
local/data2json.sh --cmd run.pl --nj 30 --num-spkrs 1 --category singlespeaker --preprocess-conf conf/preprocess.yaml --filetype sound.hdf5 --feat data/train_si284/feats.scp --nlsyms data/lang_1char/non_lang_syms.txt --out data/train_si284/data.json data/train_si284 data/lang_1char/tr_units.txt
/mnt/audio-research-datasets/robin/MIMO-IVA-ASR/espnet/egs/wsj1_mix_spatialized/asr1/../../../utils/feat_to_shape.sh --cmd run.pl --nj 30 --filetype sound.hdf5 --preprocess-conf conf/preprocess.yaml --verbose 0 data/train_si284/feats.scp data/train_si284/tmp-tIZJP/input/shape.scp
2021-07-07 10:59:12,955 (concatjson:34) INFO: /mnt/audio-research-datasets/robin/MIMO-IVA-ASR/espnet/tools/venv/bin/python3 /mnt/audio-research-datasets/robin/MIMO-IVA-ASR/espnet/egs/wsj1_mix_spatialized/asr1/../../../utils/concatjson.py data/tr_spatialized_reverb_multich/data.json data/train_si284/data.json data/tr_spatialized_anechoic_multich/data.json
Traceback (most recent call last):
  File "/mnt/audio-research-datasets/robin/MIMO-IVA-ASR/espnet/egs/wsj1_mix_spatialized/asr1/../../../utils/concatjson.py", line 39, in <module>
    with codecs.open(x, encoding="utf-8") as f:
  File "/mnt/audio-research-datasets/robin/MIMO-IVA-ASR/espnet/tools/venv/lib/python3.8/codecs.py", line 905, in open
    file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'data/tr_spatialized_anechoic_multich/data.json'
```
Emrys365 commented 2 years ago

@fakufaku Sorry for my late response. I wasn't checking issues in this repository frequently.

> It seems that the branch wsj1_mix_spatialized contains the code to do so in the recipe egs/wsj1_mix_spatialized/asr1.

Yes, this branch is intended for reproduction of the results in that paper.

> After struggling a bit, I have managed to create the necessary data, but now I am running into problems with stage 2 of run.sh. In particular, the file data/tr_spatialized_anechoic_multich/data.json is not found; the equivalent file for the reverb folder was generated during stage 1, I believe.

I assume the following directories should be generated after Stage 1:

```
$ ls data/
cv                               tr_spatialized_reverb_multich
cv_spatialized_anechoic_multich  tt
cv_spatialized_reverb_multich    tt_spatialized_anechoic_multich
tr                               tt_spatialized_reverb_multich
tr_spatialized_anechoic_multich  wsj
```

> Is the anechoic data also used during training?

Yes. Both anechoic and reverberant versions of the training data are used to improve the performance.
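That is why Stage 2 concatenates three jsons for training, as in the concatjson.py call from your log. Roughly, the assembly looks like this (the output path below is only a placeholder, not necessarily the one used in run.sh):

```bash
# Build the combined training json from the reverberant, anechoic, and
# single-speaker WSJ sets, following the concatjson.py call in the Stage 2 log.
# concatjson.py writes the merged json to stdout; the output path is a placeholder.
concatjson.py \
    data/tr_spatialized_reverb_multich/data.json \
    data/train_si284/data.json \
    data/tr_spatialized_anechoic_multich/data.json \
    > data/tr_spatialized_all/data.json
```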

> Also, is there a more recent version of the recipe somewhere?

For this paper, I don't have an updated recipe. But if you are interested, I have a new branch, numerical_stability, for the follow-up paper "End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend". You can find the corresponding recipe at https://github.com/Emrys365/espnet/tree/numerical_stability/egs/wsj1_mix_spatialized/asr1. That branch is not documented yet, but the data is the same as in the current branch.


After some checks, I found that I did indeed miss a few lines of code in Stage 2 for preparing the required data. You can quickly fix the issue by replacing egs/wsj1_mix_spatialized/asr1/run.sh#L164 with:

```bash
for setname in tr_spatialized_reverb_multich tr_spatialized_anechoic_multich cv_spatialized_reverb_multich cv_spatialized_anechoic_multich tt_spatialized_reverb_multich tt_spatialized_anechoic_multich; do
```
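For reference, here is a rough sketch of how the full Stage 2 loop would then look, mirroring the local/data2json.sh calls in your log. The exact loop body in run.sh may differ slightly, and ${train_cmd} is an assumption (the log shows --cmd run.pl):

```bash
# Hypothetical sketch of the corrected Stage 2 loop (not the exact run.sh code):
# iterate over both the reverberant and anechoic sets and build data.json for each,
# mirroring the local/data2json.sh invocations in the log above.
for setname in tr_spatialized_reverb_multich tr_spatialized_anechoic_multich \
               cv_spatialized_reverb_multich cv_spatialized_anechoic_multich \
               tt_spatialized_reverb_multich tt_spatialized_anechoic_multich; do
    local/data2json.sh --cmd "${train_cmd}" --nj 30 --num-spkrs 2 \
        --category multichannel \
        --preprocess-conf conf/preprocess.yaml --filetype sound.hdf5 \
        --feat data/${setname}/feats.scp \
        --nlsyms data/lang_1char/non_lang_syms.txt \
        --out data/${setname}/data.json \
        data/${setname} data/lang_1char/tr_units.txt
done
```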
fakufaku commented 2 years ago

Hi @Emrys365, thank you very much for the reply! I had paused this for a little while, but will try to get back to it now 😄

I have managed to produce the data and train a model. The training curves are as follows.

[Attached plots: training/validation accuracy (acc) and loss (loss)]

However, at test time I obtain CER and WER values of around 60% and 80% respectively, so something seems wrong.

I should mention that I have also merged the branch with the latest version of ESPnet.

Do the training accuracy and loss look similar to what you obtained in your experiments?

Thanks a lot for your help!

Emrys365 commented 2 years ago

Oh, that does not look like it is working.

Could you give me some more details about your experiments?

By the way, I am also not sure whether the performance will be similar when my code is ported to the latest ESPnet. The ASR performance can sometimes differ across versions even with the same config (I don't expect the difference to be too large, though). So, if you want to reproduce the results, it is recommended to use a version close to the one I used.

fakufaku commented 2 years ago

Thanks for the quick reply! Which part hints that it is not working? Is it only the poor CER/WER, or can you tell from the graphs too? What values are expected for accuracy (main/validation) and loss?

The reason I have upgraded to the latest ESPnet is that I want to try a custom frontend based on more recent PyTorch versions (1.8+) with native complex type support.

Emrys365 commented 2 years ago

> Which part hints that it is not working? Is it only the poor CER/WER, or can you tell from the graphs too?

According to the curves, there is a large gap between the training and validation performance, so the model is likely overfitting; this may be due to numerical stability issues or other problems.

> What values are expected for accuracy (main/validation) and loss?

For the WPE+MVDR model, I would expect validation/main/acc to be over 90%, while main/loss and validation/main/loss should be lower than 50.

> The reason I have upgraded to the latest ESPnet is that I want to try a custom frontend based on more recent PyTorch versions (1.8+) with native complex type support.

I see. Then I would recommend trying the numerically more stable implementation in the numerical_stability branch, as I mentioned above.

fakufaku commented 2 years ago

I see, thank you very much for this precious information 😄

Re-reading the paper, I notice that the transformer model requires pre-training. I don't think this is included in the recipe (or at least I did not find which part does it), so could this be the reason for the poor performance? Also, I did not find how to use the RNN model. In this branch, the config uses e2e_asr_mix_transformer, which seems to be the transformer model. Is there an equivalent for the RNN?

I'm thinking of switching to the new numerical_stability branch if you think it may make things much easier.

fakufaku commented 2 years ago

Actually, I think I found the RNN model in espnet.nets.pytorch_backend.e2e_asr_mix:E2E. It doesn't look like the file conf/tuning/train_rnn.yaml was really the corresponding config file, so I modified train_multispkr512_trans.yaml to work with the RNN model. I am now training a model and will let you know how that works.

Emrys365 commented 2 years ago

> Re-reading the paper, I notice that the transformer model requires pre-training. I don't think this is included in the recipe (or at least I did not find which part does it), so could this be the reason for the poor performance?

Actually, it is the WPD-based model that requires pre-training in the paper "End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming". The config file conf/tuning/train_multispkr512_trans.yaml uses MVDR beamforming, which did not require pre-training in my experiments.

And I didn't include a stage for pre-training the ASR backend in the recipe. In case you want to do that, you could train an ASR model on the original WSJ corpus (16k) using the egs/wsj/asr1 recipe, or download one from the links given in egs/wsj/asr1/RESULTS.md.
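If you decide to pre-train, one way to hook the pre-trained model into training is ESPnet1's --enc-init/--dec-init options of asr_train.py; whether this branch exposes them unchanged is an assumption, and the model path below is a placeholder:

```bash
# Hypothetical warm-start of the ASR backend from a model trained with egs/wsj/asr1
# (or downloaded via the links in egs/wsj/asr1/RESULTS.md).
# --enc-init/--dec-init are standard ESPnet1 asr_train.py options; the path is a placeholder.
pretrained=../../wsj/asr1/exp/train_si284_pytorch_train/results/model.acc.best
extra_opts="--enc-init ${pretrained} --dec-init ${pretrained}"
# ${extra_opts} would then be appended to the asr_train.py call in run.sh's training stage.
```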

> Also, I did not find how to use the RNN model. In this branch, the config uses e2e_asr_mix_transformer, which seems to be the transformer model. Is there an equivalent for the RNN?

Here, the suffix transformer only indicates that we are using a transformer-based ASR backend; the frontend is always RNN-based. I didn't upload a config file for an RNN-based backend, because in our preliminary experiments [1] [2] we found that the transformer-based ASR backend significantly outperforms the RNN-based one. If you do want to train with an RNN-based ASR backend, you could try the following configuration:

```yaml
# network architecture
model-module: espnet.nets.pytorch_backend.e2e_asr_mix:E2E

# encoder related
etype: vggblstmp
elayers-sd: 0        # number of speaker differentiate encoder layers
elayers: 3           # number of recognition encoder layers
eunits: 1024
eprojs: 1024
subsample: 1_2_2_1_1 # skip every n frame from input to nth layers

# decoder related
dlayers: 1
dunits: 300

# attention related
atype: location
adim: 320
awin: 5
aheads: 4
aconv-chans: 10
aconv-filts: 100

# hybrid CTC/attention
mtlalpha: 0.2

# label smoothing
lsm-type: unigram
lsm-weight: 0.05

# minibatch related
batch-size: 8
#batch-size: 8
#maxlen-in: 1000 # if input length > maxlen-in, batchsize is automatically reduced
maxlen-out: 150  # if output length > maxlen-out, batchsize is automatically reduced

# optimization related
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
opt: adadelta
accum-grad: 1
grad-clip: 5
patience: 3
epochs: 20
dropout-rate: 0.0

# scheduled sampling option
sampling-probability: 0.0

# CMVN
stats-file: fbank/tr_spatialized_all/cmvn.ark
apply-uttmvn: True

# reporting
#report-wer: True

# frontend related
use-frontend: True
use-beamforming-first: False

# beamforming
use-beamformer: True
blayers: 3
bnmask: 3
bunits: 512
bprojs: 512
beamformer-type: mvdr

# WPE
use-wpe: True
use-dnn-mask-for-wpe: True
wlayers: 2
wunits: 300
wprojs: 300
wpe-taps: 5 #1
```
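If you save this as a new file, e.g. conf/tuning/train_rnn_mix.yaml (the name is arbitrary), it would presumably be passed to the recipe like this, assuming run.sh parses a train_config variable via utils/parse_options.sh and follows the usual ESPnet1 stage numbering where stage 4 is network training (both are assumptions, not verified against this run.sh):

```bash
# Hypothetical invocation; the --train_config option and the stage numbers are
# assumptions based on standard ESPnet1 recipes.
./run.sh --stage 4 --stop_stage 4 --train_config conf/tuning/train_rnn_mix.yaml
```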

> I'm thinking of switching to the new numerical_stability branch if you think it may make things much easier.

Yes, I do recommend it. The new branch has been verified on various datasets to improve the numerical stability during training.
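Switching an existing clone of this fork to that branch would look roughly like this (the remote name origin is an assumption):

```bash
# Check out the recommended branch in an existing clone of this fork,
# then move into the corresponding recipe directory.
git fetch origin
git checkout numerical_stability
cd egs/wsj1_mix_spatialized/asr1
```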

fakufaku commented 2 years ago

Ouch! use_WPD_frontend is set to True in run.sh 😅 So I will set it to False and try again 😄
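For anyone else hitting this, the change is presumably just flipping that variable in run.sh, something like:

```bash
# Disable the WPD frontend so the MVDR-based config (which needs no pre-training) is used.
# The variable name is taken from the comment above; the exact value/casing expected
# by run.sh is an assumption.
use_WPD_frontend=false
```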