1) VoxForge importer with corpora_importer util
2) Important fixes to the other importers
3) Generation of the mitads-speech dataset with all importers (see config file: mitads-speech-full.yaml)
4) Training test
Total hours obtained after the corpora_collector process: 349.04 hours
I tested DeepSpeech training on this dataset for only 10 epochs, using the parameters from the notebook training example.
Median WER: 0.128205 (overall on the test set: WER 0.152582, CER 0.053754)
Output:
(ds_train_dev) ubuntu@deepspeech:~/ds_eziolotta$ python DeepSpeech.py --show_progressbar True \
--train_cudnn True \
--alphabet_config_path /home/ubuntu/deep_speech_models/italian_alphabet.txt \
--scorer /home/ubuntu/deep_speech_models/0.8/kenlm.scorer \
--feature_cache /home/ubuntu/deep_speech_models/temp_train/sources/feature_cache \
--train_files ${all_train_csv} \
--dev_files ${all_dev_csv} \
--test_files ${all_test_csv} \
--train_batch_size 64 \
--dev_batch_size 64 \
--test_batch_size 64 \
--n_hidden 2048 \
--epochs 10 \
--learning_rate 0.0001 \
--dropout_rate 0.4 \
--max_to_keep 3 \
--checkpoint_dir /home/ubuntu/deep_speech_models/ckpts/ita/deepspeech-0.9.3-checkpoint \
--summary_dir /home/ubuntu/deep_speech_models/temp_train/tboard_logs \
--early_stop \
--es_epochs 10 \
--automatic_mixed_precision true \
--log_level 1
[.......]
Epoch 8 | Training | Elapsed Time: 0:31:04 | Steps: 1483 | Loss: 25.444996
Epoch 8 | Validation | Elapsed Time: 0:01:44 | Steps: 219 | Loss: 28.228351 | Dataset: /mitads-speech-dataset/mitads-speech-full_v0.1/dev.csv
I Saved new best validating model with loss 28.228351 to: /home/ubuntu/deep_speech_models/ckpts/ita/deepspeech-0.9.3-checkpoint/best_dev-17783
--------------------------------------------------------------------------------
Epoch 9 | Training | Elapsed Time: 0:31:06 | Steps: 1483 | Loss: 23.981334
Epoch 9 | Validation | Elapsed Time: 0:01:45 | Steps: 219 | Loss: 27.230283 | Dataset: /mitads-speech-dataset/mitads-speech-full_v0.1/dev.csv
I Saved new best validating model with loss 27.230283 to: /home/ubuntu/deep_speech_models/ckpts/ita/deepspeech-0.9.3-checkpoint/best_dev-19266
--------------------------------------------------------------------------------
I FINISHED optimization in 5:30:02.122314
WARNING:tensorflow:From /home/ubuntu/miniconda3/envs/ds_train_dev/lib/python3.7/site-packages/tensorflow_core/contrib/rnn/python/ops/lstm_ops.py:597: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
W0318 16:57:11.119858 140022320514880 deprecation.py:323] From /home/ubuntu/miniconda3/envs/ds_train_dev/lib/python3.7/site-packages/tensorflow_core/contrib/rnn/python/ops/lstm_ops.py:597: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
I Loading best validating checkpoint from /home/ubuntu/deep_speech_models/ckpts/ita/deepspeech-0.9.3-checkpoint/best_dev-19266
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /mitads-speech-dataset/mitads-speech-full_v0.1/test.csv
Test epoch | Steps: 219 | Elapsed Time: 0:26:37
Test on /mitads-speech-dataset/mitads-speech-full_v0.1/test.csv - WER: 0.152582, CER: 0.053754, loss: 27.156124
--------------------------------------------------------------------------------
Best WER:
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.013514, loss: 61.034695
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/voxforge/it-0991-copy-8647.wav
- src: "si tratta semplicemente di far rotolare attraverso i boschi questi macigni"
- res: "si tratta semplicemente di far rotolare attraverso i boschi questi macigni "
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.012346, loss: 47.484196
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/voxforge/it-0775-copy-8726.wav
- src: "non saprei rispose questi lanciando uno sguardo inquieto verso gli alberi giganti"
- res: "non saprei rispose questi lanciando uno sguardo inquieto verso gli alberi giganti "
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.018182, loss: 45.007946
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/voxforge/it-1065-copy-9270.wav
- src: "e voi credete cavaliere che egli possa sospettare di me"
- res: "e voi credete cavaliere che egli possa sospettare di me "
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.018519, loss: 44.839371
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/siwis/IT_B_32_312.wav
- src: "in breve si tratta di un problema di grande importanza"
- res: "in breve si tratta di un problema di grande importanza "
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.008621, loss: 41.642883
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/siwis/IT_C_36_196.wav
- src: "la sospensione delle vendite attuata da una catena di supermercati non sono notizie da passare all'opinione pubblica"
- res: "la sospensione delle vendite attuata da una catena di supermercati non sono notizie da passare all'opinione pubblica "
--------------------------------------------------------------------------------
Median WER:
--------------------------------------------------------------------------------
WER: 0.128205, CER: 0.021930, loss: 27.179636
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/mls/2033_1596_001617.wav
- src: "dalla platea e dalle gallerie i ragazzi applaudivano ogni volta che passava uno molto piccolo o uno che dai vestiti paresse povero e anche quelli che avevano delle gran capigliature ricciolute o eran vestiti di rosso o di bianco"
- res: "dalla platea e dalle gallerie i ragazzi applaudivano ogni volta che passava uno molto piccolo o uno che dai vestiti paresse povero e anche quelli che avevano delle gran capigliature riccioluta o era un vestito di rosso e di bianco"
--------------------------------------------------------------------------------
WER: 0.128205, CER: 0.037037, loss: 23.368874
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/mls/6348_5862_000104.wav
- src: "manco male che due portinaj in via volturno uno in via gaeta un altro in via palestro gli eran rimasti fedeli e lo aspettavano le altre copie doveva venderle cosí alla ventura girando per tutto il quartiere del macao"
- res: "manco male che due portinai via volturno uno in via gaeta un altro via palestro gli erano rimasti fedeli e lo aspettavano le altre copie doveva venderle così alla ventura girando per tutto il quartiere del macao"
--------------------------------------------------------------------------------
WER: 0.128205, CER: 0.031390, loss: 19.807188
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/mls/8828_8610_000150.wav
- src: "barch cioè vedi lo fai dire anche a me i dia due paja di bacchette e dàlli calosce per queste bambine le chiama barchette la mia piccina veramente si potrebbero anche chiamare cosí per non usare quella parolaccia forestiera"
- res: "perche cioè vedi lo fai dire anche a me mi dia due paia di bacchette e dalli calosce per queste bambine le chiama barchette la mia piccina veramente si potrebbero anche chiamare così per non usare quella parolaccia forestiera"
--------------------------------------------------------------------------------
WER: 0.128205, CER: 0.018100, loss: 13.439837
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/mls/8828_8610_000109.wav
- src: "e subito tutte le membra le si rilassarono cosí che non poté neanche sollevare le gracili mani per nascondersi il volto ma la vecchia mamma le si accostò e posandole lievemente una mano sulla spalla figlia mia le annunziò"
- res: "e subito tutte le membra le si rilassarono così che non pote neanche sollevare le gracili mani per nascondersi il volto ma la vecchia mamma le si accostò e posando le lievemente una mano sulla spalla figlia mia e annunziò"
--------------------------------------------------------------------------------
WER: 0.129032, CER: 0.069444, loss: 90.858200
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/mls/4975_4125_000201.wav
- src: "sospesi consci dell'orribile impressione che sua eccellenza destava in tutta la cittadinanza e infatti parve a tutti che il cielo il gajo aspetto della nostra bianca cittadina s'oscurassero a quell'apparizione ispida"
- res: "consci dell'orribile impressione che sua eccellenza destava in tutta la cittadinanza e infatti parve a tutti che il cielo il grassetto della nostra bianca cittadina oscurassero a quell'apparizione ispida"
--------------------------------------------------------------------------------
Worst WER:
--------------------------------------------------------------------------------
WER: 2.000000, CER: 1.333333, loss: 8.697345
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/evalita2009/clean00866.wav
- src: "due"
- res: "e a "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.500000, loss: 8.080779
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/evalita2009/clean02285.wav
- src: "nove"
- res: "no e "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.500000, loss: 6.972440
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/evalita2009/clean02275.wav
- src: "nove"
- res: "no e "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.500000, loss: 5.985777
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/evalita2009/clean02975.wav
- src: "nove"
- res: "no ve "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.750000, loss: 5.125335
- wav: file:///mitads-speech-dataset/mitads-speech-full_v0.1/audios/evalita2009/clean02895.wav
- src: "nove"
- res: "no me "
--------------------------------------------------------------------------------
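The WER and CER figures in the report can be easier to read with the metric definitions at hand. The following is a minimal sketch, assuming WER and CER are plain Levenshtein distances (over words and over characters) divided by the length of the reference. It is not DeepSpeech's own evaluation code, and the function names are made up for illustration. Under that assumption it reproduces two things visible above: WER can exceed 1.0 when a one-word reference such as "due" is recognized as two wrong words, and a trailing space in `res` gives a nonzero CER even when WER is 0.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(src, res):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = src.split()
    return edit_distance(ref_words, res.split()) / len(ref_words)


def cer(src, res):
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(src, res) / len(src)


if __name__ == "__main__":
    # Sample taken from the "Worst WER" block of the log.
    print(f"WER: {wer('due', 'e a '):.6f}, CER: {cer('due', 'e a '):.6f}")
    # Sample taken from the "Best WER" block: only a trailing space differs.
    best = "in breve si tratta di un problema di grande importanza"
    print(f"WER: {wer(best, best + ' '):.6f}, CER: {cer(best, best + ' '):.6f}")
```

Since the worst cases are all single-word evalita2009 utterances, one inserted word is enough to push the per-sample WER to 2.0, so these samples weigh heavily on any per-sample average.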
COLLECTED CORPUS------MINUTES-------SPEAKERS
voxforge--------------1202.29-------1062
mls-------------------11274.23------59
mspka-----------------175.49--------3
m-ailabs--------------7681.88-------208
evalita2009-----------341.77--------?
siwis-----------------266.97--------16
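As a quick consistency check, the 349.04-hour total can be recomputed from the per-corpus minutes listed above. This is only an illustrative sketch: the dictionary and function names are hypothetical and not part of corpora_collector, and the values are copied by hand from the listing.

```python
# Per-corpus durations in minutes, copied from the listing above.
MINUTES_BY_CORPUS = {
    "voxforge": 1202.29,
    "mls": 11274.23,
    "mspka": 175.49,
    "m-ailabs": 7681.88,
    "evalita2009": 341.77,
    "siwis": 266.97,
}


def total_hours(minutes_by_corpus):
    """Sum the per-corpus minutes and convert the result to hours."""
    return sum(minutes_by_corpus.values()) / 60.0


if __name__ == "__main__":
    total_minutes = sum(MINUTES_BY_CORPUS.values())
    print(f"Total: {total_hours(MINUTES_BY_CORPUS):.2f} hours")
    # Share of each corpus in the collected dataset, largest first.
    for name, minutes in sorted(MINUTES_BY_CORPUS.items(), key=lambda kv: -kv[1]):
        print(f"{name:12s} {minutes / 60:8.2f} h  {100 * minutes / total_minutes:5.1f} %")
```

The per-corpus shares show that mls and m-ailabs together account for most of the audio, which may help explain why the median-WER samples in the report all come from mls.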