coqui-ai / STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
https://coqui.ai
Mozilla Public License 2.0

Feature request: Make the downloader more robust #2047

Open wasertech opened 2 years ago

wasertech commented 2 years ago

Can't download LinguaLibre using import script

Problem

When you try to download LinguaLibre (at least the French part), the download stops every time at 51%.

How to reproduce

Just run the following command:

+ python /home/trainer/stt/bin/import_lingua_libre.py --qId 21 --iso639-3 fra --english-name French --validate_label_locale /home/trainer/fr/validate_label.py --bogus-records /home/trainer/fr/lingua_libre_skiplist.txt /mnt/extracted/data/lingualibre
No path "/mnt/extracted/data/lingualibre" - creating ...
No archive "/mnt/extracted/data/lingualibre/Q21-fra-French.zip" - downloading...
 51%|█████████████████████████████████████████████████████████▍                                                      | 1083305868/2112950650 [02:55<02:46, 6168695.24it/s]
No directory "/mnt/extracted/data/lingualibre/lingua_libre" - extracting archive...
Traceback (most recent call last):
  File "/home/trainer/stt/bin/import_lingua_libre.py", line 266, in <module>
    _download_and_preprocess_data(target_dir=CLI_ARGS.target_dir)
  File "/home/trainer/stt/bin/import_lingua_libre.py", line 41, in _download_and_preprocess_data
    _maybe_extract(target_dir, ARCHIVE_DIR_NAME, archive_path)
  File "/home/trainer/stt/bin/import_lingua_libre.py", line 53, in _maybe_extract
    with zipfile.ZipFile(archive_path) as zip_f:
  File "/usr/lib/python3.6/zipfile.py", line 1131, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.6/zipfile.py", line 1198, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
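Beyond fixing the downloader itself, the importers could cheaply validate an archive before handing it to the extractor, so a truncated download yields a clear "incomplete archive, please re-download" message instead of a bare `BadZipFile`. A minimal sketch (`check_archive` is a hypothetical helper, not part of `util/downloader.py`):

```python
import zipfile

def check_archive(path):
    """Return True when path is a readable zip archive with no corrupt members."""
    if not zipfile.is_zipfile(path):
        # e.g. a truncated download missing the end-of-central-directory record
        return False
    with zipfile.ZipFile(path) as zf:
        # testzip() returns the name of the first bad member, or None if all are OK
        return zf.testzip() is None
```

An importer could call this before `_maybe_extract` and delete/re-download the file when it returns `False`.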

Expected behavior

The compressed data should be entirely downloaded.

Hot fix

You can manually download the data using wget --continue https://lingualibre.org/datasets/Q21-fra-French.zip.

After chatting with @lissyx, we concluded that this fix needs more than adding the --continue flag to wget, as Coqui STT uses util/downloader.py to download data.
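One way to make util/downloader.py itself resumable is an HTTP Range request against whatever partial file is already on disk, which is essentially what `wget --continue` does. A sketch under that assumption (`download_with_resume` is a hypothetical helper, not the actual downloader code; a server that ignores Range simply answers 200 and the file is rewritten from scratch):

```python
import os
import requests

def download_with_resume(url, dest, chunk_size=1 << 20):
    """Download url to dest, resuming from a partial file if one exists."""
    done = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={done}-"} if done else {}
    with requests.get(url, headers=headers, stream=True, timeout=30) as r:
        if r.status_code == 416:
            # Requested range starts past EOF: the file is already complete
            return dest
        r.raise_for_status()
        # 206 Partial Content means the server honored the Range header,
        # so append; a plain 200 means we must start over from byte zero.
        mode = "ab" if r.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return dest
```

This alone would not fix a server that resets the connection at 51% every time, but combined with a retry loop it would let the download pick up where it left off instead of starting over.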

lissyx commented 2 years ago

Just to be clear, this repros with pure wget here, so it is likely a server-side issue. Nevertheless, I think making the downloader code more reliable to failures is a nice goal.

wasertech commented 2 years ago

Exactly, we need a more robust downloader for

wasertech commented 2 years ago

bin/import_m-ailabs.py is also affected.

root@95a2d562414f:/code/data/lm# python /code/bin/import_m-ailabs.py --skiplist monsieur_lecoq,les_mysteres_de_paris --language fr_FR /mnt/extracted/data/M-AILABS/2021-12-20 
17:38:53.774066: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING: No --validate_label_locale specified, you might end with inconsistent dataset.
No path "/mnt/extracted/data/M-AILABS" - creating ...
No archive "/mnt/extracted/data/M-AILABS/fr_FR.tgz" - downloading...
8343it [00:00, 16904868.73it/s]
No directory "/mnt/extracted/data/M-AILABS/fr_FR" - extracting archive...
Traceback (most recent call last):
  File "/code/bin/import_m-ailabs.py", line 242, in <module>
    _download_and_preprocess_data(target_dir=CLI_ARGS.target_dir)
  File "/code/bin/import_m-ailabs.py", line 37, in _download_and_preprocess_data
    _maybe_extract(target_dir, ARCHIVE_DIR_NAME, archive_path)
  File "/code/bin/import_m-ailabs.py", line 49, in _maybe_extract
    tar = tarfile.open(archive_path)
  File "/usr/lib/python3.6/tarfile.py", line 1576, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully

The reason tarfile throws a ReadError here is that the downloader never actually downloads the archive; only 8,343 bytes arrive (most likely a server error page rather than the tarball).

No archive "/mnt/extracted/data/M-AILABS/fr_FR.tgz" - downloading...
8343it [00:00, 16904868.73it/s]

Which produces:

root@95a2d562414f:/mnt/extracted/data/M-AILABS/2021-12-20# ls -l fr_FR.tgz 
-rw-r--r-- 1 root root 8343 Dec 20 19:38 fr_FR.tgz
root@95a2d562414f:/mnt/extracted/data/M-AILABS/2021-12-20# du -hs fr_FR.tgz 
12K     fr_FR.tgz
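A more robust downloader could catch exactly this failure by comparing the bytes it wrote against the server's Content-Length and retrying on mismatch, instead of passing an 8 KB error page on to tarfile. A sketch (`download_validated` is a hypothetical helper, not the actual util/downloader.py code):

```python
import requests

def download_validated(url, dest, retries=3, chunk_size=1 << 20):
    """Download url to dest, re-trying when the size disagrees with Content-Length."""
    for attempt in range(1, retries + 1):
        with requests.get(url, stream=True, timeout=30) as r:
            r.raise_for_status()
            # -1 means the server did not advertise a length; skip validation then
            expected = int(r.headers.get("Content-Length", -1))
            written = 0
            with open(dest, "wb") as f:
                for chunk in r.iter_content(chunk_size=chunk_size):
                    written += len(chunk)
                    f.write(chunk)
        if expected < 0 or written == expected:
            return dest
        print(f"attempt {attempt}: got {written} of {expected} bytes, retrying...")
    raise RuntimeError(f"download of {url} still incomplete after {retries} attempts")
```

With a check like this, the 12K fr_FR.tgz above would have been rejected and retried rather than handed to the extractor.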

Download the M-AILABS dataset manually

wget --continue https://data.solak.de/data/Training/stt_tts/fr_FR.tgz

Which should produce:

❯ sudo wget --continue https://data.solak.de/data/Training/stt_tts/fr_FR.tgz
[sudo] password for waser: 
--2021-12-20 21:09:55--  https://data.solak.de/data/Training/stt_tts/fr_FR.tgz
SSL_INIT
Loaded CA certificate "/etc/ssl/certs/ca-certificates.crt"
Resolving data.solak.de (data.solak.de)… 46.163.77.97
Connecting to data.solak.de (data.solak.de)|46.163.77.97|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 15874196160 (15G) [application/octet-stream]
Saving to: "fr_FR.tgz"

fr_FR.tgz                                  100%[======================================================================================>]  14.78G  6.21MB/s    in 41m 17s 

2021-12-20 21:51:12 (6.11 MB/s) - "fr_FR.tgz" saved [15874196160/15874196160]

❯ ls -la fr_FR.tgz
.rw-r--r-- root root 15 GB Wed Jun  2 12:36:34 2021  fr_FR.tgz
❯ du -hs fr_FR.tgz
15G     fr_FR.tgz
wasertech commented 2 years ago

import_mls.py (prev. import_mls_english.py) is also affected as the dataset is served from AWS. See

reuben commented 2 years ago

@wasertech is this still valid with your recent PRs?

wasertech commented 2 years ago

@reuben unless we changed something in the downloader, it's still an issue, yes.

After testing I can confirm it's still an issue.

wasertech commented 2 years ago

Thinking about this issue, maybe passing allow_redirects=True to requests.get in util/downloader.py would solve our downloading problems.

req = requests.get(archive_url, stream=True, allow_redirects=True) #:28
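Another angle that might be worth testing: mounting a urllib3 Retry policy on the requests session, so transient connection errors and 5xx responses are retried with backoff automatically. This is only a sketch, not something tried in this thread; `allowed_methods` requires urllib3 >= 1.26, and note that Retry does not resume a stream that dies mid-transfer, so it complements rather than replaces resume support:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries=5, backoff=0.5):
    """Build a requests.Session that retries connection errors and 5xx responses."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,          # sleeps 0.5s, 1s, 2s, ... between attempts
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=("GET",),        # only retry idempotent GETs
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

The downloader could then call `session.get(archive_url, stream=True, allow_redirects=True)` instead of the module-level `requests.get`.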

I'll have to run some tests. Nope… it did not change the behavior of the affected importers: