Open wasertech opened 2 years ago
Just to be clear, this repros with pure wget here, so it is likely a server-side issue. Nevertheless, I think making the downloader code more resilient to failures is a nice goal.
Exactly, we need a more robust downloader. bin/import_m-ailabs.py is also affected.
root@95a2d562414f:/code/data/lm# python /code/bin/import_m-ailabs.py --skiplist monsieur_lecoq,les_mysteres_de_paris --language fr_FR /mnt/extracted/data/M-AILABS/2021-12-20
17:38:53.774066: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING: No --validate_label_locale specified, you might end with inconsistent dataset.
No path "/mnt/extracted/data/M-AILABS" - creating ...
No archive "/mnt/extracted/data/M-AILABS/fr_FR.tgz" - downloading...
8343it [00:00, 16904868.73it/s]
No directory "/mnt/extracted/data/M-AILABS/fr_FR" - extracting archive...
Traceback (most recent call last):
File "/code/bin/import_m-ailabs.py", line 242, in <module>
_download_and_preprocess_data(target_dir=CLI_ARGS.target_dir)
File "/code/bin/import_m-ailabs.py", line 37, in _download_and_preprocess_data
_maybe_extract(target_dir, ARCHIVE_DIR_NAME, archive_path)
File "/code/bin/import_m-ailabs.py", line 49, in _maybe_extract
tar = tarfile.open(archive_path)
File "/usr/lib/python3.6/tarfile.py", line 1576, in open
raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully
tarfile throws a ReadError here because the downloader doesn't actually download the archive; it only writes 8343 bytes:
No archive "/mnt/extracted/data/M-AILABS/fr_FR.tgz" - downloading...
8343it [00:00, 16904868.73it/s]
Which produces:
root@95a2d562414f:/mnt/extracted/data/M-AILABS/2021-12-20# ls -l fr_FR.tgz
-rw-r--r-- 1 root root 8343 Dec 20 19:38 fr_FR.tgz
root@95a2d562414f:/mnt/extracted/data/M-AILABS/2021-12-20# du -hs fr_FR.tgz
12K fr_FR.tgz
wget --continue https://data.solak.de/data/Training/stt_tts/fr_FR.tgz
Which should produce:
❯ sudo wget --continue https://data.solak.de/data/Training/stt_tts/fr_FR.tgz
[sudo] password for waser:
--2021-12-20 21:09:55-- https://data.solak.de/data/Training/stt_tts/fr_FR.tgz
SSL_INIT
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving data.solak.de (data.solak.de)... 46.163.77.97
Connecting to data.solak.de (data.solak.de)|46.163.77.97|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15874196160 (15G) [application/octet-stream]
Saving to: 'fr_FR.tgz'
fr_FR.tgz 100%[======================================================================================>] 14.78G 6.21MB/s in 41m 17s
2021-12-20 21:51:12 (6.11 MB/s) - 'fr_FR.tgz' saved [15874196160/15874196160]
❯ ls -la fr_FR.tgz
.rw-r--r-- root root 15 GB Wed Jun 2 12:36:34 2021 fr_FR.tgz
❯ du -hs fr_FR.tgz
15G fr_FR.tgz
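The size mismatch above (8343 bytes on disk versus the 15874196160 bytes wget reports) could be caught before extraction instead of surfacing as a tarfile.ReadError. Below is a minimal sketch of such a check, not the importer's actual code: `check_archive` and its `expected_size` parameter are hypothetical names, and it assumes the caller can obtain the expected size, for example from the server's Content-Length header.

```python
import os
import tarfile


def check_archive(archive_path, expected_size=None):
    """Return a list of problems found with a downloaded tar archive.

    An empty list means no problem was detected. `expected_size` is an
    assumption: the caller would have to supply it, e.g. from Content-Length.
    """
    problems = []

    # A truncated download shows up as a size mismatch.
    actual_size = os.path.getsize(archive_path)
    if expected_size is not None and actual_size != expected_size:
        problems.append(
            f"size mismatch: got {actual_size} bytes, expected {expected_size}"
        )

    # is_tarfile() returns False instead of raising when the file cannot be
    # opened as a (possibly compressed) tar archive.
    if not tarfile.is_tarfile(archive_path):
        problems.append("not a readable tar archive")

    return problems
```

Note that tarfile.is_tarfile only needs to read the start of the file, so a download that was cut off partway may still pass it; the size comparison is what would catch that case.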
import_mls.py (prev. import_mls_english.py) is also affected as the dataset is served from AWS. See
@wasertech is this still valid with your recent PRs?
@reuben unless we changed something in the downloader, it's still an issue yes.
After testing I can confirm it's still an issue.
Thinking about this issue, maybe using allow_redirects=True for requests.get in util/downloader.py might solve our downloading problems.
req = requests.get(archive_url, stream=True, allow_redirects=True) #:28
I'll have to run some tests.
Nope.. it did not change the behavior of the affected importers:
Can't download LinguaLibre using import script
Problem
When you try to download LinguaLibre (at least for the French part), the download stream stops at 51% every time.
How to reproduce
Just run the following command:
Expected behavior
The compressed data should be entirely downloaded.
Hot fix
You can manually download the data using:
wget --continue https://lingualibre.org/datasets/Q21-fra-French.zip
After chatting with @lissyx, we concluded that a proper fix needs more than adding the --continue flag to wget, as Coqui STT uses util/downloader.py to download data.
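For illustration, here is a rough sketch of what wget --continue style resuming could look like in Python. It is not the code in util/downloader.py: `resume_download` and its `opener` parameter are hypothetical names, it uses the standard library's urllib rather than requests to stay dependency-free, and it assumes the server honours HTTP Range requests (answering with status 206).

```python
import os
import urllib.request


def resume_download(url, dest, opener=urllib.request.urlopen, chunk_size=1 << 20):
    """Download `url` to `dest`, resuming from a partial file if one exists.

    Returns the final size of `dest` in bytes. `opener` is injectable so the
    function can be exercised without network access (hypothetical design).
    """
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0

    request = urllib.request.Request(url)
    if offset:
        # Ask only for the bytes we do not have yet.
        request.add_header("Range", f"bytes={offset}-")

    with opener(request) as response:
        # 206 means the server honoured the Range header; any other success
        # status (typically 200) means it is sending the whole file again.
        resumed = offset > 0 and response.status == 206
        mode = "ab" if resumed else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)

    return os.path.getsize(dest)
```

A caller could invoke this in a loop until the size on disk matches the expected size. The sketch does not handle the case where the local file is already complete (servers usually answer such a Range request with 416, which urlopen raises as an HTTPError), nor does it retry on connection errors.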