DCASE-REPO / dcase2018_baseline

DCASE 2018 Baseline systems
MIT License

Task 4 download: terminates on unicode error #4

Closed: danstowell closed this issue 6 years ago

danstowell commented 6 years ago

Hi - I tried running the Task 4 code, using Python 2.7. (The README says that's OK, even though the testing was done with Python 3.)

I first had a problem with relative imports, which I worked around. Then I ran the script and after a few hours of download I got this error:

 'ascii' codec can't encode character u'\u3010' in position 46: ordinal not in range(128)

I was not given a full backtrace, just that line, so I don't know where in the code this error happened.

I don't know if the error is caused by (a) Python 2/3 differences in unicode handling, or (b) some bad input data that might need fixing or protecting against.

turpaultn commented 6 years ago

I tried to reproduce your error but have not succeeded so far. I think it may come from dataset/download_data.py, where I call str(e) on lines 96 and 103. Could you please replace str(e) with "Error", for example, and tell me if it still happens?
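
If str(e) is indeed the culprit, one way to keep the error text without risking an encoding failure is to escape non-ASCII characters before logging. The sketch below is for Python 3 only; safe_message is a hypothetical helper, not something that exists in download_data.py. Under Python 2, str(e) can itself raise on a unicode message, so the conversion would need separate handling there.

```python
def safe_message(exc):
    """Return an ASCII-only description of an exception (hypothetical helper)."""
    text = "{}: {}".format(type(exc).__name__, exc)
    # Non-ASCII characters become escapes such as \u3010 instead of raising.
    return text.encode("ascii", "backslashreplace").decode("ascii")


if __name__ == "__main__":
    try:
        # Simulate an error whose message contains a non-ASCII video title.
        raise ValueError("cannot download \u3010title\u3011")
    except ValueError as e:
        print(safe_message(e))
```

The message stays readable in an ASCII-only log, and the original character can still be recovered from its escape.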

turpaultn commented 6 years ago

Thank you for pointing out the relative import. The __init__.py file was missing in the dataset folder; I have updated the repo. Please tell me if it came from another problem.

danstowell commented 6 years ago

Hi, thanks. Right now I'm getting a different failure, which prevents the download (this time with Python 3.5.2):

 % python download_data.py
[I] Download_data
[I] Once database is downloaded, do not forget to check your missing_files
[I] Train, unlabel out of domain data
  4%|██████                                                                                                                                                | 101/2494 [00:41<16:22,  2.44it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "download_data.py", line 60, in download_file
    'https://www.youtube.com/watch?v={query_id}'.format(query_id=query_id), download=True)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 800, in extract_info
    return self.process_ie_result(ie_result, download, extra_info)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 854, in process_ie_result
    return self.process_video_result(ie_result, download=download)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 1627, in process_video_result
    self.process_info(new_info)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 1900, in process_info
    success = dl(filename, info_dict)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 1839, in dl
    return fd.download(name, info)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/downloader/common.py", line 365, in download
    return self.real_download(filename, info_dict)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/downloader/dash.py", line 25, in real_download
    self._prepare_and_start_frag_download(ctx)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/downloader/fragment.py", line 69, in _prepare_and_start_frag_download
    self._prepare_frag_download(ctx)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/downloader/fragment.py", line 160, in _prepare_frag_download
    self._read_ytdl_file(ctx)
  File "/home/dans/dev/github/dcase2018_baseline/task4/venv3/lib/python3.5/site-packages/youtube_dl/downloader/fragment.py", line 78, in _read_ytdl_file
    ctx['fragment_index'] = json.loads(stream.read())['downloader']['current_fragment']['index']
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
"""

I tried this twice and the same thing happened. The query_id that caused this error was oQ-0NQU1hVA
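
The last frames of the traceback show json.loads failing at character 0 while reading a .ytdl file. Empty input is one input that produces exactly this message from the standard json module (it is not the only one). The helper describe_json_failure below is hypothetical and only illustrates that behaviour:

```python
import json


def describe_json_failure(text):
    """Return the JSONDecodeError message for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    # An empty string stands in for the contents of an empty state file.
    print(describe_json_failure(""))
```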

ankitshah009 commented 6 years ago

Hi - could you please post your environment configuration details so that we can reproduce the error? With my current DCASE testing environment, I have not encountered this error.

danstowell commented 6 years ago

Ubuntu 16.04.4 LTS

Python 3.5.2

Modules installed using venv and pip. Here's the output of pip freeze:

absl-py==0.2.0
astor==0.6.2
audioread==2.1.5
bleach==1.5.0
certifi==2018.4.16
cffi==1.11.5
chardet==3.0.4
colorama==0.3.9
cycler==0.10.0
dcase-util==0.2.0
decorator==4.3.0
future==0.16.0
gast==0.2.0
grpcio==1.11.0
h5py==2.7.1
html5lib==0.9999999
idna==2.6
joblib==0.11
Keras==2.1.6
kiwisolver==1.0.1
librosa==0.6.0
llvmlite==0.23.0
Markdown==2.6.11
matplotlib==2.2.2
msgpack-python==0.5.6
numba==0.38.0
numpy==1.14.2
pafy==0.5.4
pandas==0.22.0
pkg-resources==0.0.0
protobuf==3.5.2.post1
pycparser==2.18
pydot-ng==1.0.0
pyparsing==2.2.0
python-dateutil==2.7.2
python-magic==0.4.15
pytz==2018.4
PyYAML==3.12
requests==2.18.4
resampy==0.2.0
scikit-learn==0.19.1
scipy==1.0.1
sed-eval==0.2.0
six==1.11.0
SoundFile==0.10.2
tensorboard==1.7.0
tensorflow==1.7.0
termcolor==1.1.0
titlecase==0.12.0
tqdm==4.23.0
ujson==1.35
urllib3==1.22
validators==0.12.1
Werkzeug==0.14.1
youtube-dl==2018.4.25

turpaultn commented 6 years ago

Thank you, I do not see any problem in your installation. However, this kind of error is produced by youtube-dl, and I do not know why. Here is a similar report, which you have probably seen: https://github.com/rg3/youtube-dl/issues/11018 In that one, an ExtractorError was also raised, which is caught in download_data.py. I do not know why that is not the case in your environment. I had youtube-dl 2018.4.9; I have just updated it in my Python 3.5 environment and am trying to reproduce your error, so far without success. I will tell you if it happens. To get past this error, you can add it to the exceptions caught in download_data.py (line 92); then we would be able to see how often it appears, or whether it affects only this query.
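
As a rough illustration of that suggestion, the sketch below wraps a download call so that a failure is returned as data instead of propagating out of the worker. It does not reproduce the actual code of download_data.py: fake_download and download_or_report are hypothetical names, and the real youtube-dl call is replaced by a stand-in so the example runs on its own.

```python
import json


def fake_download(query_id):
    """Stand-in for the real youtube-dl call (hypothetical)."""
    if query_id == "bad":
        # Mimic the failure in the traceback: parsing an empty state file.
        json.loads("")
    return query_id + ".wav"


def download_or_report(query_id, download=fake_download):
    """Return (query_id, result, error) instead of letting the worker raise."""
    try:
        return query_id, download(query_id), None
    except (ValueError, OSError) as err:
        # json.JSONDecodeError is a subclass of ValueError, so it is caught here.
        return query_id, None, "{}: {}".format(type(err).__name__, err)


if __name__ == "__main__":
    results = [download_or_report(q) for q in ["ok1", "bad", "ok2"]]
    failed = [(q, e) for q, _, e in results if e is not None]
    print(failed)
```

Collecting the failed IDs this way would also show how often the error occurs across queries.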

danstowell commented 6 years ago

I discovered that the relevant file, ./dataset/tmp/oQ-0NQU1hVA.m4a.ytdl, is in fact a zero-byte file. If I delete the zero-byte file and re-run download_data.py, then... actually the same problem happens with a different file. If I delete ALL the files in tmp (there were lots of zero-byte files, which could have been caused by a disk-full issue) and re-run, it works.

I agree that the cause is probably that youtube-dl is not gracefully handling the unexpected input.

I guess at least we have a workaround: delete zero-byte files from tmp and restart.
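
That workaround could be scripted along these lines. This is a minimal sketch, not part of the repository: remove_empty_files is a hypothetical helper, and the dataset/tmp path is taken from the file name quoted above, relative to wherever download_data.py is run.

```python
import os


def remove_empty_files(folder):
    """Delete zero-byte regular files directly inside folder; return their names."""
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            os.remove(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    tmp_dir = os.path.join("dataset", "tmp")  # assumed location, see above
    if os.path.isdir(tmp_dir):
        print(remove_empty_files(tmp_dir))
```

Only empty files are removed, so partially downloaded fragments that are still valid are left in place.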