facebookresearch / voxpopuli

A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

get_asr_data.py can't open output file #26

Closed dpoljak closed 3 years ago

dpoljak commented 3 years ago

Hello, I'm trying to access the Croatian transcribed ASR dataset, but I'm running into trouble. Similarly to #25, I get a broken pipe error, but it is preceded by "formats: can't open output file" errors for the transcribed_data files.

I receive the same errors for any target language; the excerpt below is from the run for English.

$ python -m voxpopuli.get_asr_data --root cro_asr --lang en
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 60.0M/60.0M [00:17<00:00, 3.53MB/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 412484/412484 [00:13<00:00, 29593.20it/s]
  0%|                                                                                                              | 0/4068 [00:00<?, ?it/s]formats: can't open output file `cro_asr/transcribed_data/en/2013/20131007-0900-PLENARY-19-en_20131007-21:26:04_7.ogg': Invalid argument
  0%|                                                                                                              | 0/4068 [00:00<?, ?it/s]
formats: can't open output file `cro_asr/transcribed_data/en/2010/20100705-0900-PLENARY-18-en_20100705-22:38:51_12.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2017/20170313-0900-PLENARY-14-en_20170313-21:48:27_5.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2009/20091126-0900-PLENARY-14-en_20091126-15:04:38_4.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2015/20151126-0900-PLENARY-14-en_20151126-12:34:33_5.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2014/20140416-0900-PLENARY-18-en_20140416-20:56:31_18.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2010/20100615-0900-PLENARY-16-en_20100615-23:12:01_2.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2010/20100907-0900-PLENARY-5-en_20100907-12:55:03_4.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2014/20140916-0900-PLENARY-19-en_20140916-22:04:17_11.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2016/20160913-0900-PLENARY-16-en_20160913-17:59:24_8.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2013/20130703-0900-PLENARY-21-en_20130703-22:09:27_2.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2013/20131022-0900-PLENARY-3-en_20131022-08:32:08_10.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2011/20110404-0900-PLENARY-16-en_20110404-23:01:37_5.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2018/20180531-0900-PLENARY-8-en_20180531-12:28:14_1.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2013/20131121-0900-PLENARY-9-en_20131121-12:47:01_1.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2009/20090205-0900-PLENARY-12-en_20090205-15:42:01_4.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2013/20130909-0900-PLENARY-17-en_20130909-20:54:22_6.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2015/20150210-0900-PLENARY-9-en_20150210-17:20:54_3.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2015/20150112-0900-PLENARY-10-en_20150112-17:59:46_6.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2013/20130116-0900-PLENARY-7-en_20130116-12:47:50_5.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2011/20110705-0900-PLENARY-5-en_20110705-12:06:20_2.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2017/20170613-0900-PLENARY-16-en_20170613-19:30:10_8.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2017/20170301-0900-PLENARY-12-en_20170301-19:05:04_2.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2011/20110308-0900-PLENARY-10-en_20110308-15:48:54_4.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2014/20140311-0900-PLENARY-16-en_20140311-19:32:30_2.ogg': Invalid argument
formats: can't open output file `cro_asr/transcribed_data/en/2010/20100616-0900-PLENARY-11-en_20100616-17:07:05_10.ogg': Invalid argument
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/concurrent/futures/process.py", line 198, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/concurrent/futures/process.py", line 198, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/run/media/dario/Local Disk/Work/voxpopuli/voxpopuli/get_asr_data.py", line 35, in cut_session
    torchaudio.save(out_path, segment, sr)
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/site-packages/torchaudio/backend/sox_io_backend.py", line 316, in save
    torch.ops.torchaudio.sox_io_save_audio_file(
RuntimeError: Error saving audio file: failed to open file cro_asr/transcribed_data/en/2013/20131007-0900-PLENARY-19-en_20131007-21:26:04_7.ogg
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/run/media/dario/Local Disk/Work/voxpopuli/voxpopuli/get_asr_data.py", line 104, in <module>
    main()
  File "/run/media/dario/Local Disk/Work/voxpopuli/voxpopuli/get_asr_data.py", line 100, in main
    get(args)
  File "/run/media/dario/Local Disk/Work/voxpopuli/voxpopuli/get_asr_data.py", line 70, in get
    multiprocess_run(items, cut_session)
  File "/run/media/dario/Local Disk/Work/voxpopuli/voxpopuli/utils.py", line 14, in multiprocess_run
    process_map(func, a_list, max_workers=n_workers, chunksize=1)
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 130, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/run/media/dario/Local Disk/Work/envs/vox2/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
RuntimeError: Error saving audio file: failed to open file cro_asr/transcribed_data/en/2013/20131007-0900-PLENARY-19-en_20131007-21:26:04_7.ogg

My environment is Python 3.8.10 on Manjaro 21.0.7 with dependencies:

Robotuks commented 3 years ago

Try deleting the whole dataset (or maybe just the year where you get the error) and starting a fresh download. That worked in my case (#25).

dpoljak commented 3 years ago

I deleted the whole repo, re-downloaded it, and installed the dependencies in a fresh conda environment. I re-downloaded all the files today and am still hitting the same error messages :confused:

formats: can't open output file '../audios/voxpopuli_asr/transcribed_data/hr/2019/20190214-0900-PLENARY-hr_20190214-15:48:48_0.ogg': Invalid argument

From what I'm reading, it might be due to my SoX installation. I'll look into it and circle back here if and when I solve it. However, any and all pointers and suggestions are welcome :pray:

dpoljak commented 3 years ago

Okay, I solved this. After testing basic sox output for these files, I found that it breaks on the file name because it contains a colon (:), which isn't supported on NTFS partitions. Moving the data to an ext4 partition and running the code with that root solved the error.
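For anyone who hits the same error, here is a minimal sketch of how one might check a target directory before running the script. It is not part of voxpopuli; the helper names supports_colon and sanitize_filename are hypothetical, and the character set is an assumption based on the usual Windows/NTFS naming restrictions. The first helper tries to create a file whose name contains colons in the chosen root; the second shows one possible workaround, replacing the unsupported characters in a file name.

```python
import os
import tempfile

# Assumption: characters commonly rejected in file names on NTFS/Windows.
_UNSAFE_CHARS = '<>:"|?*'


def supports_colon(directory: str) -> bool:
    """Return True if a file with ':' in its name can be created in `directory`."""
    name = "probe_12:34:56.tmp"  # mimics the timestamp part of the segment names
    probe = os.path.join(directory, name)
    try:
        with open(probe, "wb"):
            pass
    except OSError:
        # e.g. "Invalid argument" when the file system rejects the name
        return False
    # Make sure the file really exists under the requested name.
    created = name in os.listdir(directory)
    if created:
        os.remove(probe)
    return created


def sanitize_filename(name: str, replacement: str = "-") -> str:
    """Replace unsupported characters in a file name (not a full path)."""
    return "".join(replacement if c in _UNSAFE_CHARS else c for c in name)


if __name__ == "__main__":
    root = tempfile.mkdtemp()  # replace with the intended --root directory
    print("colon supported:", supports_colon(root))
    print(sanitize_filename("20131007-0900-PLENARY-19-en_20131007-21:26:04_7.ogg"))
```

If the probe fails for your --root, pointing --root at a partition that accepts colons (as I did with ext4) is the simplest fix. Renaming the files instead would mean patching the script locally, and anything that refers to the original segment names may then need the same mapping.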