Rumeysakeskin / Speech-Datasets-for-ASR

Download speech datasets (English and non-English) for Automatic Speech Recognition

Error when reading tgz_prompt_file #1

Open jingru-lin opened 1 year ago

jingru-lin commented 1 year ago

Hi, when I try to download German by setting VOXFORGE_URL_16kHz = 'http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/', I get the following error:

Traceback (most recent call last):
  File "download_voxforge_dataset.py", line 184, in <module>
    prepare_sample(f.replace(".tgz", ""), VOXFORGE_URL_16kHz + f, target_dir)
  File "download_voxforge_dataset.py", line 139, in prepare_sample
    transcriptions = open(tgz_prompt_file).read().strip().split("\n")
  File "/home/jingru/anaconda3/envs/text_ss/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 144: invalid start byte

This occurs for some of the files that contain characters which cannot be decoded as UTF-8, for example: "ralfherzog-20070819_de2/mfc/de2-02 DIESE SICHERHEITSLüCKEN SIND BISHER UNBEKANNT". The "ü" appears to be stored as a single byte (0xfc) that is not valid UTF-8. May I know how to solve this?

Thank you!

Rumeysakeskin commented 1 year ago

@LevanaRu, maybe decoding with errors ignored works for you: all_files = re.findall("href\=\"(.*\.tgz)\"", content.decode("utf-8", 'ignore'))
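
Note that the traceback points at the open(tgz_prompt_file) call on line 139 rather than at the regex over the index page, so the prompt file itself may also need a tolerant read. Below is a minimal sketch, not the repository's own code: read_prompts is a hypothetical helper, and the Latin-1 fallback is an assumption based on byte 0xfc being "ü" in that encoding. It tries UTF-8 first and only falls back if decoding fails.

```python
def read_prompts(path):
    """Read a prompts file and return its lines.

    Hypothetical helper: tries UTF-8 first, then falls back to Latin-1.
    """
    # Read raw bytes so that we control the decoding step ourselves.
    with open(path, "rb") as fh:
        raw = fh.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: files that are not valid UTF-8 are Latin-1 encoded
        # (in Latin-1, byte 0xfc is "ü"). Latin-1 decoding never fails,
        # but it will produce wrong characters if the real encoding differs.
        text = raw.decode("latin-1")
    return text.strip().split("\n")
```

If this fits, the line in prepare_sample could become transcriptions = read_prompts(tgz_prompt_file). Unlike decode("utf-8", "ignore"), which silently drops undecodable bytes, this keeps the "ü" in the transcription.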