Open SamuelCahyawijaya opened 7 months ago
@SamuelCahyawijaya @holylovenia Does anyone know how to handle a RAR file? I keep getting the following error for this dataset:
line 184, in _split_generators
data_dir = dl_manager.download_and_extract(urls)
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/download/download_manager.py", line 561, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/download/download_manager.py", line 533, in extract
extracted_paths = map_nested(
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 464, in map_nested
mapped = [
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 465, in <listcomp>
_single_map_nested((function, obj, types, None, True, None))
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 367, in _single_map_nested
return function(data_struct)
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 211, in cached_path
output_path = ExtractManager(cache_dir=download_config.cache_dir).extract(
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/extract.py", line 48, in extract
self.extractor.extract(input_path, output_path, extractor_format)
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/extract.py", line 344, in extract
return extractor.extract(input_path, output_path)
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/extract.py", line 215, in extract
rf.extractall(output_path)
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/rarfile.py", line 886, in extractall
dst = self._extract_one(inf, path, pwd, not inf.is_dir())
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/rarfile.py", line 951, in _extract_one
return self._make_file(info, dstfn, pwd, set_attrs)
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/rarfile.py", line 966, in _make_file
shutil.copyfileobj(src, dst)
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/shutil.py", line 195, in copyfileobj
buf = fsrc_read(length)
File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/rarfile.py", line 2305, in read
raise BadRarFile("Failed the read enough data: req=%d got=%d" % (orig, len(data)))
rarfile.BadRarFile: Failed the read enough data: req=65536 got=51
Hi @fhudi, I've never personally tried to make a dataloader whose source is a RAR file. If it's not possible to extract it using the download manager, could you try using another library to unrar the data?
cc: @sabilmakbar @SamuelCahyawijaya in case they know better than me.
@holylovenia Just FYI, the error I posted occurred after running pip install rarfile, the default recommended by HF datasets. I will try exploring third-party libraries then. Hopefully there is one that can handle the files. Thanks. 🙏
Sorry for the wait. @sabilmakbar and I are currently discussing this. We will let you know shortly.
The extracted dataset is indeed big: the 468MB train_set.rar results in an astonishing 1.94GB train_set.csv.
@fhudi Have you tried using the rarfile module directly? I am able to extract the archive that way.
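For reference, here is a rough sketch of what "using rarfile directly" could look like. The helper name `extract_rar` and its `opener` parameter are made up for illustration and are not part of the dataloader or of HF datasets. It assumes `rarfile` is installed and that an extraction backend such as `unrar` is available on the machine, which `rarfile` generally needs for compressed archives.

```python
from pathlib import Path


def extract_rar(archive_path, output_dir, opener=None):
    """Extract an archive into output_dir and return output_dir as a Path.

    `extract_rar` and `opener` are hypothetical names for this sketch.
    `opener` is any class with a RarFile-like interface (context manager
    plus `extractall`); it defaults to `rarfile.RarFile`.
    """
    if opener is None:
        # Third-party dependency: pip install rarfile
        # Imported lazily so the helper can be exercised without it.
        import rarfile

        opener = rarfile.RarFile

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Open the archive and unpack every member into the output directory.
    with opener(str(archive_path)) as archive:
        archive.extractall(str(out))
    return out
```

Inside `_split_generators`, one option might be to fetch the file with `dl_manager.download(urls)` instead of `download_and_extract`, then pass the returned local path to this helper. Whether that avoids the `BadRarFile` error above is untested here.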
Dataloader name: id_gec/id_gec.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?id_gec