SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Indonesian GEC framework #622

Open SamuelCahyawijaya opened 5 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: id_gec/id_gec.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?id_gec

Dataset: id_gec
Description: A large corpus of the Indonesian language that can be utilized for evaluating the next Indonesian GEC task. The parallel data was generated with a semi-supervised confusion method. The data includes common errors made by Indonesian second language learners and native speakers, including syntax errors, spelling errors, and semantic errors.
Subsets: -
Languages: ind
Tasks: Grammatical Error Correction
License: Unknown (unknown)
Homepage: https://github.com/Almangiri/Indonesian-GEC-framework/tree/main
HF URL: -
Paper URL: -

fhudi commented 5 months ago

self-assign

fhudi commented 5 months ago

@SamuelCahyawijaya @holylovenia Does anyone know how to handle a RAR file? I keep getting the following error for this dataset:

  line 184, in _split_generators
    data_dir = dl_manager.download_and_extract(urls)
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/download/download_manager.py", line 561, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/download/download_manager.py", line 533, in extract
    extracted_paths = map_nested(
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 464, in map_nested
    mapped = [
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 465, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 367, in _single_map_nested
    return function(data_struct)
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 211, in cached_path
    output_path = ExtractManager(cache_dir=download_config.cache_dir).extract(
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/extract.py", line 48, in extract
    self.extractor.extract(input_path, output_path, extractor_format)
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/extract.py", line 344, in extract
    return extractor.extract(input_path, output_path)
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/utils/extract.py", line 215, in extract
    rf.extractall(output_path)
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/rarfile.py", line 886, in extractall
    dst = self._extract_one(inf, path, pwd, not inf.is_dir())
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/rarfile.py", line 951, in _extract_one
    return self._make_file(info, dstfn, pwd, set_attrs)
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/rarfile.py", line 966, in _make_file
    shutil.copyfileobj(src, dst)
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/shutil.py", line 195, in copyfileobj
    buf = fsrc_read(length)
  File "/usr/local/anaconda3/envs/seacrowd/lib/python3.10/site-packages/rarfile.py", line 2305, in read
    raise BadRarFile("Failed the read enough data: req=%d got=%d" % (orig, len(data)))
rarfile.BadRarFile: Failed the read enough data: req=65536 got=51

holylovenia commented 5 months ago

Hi @fhudi, I've never personally tried to make a dataloader whose source is a rar file. If it's not possible to extract using the download manager, could you try using other libraries to unrar the data?

cc: @sabilmakbar @SamuelCahyawijaya in case they know better than me.

fhudi commented 5 months ago

@holylovenia Just FYI, the error I posted occurred after I had already run pip install rarfile, which is the default that HF datasets recommends. I will explore third-party libraries then. Hopefully one of them can handle the files. Thanks. 🙏
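
One thing that may be worth ruling out before switching libraries: pip install rarfile installs only the Python wrapper, and the rarfile documentation says compressed members are extracted by calling an external tool (unrar is preferred, with unar, 7zip, or bsdtar as alternatives). Whether a missing or failing backend is what causes the BadRarFile error above is an assumption, not something this thread confirms. The sketch below only reports which candidate tools are on PATH. The function name find_rar_backends is hypothetical, and the command names are assumptions that can differ by platform.

```python
import shutil

# Command names are assumptions; the actual binary names vary by platform and package.
CANDIDATE_TOOLS = ("unrar", "unar", "7z", "bsdtar")


def find_rar_backends(tools=CANDIDATE_TOOLS):
    """Map each candidate tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in find_rar_backends().items():
        print(f"{name}: {path or 'not found'}")
```

If every entry comes back as not found, installing one of these tools and retrying the loader would be a cheap first experiment.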

holylovenia commented 4 months ago

Sorry for the wait. @sabilmakbar and I are currently discussing this. We will let you know shortly.

akhdanfadh commented 4 months ago

The extracted dataset is indeed big: 468MB train_set.rar results in an astonishing 1.94GB train_set.csv.

@fhudi Have you tried using the rarfile module directly? I was able to extract the archive that way.
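
A minimal sketch of what direct extraction with rarfile could look like, assuming the archive has already been downloaded to a local path. The helper name extract_rar and the example paths are hypothetical; only RarFile, namelist, and extractall come from the rarfile package. This is not necessarily the exact code used for the extraction mentioned above.

```python
from pathlib import Path


def extract_rar(archive_path, out_dir):
    """Extract every member of a RAR archive into out_dir and return the member paths."""
    # Imported lazily so this module still imports when rarfile is not installed.
    import rarfile

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with rarfile.RarFile(str(archive_path)) as rf:
        names = rf.namelist()             # member names stored in the archive
        rf.extractall(path=str(out_dir))  # may rely on an external tool such as unrar

    return [out_dir / name for name in names]


if __name__ == "__main__":
    # Hypothetical local paths, for illustration only.
    for member in extract_rar("train_set.rar", "extracted"):
        print(member)
```

Inside a dataloader, one option would be to fetch the archive with dl_manager.download(url) instead of download_and_extract and pass the returned local path to a helper like this one. Given the 1.94GB CSV, the output directory needs enough free disk space.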