IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Liputan6 xtreme_train.json file incomplete #338

Closed gregoriomario closed 1 year ago

gregoriomario commented 1 year ago

Describe the bug

One of the files in the Liputan6 dataset, xtreme_train.json, is incomplete. After briefly exploring the data, it appears that the file is truncated, which makes it impossible to parse as JSON.

Steps to reproduce the bug

The data itself can be downloaded from the catalogue. For some reason, the data that is supposed to be loaded from Hugging Face also runs into an error in its preprocessing.

import json
import os

FILE_PATH = os.path.join('downstream_task_datasets', 'IndoNLG_downstream_tasks','liputan6','xtreme_train.json')

with open(FILE_PATH, 'r') as reader:
    json.load(reader)

Expected results

The data is loaded successfully.

Actual results

When loaded with the json package, it throws the error below:

JSONDecodeError                           Traceback (most recent call last)
d:\Titan\coding\AI\dataset\summarization\liputan6.ipynb Cell 3 in <cell line: 1>()
      1 with open(FILE_PATH, 'r') as reader:
----> 2     json.load(reader)

File c:\Users\Titan\anaconda3\lib\json\__init__.py:293, in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    274 def load(fp, *, cls=None, object_hook=None, parse_float=None,
    275         parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    276     """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
    277     a JSON document) to a Python object.
    278 
   (...)
    291     kwarg; otherwise ``JSONDecoder`` is used.
    292     """
--> 293     return loads(fp.read(),
    294         cls=cls, object_hook=object_hook,
    295         parse_float=parse_float, parse_int=parse_int,
    296         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

File c:\Users\Titan\anaconda3\lib\json\__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    341     s = s.decode(detect_encoding(s), 'surrogatepass')
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
...
--> 353     obj, end = self.scan_once(s, idx)
    354 except StopIteration as err:
    355     raise JSONDecodeError("Expecting value", s, err.value) from None

JSONDecodeError: Unterminated string starting at: line 1 column 296549065 (char 296549064)
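A shorter way to confirm this kind of failure, without reading the full traceback, is to catch the exception and compare the error position with the total length of the text. The sketch below is only illustrative: `describe_json_error` and its `context` parameter are hypothetical names, and it assumes the whole file fits in memory as a single string.

```python
import json


def describe_json_error(text, context=40):
    """Return None if `text` parses as JSON, else a short summary of the error.

    Hypothetical helper, not part of nusa-crowd or the json module.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character index reported in the error message.
        start = max(0, err.pos - context)
        return {
            "message": err.msg,
            "pos": err.pos,
            "total_chars": len(text),
            "context": text[start:err.pos + context],
        }
    return None
```

If the reported message is an unterminated string and the text simply ends without a closing quote or bracket, the file was most likely cut off during download or upload rather than corrupted in the middle.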

I tried to find out what causes the problem, and it turns out that the file is incomplete:

>>> with open(FILE_PATH, 'r') as reader:
...     data = reader.read()
...
>>> data[-1500:]
'kan di Pelabuhan Ulhele , .... 26 Juni mendatang ."}, {"id": 233622, "text": "Liputan6 . com , Jakarta : ........ orang yang berduit dan oknu'
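As a possible workaround until a complete file is available, the records that were fully written can be recovered. This is a minimal sketch, assuming the file is a single top-level JSON array of objects (which the tail shown above suggests but does not prove). `salvage_records` is a hypothetical helper name; it relies on `json.JSONDecoder.raw_decode` from the standard library to parse one element at a time and stops at the first incomplete one.

```python
import json


def salvage_records(text):
    """Return the complete objects from a possibly truncated JSON array.

    Hypothetical helper; assumes `text` looks like '[{...}, {...}, {...'.
    """
    decoder = json.JSONDecoder()
    records = []
    idx = text.find('[') + 1  # start just after the opening bracket
    while idx < len(text):
        # raw_decode does not skip separators, so skip whitespace and commas.
        while idx < len(text) and text[idx] in ' \t\r\n,':
            idx += 1
        if idx >= len(text) or text[idx] == ']':
            break
        try:
            obj, idx = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            break  # first incomplete record: stop here
        records.append(obj)
    return records
```

Calling `salvage_records(data)` on the string read above would return every record before the truncated one. The last, partially written record is dropped, so the result is still not the full training set.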

Environment info

SamuelCahyawijaya commented 1 year ago

@gregoriomario : Thank you for the report and sorry for the late reply. I will check this one further this weekend. I will get back to you ASAP.

SamuelCahyawijaya commented 1 year ago

Hi @gregoriomario, you can load the dataset from our codebase instead of downloading directly from the source dataset.

You can install the package using pip install -e git+https://github.com/IndoNLP/nusa-crowd.git, and then load the dataset using:

from nusacrowd import NusantaraConfigHelper
conhelps = NusantaraConfigHelper()
liputan6_datasets = conhelps.filtered(lambda x: ("liputan6_xtreme" in x.config.name and x.is_nusantara_schema))[0].load_dataset()

It will return a Dataset class object from the datasets package, as shown in the screenshot below:

[Screenshot 2023-03-05 at 10 00 54 AM]

You can check the implementation of the dataloader here: https://github.com/IndoNLP/nusa-crowd/blob/master/nusacrowd/nusa_datasets/liputan6/liputan6.py

Hope it helps and let me know if there is any further issue.