Closed gregoriomario closed 1 year ago
@gregoriomario: Thank you for the report, and sorry for the late reply. I will look into this further this weekend and get back to you ASAP.
Hi @gregoriomario, you can load the dataset from our codebase instead of downloading directly from the source dataset.
You can install the package using pip install -e git+https://github.com/IndoNLP/nusa-crowd.git, and then you can load the dataset using:
from nusacrowd import NusantaraConfigHelper
conhelps = NusantaraConfigHelper()
liputan6_datasets = conhelps.filtered(lambda x: ("liputan6_xtreme" in x.config.name and x.is_nusantara_schema))[0].load_dataset()
It will return a Dataset class object from the datasets package.
You can check the implementation of the dataloader here: https://github.com/IndoNLP/nusa-crowd/blob/master/nusacrowd/nusa_datasets/liputan6/liputan6.py
Hope it helps and let me know if there is any further issue.
Describe the bug
One of the files in the Liputan6 dataset, xtreme_train.json, cannot be parsed as JSON. After briefly exploring the data, it seems that the file is incomplete.
Steps to reproduce the bug
The data itself can be downloaded from the catalogue. For some reason, the data that is supposed to be loaded from Hugging Face also runs into an error during its preprocessing.
Expected results
The data is loaded successfully.
Actual results
When the file is loaded with the json package, it throws an error.
I tried to find out what causes the problem, and it turns out that the file is incomplete.
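For anyone who wants to confirm this on their own copy, here is a minimal sketch of how the check could be done with the standard json module. The helper names check_json_text and check_json_file are made up for illustration and are not part of nusacrowd; the path passed in is whatever local copy of xtreme_train.json you downloaded. If the reported offset is at or near the total length of the text, that suggests the file was cut off rather than corrupted in the middle.

```python
import json
from pathlib import Path


def check_json_text(text):
    """Try to parse a JSON string; return (ok, message).

    Hypothetical helper for illustration only.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character offset where the parser gave up.
        # An offset close to len(text) hints that the data ends early.
        message = (
            f"{err.msg} at line {err.lineno}, column {err.colno} "
            f"(offset {err.pos} of {len(text)}); "
            f"text ends with: {text[-80:]!r}"
        )
        return False, message
    return True, "parsed successfully"


def check_json_file(path):
    """Read a local file as UTF-8 (an assumption) and run the same check."""
    return check_json_text(Path(path).read_text(encoding="utf-8"))
```

Calling check_json_file on the local path of the downloaded file returns a pair such as (False, "...") when parsing fails, and the tail of the text in the message shows where the content stops.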
Environment info