akhdanfadh commented 3 months ago

Closes #206

Sorry if this PR comes before a final decision in #206; I just want to speed things up.

Some possible discussions:

Added another task for OCR (see #555). After that one is merged, I will remove the corresponding file here.
Used a third-party library to download the data, i.e., pip install gdown. I am aware that I should make a separate PR for this; I just want to put it here first so reviewers can discuss the implementation. Related discussion is in #206.
I store the downloaded data in data/sleukrith_ocr/, hence the updated .gitignore. Again, this should be in a separate PR, right?
The raw image dataset is in binary. I convert the image data to numpy arrays and then save each image as a file so I can store the image path in the schema (see _generate_examples()). What do you think about that? Should I just store the images as numpy arrays instead?
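
To make the last point concrete for reviewers, here is a minimal sketch of the "decode to numpy, save to file, store the path" approach. It is not the code in this PR: the byte layout (flat 8-bit grayscale with known height and width), the PGM output format, and the names decode_image, save_pgm, and generate_examples are all assumptions chosen to keep the example self-contained.

```python
from pathlib import Path
import tempfile

import numpy as np


def decode_image(raw: bytes, height: int, width: int) -> np.ndarray:
    """Interpret a flat byte buffer as an 8-bit grayscale image (assumed layout)."""
    expected = height * width
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    return np.frombuffer(raw, dtype=np.uint8).reshape(height, width)


def save_pgm(image: np.ndarray, path: Path) -> Path:
    """Write a 2-D uint8 array as a binary PGM (P5) file and return its path."""
    if image.ndim != 2 or image.dtype != np.uint8:
        raise ValueError("expected a 2-D uint8 array")
    height, width = image.shape
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        # PGM header: magic number, width and height, maximum gray value.
        f.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        f.write(image.tobytes())
    return path


def generate_examples(raw_images, out_dir: Path):
    """Yield (key, example) pairs that store an image path instead of the array."""
    for idx, (raw, height, width) in enumerate(raw_images):
        image = decode_image(raw, height, width)
        image_path = save_pgm(image, out_dir / f"{idx}.pgm")
        yield idx, {"id": str(idx), "image_path": str(image_path)}


if __name__ == "__main__":
    # Hypothetical 2x3 image, written to a temporary directory.
    out_dir = Path(tempfile.mkdtemp())
    for key, example in generate_examples([(bytes(range(6)), 2, 3)], out_dir):
        print(key, example)
```

Storing the path keeps each example small and lets downstream code load images lazily; storing the arrays directly avoids writing files but makes every example carry the full pixel data.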

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

akhdanfadh commented 3 months ago

@holylovenia Resolved the requested changes and added handling for the additional package.

This task has to be added in a separate PR.

I'll make the new PR once everything on the dataloader part is done and reviewed.

akhdanfadh commented 2 months ago

IDK why constants.py is still listed as a changed file here even though I already restored it by running git checkout upstream/master -- seacrowd/utils/constants.py. I think I will delete the file manually then.

I think I also need to make a new PR for adding data/ in .gitignore, right? @holylovenia

holylovenia commented 2 months ago

I think I also need to make a new PR for adding data/ in .gitignore, right? @holylovenia

Yes yes, I'll approve that PR once you create it.

akhdanfadh commented 2 months ago

Yes yes, I'll approve that PR once you create it.

#658 and files removed. This PR should be done on my part.

holylovenia commented 2 months ago

A friendly reminder for @akhdanfadh to address @sabilmakbar's suggestions. 🙏

akhdanfadh commented 2 months ago

@sabilmakbar Done, please check my replies.

akhdanfadh commented 2 months ago

@sabilmakbar Done with the inline comments.

sabilmakbar commented 2 months ago

lgtm, thanks @akhdanfadh!

SEACrowd / seacrowd-datahub

Closes #206 | Add Dataloader SleukRith Set #556
