SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
57 stars 54 forks source link

Closes #206 | Add Dataloader SleukRith Set #556

Closed akhdanfadh closed 2 months ago

akhdanfadh commented 3 months ago

Closes #206

Sorry if this PR is made before any final decision in #206, I just want to rush things.

Some possible discussions:

  1. Add another task for OCR (see #555). After that one is merged, I will remove the corresponding file here.
  2. Used third-party libraries to download the data, i.e., pip install gdown. I am aware that I should made a separate PR for this, just want to put this here first for reviewers to discuss the implementation. Related discussion is on #206.
  3. I store the downloaded data in data/sleukrith_ocr/, thus the updated .gitignore. Again, should be on a separate PR, right?
  4. The raw image dataset is in binary. I convert the image data as numpy arrays, and then save the image as a file so I can store the image path for the schema (see _generate_examples()). What do you think about that? Should I just store the image as numpy arrays instead?

Checkbox

akhdanfadh commented 3 months ago

@holylovenia Resolve some changes and added additional package handling.

This task has to be added in a separate PR.

I'll make the new PR once everything on the dataloader part is done and reviewed.

akhdanfadh commented 2 months ago

IDK why constants.py is still listed as a changed file here even tho I already changed it by running git checkout upstream/master -- seacrowd/utils/constants.py. I think I will delete the file manually then.

I think I also need to make a new PR for adding data/ in .gitignore, right? @holylovenia

holylovenia commented 2 months ago

I think I also need to make a new PR for adding data/ in .gitignore, right? @holylovenia

Yes yes, I'll approve that PR once you create it.

akhdanfadh commented 2 months ago

Yes yes, I'll approve that PR once you create it.

658 and files removed. This PR should be done on my part.

holylovenia commented 2 months ago

A friendly reminder for @akhdanfadh to address @sabilmakbar's suggestions. 🙏

akhdanfadh commented 2 months ago

@sabilmakbar Done, please check my replies.

akhdanfadh commented 2 months ago

@sabilmakbar Done with the inline comments.

sabilmakbar commented 2 months ago

lgtm, thanks @akhdanfadh!