SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
64 stars, 57 forks

Closes #228 | Add M3LS dataloader #675

Closed sabilmakbar closed 4 months ago

sabilmakbar commented 4 months ago

Please name your PR title and the first line of PR message after the issue it will close. You can use the following examples: Closes #228

sabilmakbar commented 4 months ago

Heads-up for reviewers:

  1. Total data is ~14GB
  2. `_generate_examples` requires file reconstruction and validation (scraped data needs extra validation). On my machine it generates only around 3-4 examples/s, and there are ~57K examples in total.
  3. GitGuardian (secret detection) raises a warning because a GDrive API key is used to construct the download URL from GDrive (bypassing the large-file download warning). The other methods I tested failed.

Update: the code is now optimized and generates examples at a much better pace (300 examples/s on my machine).

Update 2: I found that using gdown works now. The earlier issue was probably related to URL construction, which resulted in a warning HTML page being downloaded instead of the actual file.
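
A minimal sketch of how a loader could guard against that failure mode, assuming the symptom is an HTML page saved in place of the archive. The helper names (`gdrive_uc_url`, `looks_like_html`) and the placeholder file ID are hypothetical and not taken from this PR; the commented-out call assumes gdown's `download` function with its `id` and `output` keyword arguments.

```python
def gdrive_uc_url(file_id: str) -> str:
    """Build the plain 'uc' URL for a Google Drive file ID (hypothetical helper)."""
    return f"https://drive.google.com/uc?id={file_id}"


def looks_like_html(path, probe_bytes: int = 512) -> bool:
    """Return True if the file starts like an HTML page rather than binary data.

    Only the first `probe_bytes` bytes are read, so this stays cheap on multi-GB files.
    """
    with open(path, "rb") as fh:
        head = fh.read(probe_bytes).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


# Hypothetical usage (needs `pip install gdown` and network access):
#
#   import gdown
#   out = gdown.download(id="<FILE_ID>", output="m3ls.zip", quiet=False)
#   if out is None or looks_like_html(out):
#       raise RuntimeError("Got an HTML warning page instead of the data file")
```

Checking the first bytes right after the download fails fast, before any time is spent on file reconstruction.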

gitguardian[bot] commented 4 months ago

✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend you revoke them. Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.

sabilmakbar commented 4 months ago

Force-pushed to remove the secrets from earlier commits.

holylovenia commented 4 months ago

Hi @patrickamadeus, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @sabilmakbar

sabilmakbar commented 4 months ago

Done making changes, please have a look @fhudi

fhudi commented 4 months ago

@sabilmakbar Cool, thanks. The test is currently running; since it is a 15.9GB file, please wait a while.

fhudi commented 4 months ago

@sabilmakbar Other than the above-mentioned issue, everything else works fine. It passed the tests and the review checklist. 🙏

sabilmakbar commented 4 months ago

Pillow library is required. Shall we add try-except for checking?

Hi @fhudi, I thought PIL was included in the SEACrowd requirements (but in fact it isn't). I'll add a try-except check for now so that users don't download the data only to hit an error when generating the dataset because the PIL library is missing. Thanks for pointing this out.
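
A rough sketch of what such a check might look like, assuming it runs before the download starts. `require_module` is a hypothetical helper, not the code that was added in this PR; it relies only on the standard-library `importlib`.

```python
import importlib


def require_module(module_name: str, pip_name: str):
    """Import `module_name`, or raise an ImportError that names the pip package."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"This dataloader needs '{module_name}'. "
            f"Install it with `pip install {pip_name}`."
        ) from err


# Hypothetical usage at the start of the download step, so a missing
# dependency is reported before any data is fetched:
#
#   PIL = require_module("PIL", "Pillow")
```

Running the check up front means a missing Pillow install surfaces immediately instead of after the full download.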