Closed sabilmakbar closed 4 months ago
Heads-up for reviewers: `_generate_examples` requires file reconstruction and validation (scraped data needs more validation than usual). On my machine it originally generated only around 3-4 examples/s, and there are ~57K examples in total.
Update: the code is now optimized and creates the examples at a much better pace (300 ex/s on my machine).
Update 2: I find that using `gdown` now works. The earlier issue was probably related to URL construction, which resulted in a warning HTML page being downloaded instead of the actual file.
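For reviewers who hit the same download problem, here is a minimal sketch of a sanity check for it. It only inspects the first bytes of a downloaded file to see whether it looks like an HTML page rather than the expected archive. The helper name `looks_like_html` and the `probe_bytes` parameter are hypothetical and not part of this PR or of `gdown`.

```python
def looks_like_html(path, probe_bytes=512):
    """Return True if the file at `path` starts like an HTML page.

    Hypothetical helper: it reads only the first `probe_bytes` bytes,
    so it stays cheap even for multi-gigabyte downloads.
    """
    with open(path, "rb") as fh:
        head = fh.read(probe_bytes)
    # Ignore leading whitespace and letter case before matching the markers.
    head = head.lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

A dataloader could call such a check right after downloading and raise a clear error early, instead of failing later during file reconstruction.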
If these secrets were true positives and are still valid, we highly recommend that you revoke them. Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Force-pushed to remove the secrets from earlier commits.
Hi @patrickamadeus, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
cc: @sabilmakbar
done making changes, pls have a look @fhudi
@sabilmakbar cool, thanks. The test is currently running. Since it is a 15.9GB file, please wait for a while.
@sabilmakbar Other than the above-mentioned issue, everything else works fine. It passed the test and the review checklist. 🙏
The Pillow library is required. Shall we add a try-except to check for it?
Hi @fhudi, I thought `PIL` was included in the SEACrowd requirements (but in fact it isn't). I'll add a try-except for now so that users don't download the data only to hit an error when generating the dataset because the `PIL` lib is missing. Thanks for pointing this out.
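The kind of import check discussed above could look roughly like the sketch below. It is an illustration only, not the code that was added to this PR; the helper name `require_module` and its parameters are hypothetical.

```python
import importlib


def require_module(module_name, pip_name):
    """Import `module_name`, or raise an ImportError with an install hint.

    Hypothetical helper: `pip_name` is the package name to suggest,
    which can differ from the import name (e.g. PIL vs. Pillow).
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"The '{module_name}' module is required by this dataloader. "
            f"Install it with `pip install {pip_name}`."
        ) from err


# Assumed usage near the top of a dataloader script, before any download:
# Image = require_module("PIL.Image", "Pillow")
```

Running the check at import time means the error surfaces before the 15.9GB download starts, rather than during example generation.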
Please name your PR title and the first line of PR message after the issue it will close. You can use the following examples: Closes #228
Checkbox
- Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within the `{my_dataset}` folder.
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- Implement `_info()`, `_split_generators()` and `_generate_examples()` in the dataloader script.
- Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Confirm that the dataloader script works with the `datasets.load_dataset` function.
- Confirm that the dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
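As a quick pre-review aid, the variable and method items in the checklist can be checked statically. The sketch below is not part of the SEACrowd test suite and does not replace `tests.test_seacrowd`; the function name `missing_checklist_items` is hypothetical, and it only looks for top-level assignments and methods defined directly in a class body.

```python
import ast
from typing import List

# Names taken from the checklist above.
REQUIRED_VARIABLES = [
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_LOCAL", "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_SEACROWD_VERSION",
]
REQUIRED_METHODS = ["_info", "_split_generators", "_generate_examples"]


def missing_checklist_items(source: str) -> List[str]:
    """Return the checklist names that are absent from a dataloader's source."""
    tree = ast.parse(source)
    assigned, methods = set(), set()
    for node in tree.body:
        if isinstance(node, ast.Assign):
            # Plain module-level assignments such as `_LOCAL = False`.
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned.add(target.id)
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            # Annotated assignments such as `_LOCAL: bool = False`.
            assigned.add(node.target.id)
        elif isinstance(node, ast.ClassDef):
            # Methods defined directly in any top-level class body.
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    methods.add(item.name)
    missing = [name for name in REQUIRED_VARIABLES if name not in assigned]
    missing += [name + "()" for name in REQUIRED_METHODS if name not in methods]
    return missing
```

Passing the text of a dataloader file to this function returns an empty list when every listed name is present; schema and `load_dataset` behavior still need the real test command.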