akhdanfadh closed 1 month ago
Replacing @danjohnvelasco with @yongzx due to inactivity.
Done addressing @raileymontalan's reviews. Waiting for @yongzx's.
Thanks @akhdanfadh. The unit tests ran successfully on my end, and the code LGTM!
Hi @raileymontalan, do you have anything else to suggest? If @akhdanfadh has addressed all your concerns, I'm inclined to merge this PR.
Closes #537
This dataset is MASSIVE, and it seems the seacrowd test must download ALL of the data in order to pass. I downloaded and loaded all the files (~100 GB) and tested it successfully. I'm putting the result here: bud500.txt
For those with a limited internet quota, I suggest loading the data in Python's REPL with streaming=True.

Checkbox
- Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
- Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
- Implement _info(), _split_generators() and _generate_examples() in dataloader script.
- Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
- Confirm dataloader script works with datasets.load_dataset function.
- Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
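
For reviewers with a limited internet quota, the streaming suggestion in the description can be sketched roughly as below. This is only an illustration, not the snippet from the PR: the helper name peek_streaming and its loader parameter are hypothetical, the default split name "train" is an assumption, and the dataloader path in the usage comment is a placeholder. It also assumes an installed version of the datasets library that still supports script-based dataloaders.

```python
from itertools import islice


def peek_streaming(path, n=3, split="train", loader=None, **kwargs):
    """Return the first `n` examples of `split` without a full download.

    `loader` is a hypothetical hook so the helper can be tried with a stub;
    by default it falls back to `datasets.load_dataset`.
    """
    if loader is None:
        # Imported lazily so the helper can be exercised without `datasets`.
        from datasets import load_dataset as loader
    # streaming=True yields examples on the fly instead of fetching
    # every file up front.
    stream = loader(path, split=split, streaming=True, **kwargs)
    return list(islice(stream, n))


# Possible REPL usage (placeholder path; extra keyword arguments such as a
# config name are passed through to the loader unchanged):
# >>> peek_streaming("seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py", n=2)
```

Only the examples actually iterated are fetched, so a quick sanity check of the schema should cost far less than the full ~100 GB download. It does not replace the full test suite run.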