Closed akhdanfadh closed 2 months ago
Have you accepted the acknowledgement form at https://huggingface.co/datasets/ontocord/CulturaY ? It seems to work on my end. It's just that there are quite a few subsets, and they need some time--and space! :( --to run.
Cannot access gated repo for URL https://huggingface.co/api/datasets/ontocord/CulturaY. Access to dataset ontocord/CulturaY is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/ontocord/CulturaY to ask for access.
Exactly as @khelli07 said, @MJonibek needs to accept the acknowledgment in the dataset repo. I already described this in the description section of the dataloader, actually.
> It's just that there are quite a few subsets, and they need some time--and space! :( --to run.
IKR, the dataset is huge. Not sure if there is anything I can improve here.
A friendly reminder for @akhdanfadh to address @MJonibek and @khelli07's reviews.
@MJonibek Done! Also, please add your review or accept if LGTY @khelli07
Closes #535
I implemented one config per language/subset. Thus, configs will look like this: `culturay_id_source`, `culturay_my_seacrowd_ssp`, etc. When testing, pass `culturay_<subset>` to the `--subset_id` parameter.

Because the dataset is large, testing will take time.
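To make the naming scheme concrete, here is a minimal sketch of how the config names and the `--subset_id` value could be derived from a subset code. The helper names (`config_names`, `subset_id`) and the `SCHEMAS` tuple are hypothetical and only mirror the two example names given above; the actual dataloader may build its configs differently.

```python
from typing import List

DATASETNAME = "culturay"  # prefix seen in the example config names

# Assumption: every subset gets exactly these two schema suffixes,
# inferred from `culturay_id_source` and `culturay_my_seacrowd_ssp`.
SCHEMAS = ("source", "seacrowd_ssp")


def config_names(subsets: List[str]) -> List[str]:
    """Return one config name per (subset, schema) pair."""
    return [
        f"{DATASETNAME}_{subset}_{schema}"
        for subset in subsets
        for schema in SCHEMAS
    ]


def subset_id(subset: str) -> str:
    """Return the value to pass to `--subset_id` for one subset."""
    return f"{DATASETNAME}_{subset}"


if __name__ == "__main__":
    for name in config_names(["id", "my"]):
        print(name)
    print(subset_id("id"))
```

With this scheme, testing only the Indonesian subset would mean passing `culturay_id` to `--subset_id`, which keeps a single test run limited to one language instead of the whole dataset.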
Checkbox

- Create the dataloader script `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Confirm the dataloader script works with the `datasets.load_dataset` function.
- Confirm that the dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.
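As a rough aid for the checklist above, the following sketch statically checks whether a dataloader's source text defines the listed module-level variables and the three methods. It is a hypothetical helper, not part of the SEACrowd test suite; it uses only the standard-library `ast` module and does not import or run the dataloader.

```python
import ast
from typing import Set

# Names taken from the checklist above.
REQUIRED_VARS = {
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_SEACROWD_VERSION",
}
REQUIRED_METHODS = {"_info", "_split_generators", "_generate_examples"}


def missing_items(source: str) -> Set[str]:
    """Return required variables/methods that the source does not define."""
    tree = ast.parse(source)
    assigned: Set[str] = set()
    methods: Set[str] = set()
    for node in tree.body:
        if isinstance(node, ast.Assign):
            # Plain module-level assignments such as `_LICENSE = "..."`.
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned.add(target.id)
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            # Annotated assignments such as `_LICENSE: str = "..."`.
            assigned.add(node.target.id)
        elif isinstance(node, ast.ClassDef):
            # Methods defined directly on any top-level class.
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    methods.add(item.name)
    return (REQUIRED_VARS - assigned) | (REQUIRED_METHODS - methods)


if __name__ == "__main__":
    sample = "_CITATION = ''\nclass Loader:\n    def _info(self):\n        pass\n"
    print(sorted(missing_items(sample)))
```

A check like this only confirms that the names exist; whether the values are correct still has to be verified by running the test command from the checklist.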