Closed akhdanfadh closed 5 months ago
@akhdanfadh
The code itself looks good to me; it just needs a try-except around the third-party imports, like in #556. Let's also follow that PR for how to store the downloaded data.
I found that the dataset comes from this paper, which links to this GitHub repo, so you can change the dataloader metadata accordingly.
@holylovenia Datasheet also needs to be updated with paper details.
@elyanah-aco Done!
@holylovenia Datasheet also needs to be updated with paper details.
@elyanah-aco You meant the paper title, or...?
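For reference, the try-except guard around third-party imports requested in the review could be sketched roughly as below. This is a minimal illustration, not the code from this PR or from #556: the helper name `download_from_gdrive`, its `download` parameter, and the file ID and filename passed to it are hypothetical. Only `gdown`, `pip install gdown`, and the `data/multilingual_alpaca/` folder come from the PR description.

```python
from pathlib import Path

# Guard the third-party import so that importing the dataloader does not fail
# when gdown is missing; the error is raised only when a download is attempted.
try:
    import gdown
except ImportError:
    gdown = None

# Local folder mentioned in the PR description.
DATA_DIR = Path("data") / "multilingual_alpaca"


def download_from_gdrive(file_id, filename, data_dir=DATA_DIR, download=None):
    """Download one Google Drive file into data_dir unless it is already there.

    `download` is a hypothetical hook for swapping in another downloader;
    by default gdown.download is used.
    """
    if download is None:
        if gdown is None:
            raise ImportError("This dataloader needs gdown. Install it with `pip install gdown`.")
        download = gdown.download

    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename

    # Skip the network call when the file was downloaded in an earlier run.
    if not target.exists():
        url = f"https://drive.google.com/uc?id={file_id}"
        download(url=url, output=str(target), quiet=False)
    return target
```

With this shape, a missing `gdown` only surfaces as an `ImportError` with an install hint when the helper is actually called, rather than when the module is imported.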
Closes #531
Similar to #556: I use a third-party library to download the GDrive data, i.e., `pip install gdown`, because it is more reliable than the `dl_manager`. Similarly, I also store the downloaded data in `data/multilingual_alpaca/`. I am aware that I should make a PR for those two things; I am just waiting for further instructions.
Checkbox
- `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within `{my_dataset}` folder.
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- `datasets.load_dataset` function.
- `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
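As a quick self-check before running the test command above, the variable and method items in the checklist could be verified with a small helper like the one below. This is only a sketch: `missing_variables` and `missing_methods` are hypothetical names and are not part of the SEACrowd repository. They check that the names exist, not that their values are correct, so they do not replace `tests.test_seacrowd`.

```python
from types import SimpleNamespace

# Names taken from the checklist above.
REQUIRED_VARIABLES = [
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_LOCAL", "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_SEACROWD_VERSION",
]
REQUIRED_METHODS = ["_info", "_split_generators", "_generate_examples"]


def missing_variables(module):
    """Return the checklist variables that the dataloader module does not define."""
    return [name for name in REQUIRED_VARIABLES if not hasattr(module, name)]


def missing_methods(builder_cls):
    """Return the checklist methods that the builder class does not implement."""
    return [
        name for name in REQUIRED_METHODS
        if not callable(getattr(builder_cls, name, None))
    ]


# Stand-in for an imported dataloader module that forgot to set _LICENSE.
demo_module = SimpleNamespace(
    **{name: "" for name in REQUIRED_VARIABLES if name != "_LICENSE"}
)
print(missing_variables(demo_module))
```

In practice the argument would be the imported dataloader module and its builder class, with the `SimpleNamespace` stand-in used here only to keep the example self-contained.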