raileymontalan closed 6 months ago
Hi @raileymontalan, looks good to me! I have one suggestion though.
Could you please add the `author`, `category`, `date`, `img_url`, `url`, and `website` under the `metadata` in the `seacrowd_imtext` schema?
Details added to metadata. Ready for review, thanks!
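For reference, a minimal sketch of what carrying those six fields into a metadata dictionary could look like. Only the field names come from the review comment; the helper `build_metadata`, the sample record, and the choice to use `None` for absent keys are hypothetical and are not taken from the actual dataloader.

```python
from typing import Any, Dict

# Field names are the ones requested in the review; everything else is illustrative.
METADATA_FIELDS = ("author", "category", "date", "img_url", "url", "website")


def build_metadata(article: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the requested fields from a raw article record.

    Keys that are absent from the record are filled with None (an assumption),
    and keys outside METADATA_FIELDS are ignored.
    """
    return {field: article.get(field) for field in METADATA_FIELDS}


# Hypothetical raw record, used only to show the shape of the result.
example = build_metadata(
    {"author": "A. Writer", "url": "https://example.com/a", "extra": 1}
)
```

The real schema may require typed features or different defaults for missing values, so this only illustrates the shape of the change.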
Closes #12
Notes
There are many articles in the dataset whose corresponding image file does not exist in the repository. Reporting the statistics here:
- `train` split (281403 total examples): 18923 missing images (~6.72%)
- `test` split (35177 total examples): 2356 missing images (~6.70%)
- `validation` split (35175 total examples): 2369 missing images (~6.73%)

Checkbox
- Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within `{my_dataset}` folder.
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Confirm dataloader script works with `datasets.load_dataset` function.
- Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
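As a cross-check on the missing-image figures reported under Notes, here is a minimal sketch of how such counts and percentages could be computed. The helpers `count_missing` and `missing_rate` are hypothetical and not part of the dataloader; the per-split numbers are the ones stated in the Notes, and the injectable `exists` predicate is an assumption made so the logic can be exercised without the image files.

```python
import os
from typing import Callable, Iterable, Tuple


def count_missing(
    image_paths: Iterable[str],
    exists: Callable[[str], bool] = os.path.exists,
) -> Tuple[int, int]:
    """Return (missing, total) for an iterable of expected image paths."""
    total = 0
    missing = 0
    for path in image_paths:
        total += 1
        if not exists(path):
            missing += 1
    return missing, total


def missing_rate(missing: int, total: int) -> float:
    """Percentage of missing images, rounded to two decimals (0.0 for an empty split)."""
    return round(100.0 * missing / total, 2) if total else 0.0


# (missing, total) per split, copied from the Notes section of this PR.
REPORTED = {
    "train": (18923, 281403),
    "test": (2356, 35177),
    "validation": (2369, 35175),
}
rates = {split: missing_rate(m, t) for split, (m, t) in REPORTED.items()}
```

With the reported counts, the computed rates agree with the percentages quoted in the Notes to two decimal places.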