I implemented one config per language subset, so config names look like this: bactrian_x_id_source, bactrian_x_km_seacrowd_t2t, etc. When testing, pass bactrian_x_<subset> to the --subset_id parameter.
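To make the naming scheme concrete, here is a minimal sketch of how the config names and the --subset_id value relate. The helper name config_names is hypothetical and only for illustration; it is not part of the dataloader.

```python
from typing import List, Tuple


def config_names(lang: str) -> Tuple[str, List[str]]:
    """Return the --subset_id value and the two config names for one language.

    Hypothetical helper for illustration; only the naming pattern
    (bactrian_x_<lang>_source / bactrian_x_<lang>_seacrowd_t2t) comes from this PR.
    """
    subset_id = f"bactrian_x_{lang}"
    return subset_id, [f"{subset_id}_source", f"{subset_id}_seacrowd_t2t"]


subset_id, configs = config_names("km")
print(subset_id)  # value to pass to --subset_id
print(configs)    # source config and seacrowd_t2t config for Khmer
```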
The source schema has one more variable, input, in addition to the instruction, so I manually combined the two as "Instruction: {instruction}\nInput: {input}" in text_1 of the seacrowd_t2t schema. I don't know whether that is allowed, so let's discuss.
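For discussion, the mapping described above can be sketched roughly as follows. This is an illustration, not the exact dataloader code: the function name to_t2t, the source field names (instruction, input, output), and the text_1_name / text_2_name values are assumptions.

```python
from typing import Dict


def to_t2t(idx: int, row: Dict[str, str]) -> Dict[str, str]:
    """Map one source row to a seacrowd_t2t-style example (illustrative sketch).

    Assumes the source row has "instruction", "input" and "output" fields.
    """
    # Fold the extra source field into text_1 using the template from this PR.
    text_1 = f"Instruction: {row['instruction']}\nInput: {row['input']}"
    return {
        "id": str(idx),
        "text_1": text_1,
        "text_2": row["output"],
        "text_1_name": "instruction_and_input",  # assumed label
        "text_2_name": "output",                 # assumed label
    }


example = to_t2t(0, {"instruction": "Translate to English.", "input": "Halo", "output": "Hello"})
print(example["text_1"])
```

One open question with this template is that rows with an empty input would still get a trailing "Input: " line; whether to drop it in that case is part of what I'd like to discuss.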
Note that for the Khmer subset, the loaded data looks as follows:
At first, I thought this was an encoding problem that needed to be fixed, but it turns out I get the same result when loading from HF directly, as follows:
[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
Closes #424