Closed: patrickamadeus closed this 4 months ago
It's done! @sabilmakbar thank you for the review ☺️
By the way, @patrickamadeus, did you happen to observe the different number of samples generated from your dataloader implementation compared to the one reported in their GH?
I noticed two subsets have a minor difference in sample count:

- Subset `VLSP20-official`: 789 (reported on Src GH) vs 790 (generated examples)
- Subset `wiki-alt`: 20000 (reported on Src GH) vs 20106 (generated examples)
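As an aside, this kind of discrepancy can be surfaced with a quick per-subset diff of reported vs generated counts. This is a hypothetical helper, not part of the SEACrowd codebase; the numbers are the ones quoted above:

```python
def count_mismatches(reported: dict, generated: dict) -> dict:
    """Return {subset: (reported, generated)} for every subset whose
    generated example count differs from the reported one."""
    return {
        name: (reported[name], generated[name])
        for name in reported
        if generated.get(name) != reported[name]
    }

# Counts quoted in this thread:
reported = {"VLSP20-official": 789, "wiki-alt": 20000}
generated = {"VLSP20-official": 790, "wiki-alt": 20106}
print(count_mismatches(reported, generated))
# {'VLSP20-official': (789, 790), 'wiki-alt': (20000, 20106)}
```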
Hi @sabilmakbar! That's interesting; they mentioned that they have N sentences. I actually split the dataset on every `\n` instead of assuming `.` marks the end of a sentence (which would inflate the numbers even further).
Since they don't give any clear definition of what counts as a "sentence", do you think it's reasonable for now to treat each line as one sentence? I have reviewed each dataset, and every generated example count matches its file's line count.
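For concreteness, the line-per-sentence convention described above can be sketched as follows. This is a simplified stand-alone sketch, not the actual dataloader; the function signature and field names are hypothetical:

```python
from typing import Dict, Iterator, Tuple

def generate_examples(filepath: str) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield one example per newline-delimited line, so the number of
    generated examples equals the file's line count."""
    with open(filepath, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            yield idx, {"id": str(idx), "text": line.rstrip("\n")}
```

Splitting on `.` instead would also break apart abbreviations and multi-clause lines, which is why it would inflate the counts even further.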
Okay then, since I think everything else has the same count as reported, we can acknowledge it (probably adding it as an inline comment would be better).
Thanks for the work, @patrickamadeus! Let's wait for @raileymontalan's review.
Hi @raileymontalan, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
cc: @patrickamadeus
Closes #629
Checkbox

- Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in the dataset issue) and its `__init__.py` within the `{my_dataset}` folder.
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- Implement `_info()`, `_split_generators()`, and `_generate_examples()` in the dataloader script.
- Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Confirm the dataloader script works with the `datasets.load_dataset` function.
- Confirm the dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.

Tests
- indomain-news (3 splits)
- basic (1 split)
- VLSP20-official (1 split)
NB: `openSub` & `mono-vi` are left for future development (large Drive file download bottleneck); a thread has already been opened in the Discord discussion.