Closed mrqorib closed 4 months ago
I apologize that the medisco commits were also dragged into this PR; I created the branch from the medisco branch. Please check only the last two commits.
I have tested the dataloader manually, but please note that I only tested it on a subset of the data, as I kept failing to download the full 1.2 TB. The subset is representative of the directory structure of the full dataset.
Hi @mrqorib! Saw your comment regarding the medisco commits. Is it possible to separate them? It might require some surgery (the simplest approach would be to use git rebase --onto X Y), but at least it's easier in the long term to track which PR does what! Thank you so much!
Hi @mrqorib, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
@ljvmiranda921 @muhammadravi251001, since @mrqorib provided the unit test result, could you please do the review based on the implementation code? 🤔 Also feel free to ask @mrqorib to generate any outputs that would help your review process for this local dataloader.
Hi @muhammadravi251001, could we wait until @mrqorib removes the changes made to the medisco dataloader?
Sure, it's up to you, kak, since fewer than 31 hours remain. For now, I can't remove the medisco dataloader changes myself.
BTW @holylovenia, can we just delete both of the medisco files in the Files Changed tab here? Is that alright?
Hi all, thanks for your help with reviewing the PR. Sorry for the slow response, I was a bit busy. I can try to tidy up the medisco mess tonight. @holylovenia, please let me know whether just deleting the medisco files would be fine in case the rebase method suggested by @ljvmiranda921 fails.
Alright, after removing the medisco changes via the Files Changed tab, it looks good now. We can squash and merge this PR. Thanks for the contribution, @mrqorib!
I think just deleting the medisco files will do for now, because this PR then no longer contains any other dataloader implementation. I guess that is the main point.
Closes #528
Checkbox
- Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within the {my_dataset} folder.
- Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
- Implement _info(), _split_generators() and _generate_examples() in the dataloader script.
- Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
- Confirm that the dataloader script works with the datasets.load_dataset function.
- Confirm that the dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
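As a rough illustration of what the checklist asks a reviewer to verify, here is a minimal sketch that scans a dataloader script for the variable and method names listed above. The helper `missing_checklist_items` is hypothetical and not part of the SEACrowd repository or its test suite; it only checks that the names are defined, not that their values are correct, and it does not replace running `tests.test_seacrowd`.

```python
import ast

# Names copied from the checklist above.
REQUIRED_VARIABLES = {
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_LOCAL", "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_SEACROWD_VERSION",
}
REQUIRED_METHODS = {"_info", "_split_generators", "_generate_examples"}


def missing_checklist_items(source: str) -> dict:
    """Hypothetical helper: report checklist names absent from a dataloader script."""
    tree = ast.parse(source)
    assigned, methods = set(), set()
    has_builder_configs = False

    for node in tree.body:
        if isinstance(node, ast.Assign):
            # Module-level assignments such as `_LOCAL = True`.
            assigned.update(t.id for t in node.targets if isinstance(t, ast.Name))
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    # Methods defined on the builder class.
                    methods.add(item.name)
                elif isinstance(item, ast.Assign):
                    # Class attribute `BUILDER_CONFIGS = [...]`.
                    if any(isinstance(t, ast.Name) and t.id == "BUILDER_CONFIGS"
                           for t in item.targets):
                        has_builder_configs = True

    return {
        "variables": sorted(REQUIRED_VARIABLES - assigned),
        "methods": sorted(REQUIRED_METHODS - methods),
        "builder_configs": has_builder_configs,
    }


# A deliberately incomplete, made-up script to show the report.
example = '''
_DATASETNAME = "my_dataset"
_LOCAL = True

class MyDataset:
    BUILDER_CONFIGS = []

    def _info(self):
        pass
'''

report = missing_checklist_items(example)
print(report["methods"])  # ['_generate_examples', '_split_generators']
```

A static scan like this can catch a forgotten variable before review, but whether the script actually loads data still has to be confirmed with the test command from the checklist.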