Closed Gyyz closed 1 month ago
Scripts Passed.
```bash
#!/bin/bash
# Define the list of language codes
LANGUAGES=(
  aaz abx ace agn agt ahk akb alj alp amk
  aoz atb atd att ban bbc bcl bgr bgs bgz
  bhp bkd bku blw blz bpr bps bru btd bth
  bto bts btx bug bvz bzi cbk ceb cfm
  cgc clu cmo cnh cnw csy ctd czt dgc dtp
  due duo ebk fil gbi gor heg hil hnj hnn
  hvn iba ifa ifb ifk ifu ify ilo ind iry
  isd itv ium ivb ivv jav jra kac khm kix
  kje kmk kne kqe krj ksc ksw kxm lao lbk
  lew lex lhi lhu ljp lus mad mak mbb mbd
  mbf mbi mbs mbt mej mkn mnb mog mqj mqy
  mrw msb msk msm mta mtg mtj mvp mwq
  mwv mya nbe nfa nia nij nlc npy obo
  pag pam plw pmf pne ppk prf prk ptu pww
  sas sbl sda sgb smk sml sun sxn szb tbl
  tby tcz tdt tgl tha tih tlb twu urk vie
  war whk wrs xbr yli yva zom zyp pse
  mnx mmn lsi hlt gdg bnj acn
)
# Loop through each language code
for lang in "${LANGUAGES[@]}"; do
  echo "Running subset_id for language: $lang"
  python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_"$lang"
done

make check_file=seacrowd/sea_datasets/unisent/unisent.py
```
Note that running this `make` check reformats the `_LANGUAGES` list into an unfriendly layout.
Hi @Gyyz, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.
I also still get errors when testing these `subset_id`s: agt, bru, bth, bzq, bzi, ium, ivb, jra, khm, ksw, kxm, lao, lhu, wmm, mya, ntx, pww, sml, sxn, tha, urk, vie, zlm.

Please make sure that you are using my original code without modifying the snippet below; I didn't pass the `split` args to my function since I don't use them:

```python
name=datasets.Split.TRAIN,
gen_kwargs={
    "filepath": os.path.join(data_dir),
    "split": "train",
},)
```
Sure, I didn't modify your code at all before testing it. How about on your end, @holylovenia: can you pass the test with these `subset_id`s?
```shell
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_agt
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_bru
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_bth
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_bzq
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_bzi
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_ium
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_ivb
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_jra
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_khm
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_ksw
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_kxm
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_lao
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_lhu
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_wmm
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_mya
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_ntx
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_pww
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_sml
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_sxn
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_tha
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_urk
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_vie
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_zlm
```
I still get the following errors:

```
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3: character maps to <undefined>
```

```
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
```
Hi @holylovenia, may I know if you could handle this PR review (since all your previous comments have been addressed)? I'll help with this one if you won't be able to do it.
I guess you should modify the `encoding` parameter in the `open` function to:

```python
with open(filepath, "r", encoding="utf-8") as filein:
```

Using this, I get an OK response from `tests.test_seacrowd` and can load the dataset perfectly with the `.load_dataset()` method.
The issue is that when I run this dataloader implementation on my Windows laptop, it doesn't always default to `utf-8` encoding, whereas on Linux and macOS the default is `utf-8`. To prevent future encoding errors, I suggest adding an `encoding` parameter to the `open` call, @Gyyz.
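For reference, a minimal sketch of the platform difference (the file path and lexicon line below are illustrative, not real UniSent data):

```python
# Sketch: why an explicit encoding argument matters across platforms.
import locale
import os
import tempfile

# On Windows the preferred locale encoding is often a legacy code page such
# as "cp1252"; open() without an encoding argument falls back to it and can
# raise UnicodeDecodeError on UTF-8 lexicon files.
print(locale.getpreferredencoding(False))

# Writing and reading back a UTF-8 file with the encoding stated explicitly
# behaves identically on Windows, Linux, and macOS.
path = os.path.join(tempfile.mkdtemp(), "lexicon_sample.txt")
with open(path, "w", encoding="utf-8") as fout:
    fout.write("ស្រឡាញ់\t1\n")  # illustrative Khmer entry, not real lexicon data
with open(path, "r", encoding="utf-8") as filein:
    line = filein.readline()
```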
This is the same error as the issue I mentioned in my comment here.
After adding the `encoding` parameter, the subsets `bzq`, `wmm`, `ntx`, and `zlm` still throw these errors:

```
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/bzq_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/wmm_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/ntx_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/zlm_unisent_lexicon.txt
```

I guess the `_LANGUAGES` naming for those subsets is still unaligned; you need to find the right subset codes for those particular languages. Find the language names here.
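One way to surface misnamed codes before a full test run is a small availability probe against the raw lexicon URLs. This is a hypothetical helper, not part of the dataloader:

```python
# Hypothetical helper: check whether a UniSent lexicon file exists on GitHub
# for a given ISO code, so misnamed subsets surface early.
import urllib.error
import urllib.request

BASE = "https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/{}_unisent_lexicon.txt"

def lexicon_exists(code: str) -> bool:
    """Return True if the lexicon file for `code` is reachable (HTTP 200)."""
    req = urllib.request.Request(BASE.format(code), method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False

# e.g. lexicon_exists("khm") should hold, while lexicon_exists("bzq") should
# not, matching the FileNotFoundError messages above.
```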
> Hi @holylovenia, may I know if you could handle this PR review (since all your previous comments have been addressed)? I'll help with this one if you won't be able to do it.
Hi @sabilmakbar, thanks for offering! It'll be great if you can take over.
Hi @Gyyz, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
cc: @muhammadravi251001 @sabilmakbar
Yes, I'm not sure about the language name list; I got it from #626, which I think is a list focused on Southeast Asian languages. I will intersect both language name lists and make another commit.
Yes, #626 has four extra language names.
@muhammadravi251001 @sabilmakbar @holylovenia problem fixed. Please check the latest commit.🎇🎇🎇
Is there any other feedback from you, @sabilmakbar?
Closes #626
Checkbox
- Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in the dataset issue) and its `__init__.py` within the `{my_dataset}` folder.
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- Implement `_info()`, `_split_generators()`, and `_generate_examples()` in the dataloader script.
- Make sure the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Confirm the dataloader works with the `datasets.load_dataset` function.
- Confirm the test passes with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.