SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #626 | Create dataset loader for UniSent (PR #647)

Closed: Gyyz closed this 1 month ago

Gyyz commented 2 months ago

Closes #626


Gyyz commented 2 months ago

Scripts passed.

1. The test script:

#!/bin/bash

# Define the list of language codes
LANGUAGES=(
    aaz abx ace agn agt ahk akb alj alp amk
    aoz atb atd att ban bbc bcl bgr bgs bgz
    bhp bkd bku blw blz bpr bps bru btd bth
    bto bts btx bug bvz bzi cbk ceb cfm
    cgc clu cmo cnh cnw csy ctd czt dgc dtp
    due duo ebk fil gbi gor heg hil hnj hnn
    hvn iba ifa ifb ifk ifu ify ilo ind iry
    isd itv ium ivb ivv jav jra kac khm kix
    kje kmk kne kqe krj ksc ksw kxm lao lbk
    lew lex lhi lhu ljp lus mad mak mbb mbd
    mbf mbi mbs mbt mej mkn mnb mog mqj mqy
    mrw msb msk msm mta mtg mtj mvp mwq
    mwv mya nbe nfa nia nij nlc npy obo
    pag pam plw pmf pne ppk prf prk ptu pww
    sas sbl sda sgb smk sml sun sxn szb tbl
    tby tcz tdt tgl tha tih tlb twu urk vie
    war whk wrs xbr yli yva zom zyp pse
    mnx mmn lsi hlt gdg bnj acn
)

# Loop through each language code
for lang in "${LANGUAGES[@]}"; do
    echo "Running subset_id for language: $lang"
    python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_"$lang"
done

2. The make script:

make check_file=seacrowd/sea_datasets/unisent/unisent.py

Note that this script reformats the _LANGUAGES list into an unreadable layout; see the workaround sketch below.
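One possible workaround, assuming the reformatting comes from black, is to wrap the list in fmt markers so the formatter leaves it alone (only the first twenty codes are shown here, for illustration):

# Keep the formatter from reflowing this long list.
# fmt: off
_LANGUAGES = [
    "aaz", "abx", "ace", "agn", "agt", "ahk", "akb", "alj", "alp", "amk",
    "aoz", "atb", "atd", "att", "ban", "bbc", "bcl", "bgr", "bgs", "bgz",
]
# fmt: on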

holylovenia commented 1 month ago

Hi @Gyyz, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

Gyyz commented 1 month ago

Also, I still get errors when testing these subset_ids: agt, bru, bth, bzq, bzi, ium, ivb, jra, khm, ksw, kxm, lao, lhu, wmm, mya, ntx, pww, sml, sxn, tha, urk, vie, zlm.

Please make sure that you are using my original code without modifying the snippet below. I didn't pass the split argument to my function because I don't use it:


    name=datasets.Split.TRAIN,
    gen_kwargs={
        "filepath": os.path.join(data_dir),
        "split": "train",
    },)
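For context, that fragment sits inside _split_generators, roughly like the sketch below; the _URLS lookup and the download call are assumptions for illustration, not the exact code in the PR (it also assumes import os and import datasets plus a module-level _URLS mapping):

def _split_generators(self, dl_manager):
    # Download the per-language lexicon file (URL lookup is assumed here).
    data_dir = dl_manager.download_and_extract(_URLS[self.config.subset_id])
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "filepath": os.path.join(data_dir),
                "split": "train",
            },
        )
    ]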
muhammadravi251001 commented 1 month ago

Also, I still get errors when testing these subset_ids: agt, bru, bth, bzq, bzi, ium, ivb, jra, khm, ksw, kxm, lao, lhu, wmm, mya, ntx, pww, sml, sxn, tha, urk, vie, zlm.

Please make sure that you are using my original code without modifying the snippet below. I didn't pass the split argument to my function because I don't use it:

    name=datasets.Split.TRAIN,
    gen_kwargs={
        "filepath": os.path.join(data_dir),
        "split": "train",
    },)

Sure, I didn't modify your code at all before testing it. How about on your end, @holylovenia? Can you pass the tests with these subset_ids?

python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_agt
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_bru
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_bth
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_bzq
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_bzi
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_ium
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_ivb
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_jra
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_khm
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_ksw
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_kxm
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_lao
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_lhu
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_wmm
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_mya
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_ntx
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_pww
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_sml
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_sxn
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_tha
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_urk
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_vie
python -m tests.test_seacrowd seacrowd/sea_datasets/unisent/unisent.py --subset_id unisent_zlm

I still get UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3: character maps to <undefined>, followed by datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset.

sabilmakbar commented 1 month ago

Hi @holylovenia, may I know if you could handle this PR review (since all your previous comments have been addressed)? I'll help with this one if you aren't able to.

muhammadravi251001 commented 1 month ago

I think you should set the encoding parameter in the open call, like this:

with open(filepath, "r", encoding="utf-8") as filein:

Using this, tests.test_seacrowd passes and I can load the dataset with the .load_dataset() method.

The issue is that when I run this dataloader implementation on my Windows laptop, open does not default to utf-8 encoding, whereas on Linux and macOS the default is utf-8.

To prevent future encoding errors, I suggest adding an explicit encoding parameter to the open call, @Gyyz.

This is the same error as the issue I mentioned in my comment here.
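Something like the sketch below is what I mean; _read_lexicon is just a placeholder name and the whitespace-separated word/sentiment layout is an assumption, the explicit encoding is the only point:

def _read_lexicon(filepath):
    # Open with an explicit encoding so behaviour matches across
    # Windows (default cp1252) and Linux/macOS (default utf-8).
    with open(filepath, "r", encoding="utf-8") as filein:
        for idx, line in enumerate(filein):
            parts = line.strip().split()
            if len(parts) < 2:
                continue
            # Assumed layout: word followed by a sentiment label.
            yield idx, {"word": parts[0], "sentiment": parts[1]}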

muhammadravi251001 commented 1 month ago

After adding the encoding parameter, these subsets (bzq, wmm, ntx, zlm) still throw the following errors:

FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/bzq_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/wmm_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/ntx_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/zlm_unisent_lexicon.txt

I guess the _LANGUAGES codes for those subsets are still unaligned; you need to find the right codes for those particular languages. You can look up the language names here.
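A quick way to check which codes actually have a lexicon file is a standalone script like the one below (it only probes the raw GitHub URLs from the errors above; it is not part of the dataloader):

import urllib.error
import urllib.request

URL = "https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/{lang}_unisent_lexicon.txt"

def lexicon_exists(lang):
    # A HEAD request is enough: a 404 means there is no lexicon for this code.
    req = urllib.request.Request(URL.format(lang=lang), method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

print([lang for lang in ("bzq", "wmm", "ntx", "zlm") if not lexicon_exists(lang)])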

holylovenia commented 1 month ago

Hi @holylovenia, may I know if you could handle this PR review (since all your previous comments have been addressed)? I'll help with this one if you aren't able to.

Hi @sabilmakbar, thanks for offering! It'll be great if you can take over.

holylovenia commented 1 month ago

Hi @Gyyz, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @muhammadravi251001 @sabilmakbar

Gyyz commented 1 month ago

After adding the encoding parameter, these subsets (bzq, wmm, ntx, zlm) still throw the following errors:

FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/bzq_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/wmm_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/ntx_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/zlm_unisent_lexicon.txt

I guess the _LANGUAGES codes for those subsets are still unaligned; you need to find the right codes for those particular languages. You can look up the language names here.

Yes, I'm not sure about the language name list; I took it from #626, which I think focuses on Southeast Asian languages. I will intersect both language lists and make another commit.

Gyyz commented 1 month ago

After adding the encoding parameter, these subsets (bzq, wmm, ntx, zlm) still throw the following errors:

FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/bzq_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/wmm_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/ntx_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/zlm_unisent_lexicon.txt

I guess the _LANGUAGES codes for those subsets are still unaligned; you need to find the right codes for those particular languages. You can look up the language names here.

Yes, I'm not sure about the language name list; I took it from #626, which I think focuses on Southeast Asian languages. I will intersect both language lists and make another commit.

Yes, #626 has four extra language names. (Screenshot: 2024-05-30 130325)
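The intersection itself is straightforward; the two lists below are placeholders just to show the idea, while the real inputs are the codes from issue #626 and the files in unisent_lexica_v1:

# Placeholder inputs, for illustration only.
issue_codes = ["ace", "ban", "bzq", "wmm", "ntx", "zlm"]
unisent_codes = ["ace", "ban", "ceb", "ilo"]

# Keep only the codes that appear in both lists, in a stable order.
_LANGUAGES = sorted(set(issue_codes) & set(unisent_codes))
print(_LANGUAGES)  # ['ace', 'ban']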

Gyyz commented 1 month ago

After adding the encoding parameter, these subsets (bzq, wmm, ntx, zlm) still throw the following errors:

FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/bzq_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/wmm_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/ntx_unisent_lexicon.txt
FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/ehsanasgari/UniSent/master/unisent_lexica_v1/zlm_unisent_lexicon.txt

I guess the _LANGUAGES codes for those subsets are still unaligned; you need to find the right codes for those particular languages. You can look up the language names here.

@muhammadravi251001 @sabilmakbar @holylovenia Problem fixed. Please check the latest commit. 🎇🎇🎇

muhammadravi251001 commented 1 month ago

Is there any other feedback from you, kak @sabilmakbar?