Closes #113 | Create dataset loader for HSE Thai

khelli07 commented 3 months ago

Closes #113

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/hse_thai/hse_thai.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

yongzx commented 3 months ago

I've run the test and it works. I agree with @ljvmiranda921's suggestions, and I have left further comments on the license.

holylovenia commented 2 months ago

Friendly reminder for @khelli07 to address @yongzx and @ljvmiranda921's suggestions.

khelli07 commented 2 months ago

Hmm, I downloaded the data from the original source (http://web-corpora.net/ThaiCorpus/search/). Unfortunately, I have no idea how to interpret it, since I do not understand Thai. Here are some screenshots.

List of folders:

File inside folders: (most are XML)

My guess is that <se> is a sentence and <w> is a word. For language identification, I guess I would make each sentence one data row, where a sentence is constructed by merging all the black strings inside the <w> elements. But yeah, for translation and part-of-speech tagging, I might have to pass this task to someone else :)

holylovenia commented 2 months ago

Hmmm, let me ask our Thai contributor. Hello @mrpeerat, could you please help @khelli07 understand this dataset? 🙏

mrpeerat commented 2 months ago

Hi, to construct the sentence, you can merge all the words <w> (the black words) in the same sentence (<se>), as you mentioned. For the translation, all the blue words are word-to-word translations and should not be used as a sentence translation (you can see that the meaning is usually incorrect if you concatenate all the translated words in the same sentence). As for the PoS, it is designed for English, not Thai. Feel free to ask more if you have any questions :)
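The sentence construction described above can be sketched roughly as follows. This is only an illustration, not the dataloader in this PR: the sample XML, the function name `sentences_from_xml`, and the `id`/`text` row layout are all assumptions, and the real corpus files may nest or annotate `<w>` differently. Words are joined without spaces, since Thai does not separate words with whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus files may use other attributes or nesting.
SAMPLE_XML = """<body>
  <se><w>ฉัน</w><w>กิน</w><w>ข้าว</w></se>
  <se><w>เขา</w><w>ไป</w></se>
</body>"""


def sentences_from_xml(xml_text):
    """Return one string per <se>, built by concatenating its <w> texts."""
    root = ET.fromstring(xml_text)
    sentences = []
    for se in root.iter("se"):
        # Take the text content of every <w> inside this sentence.
        words = ["".join(w.itertext()).strip() for w in se.iter("w")]
        sentence = "".join(word for word in words if word)
        if sentence:  # skip sentences that contain no words
            sentences.append(sentence)
    return sentences


# One sentence per data row, as proposed for the language identification task.
rows = [{"id": str(i), "text": s} for i, s in enumerate(sentences_from_xml(SAMPLE_XML))]
```

If the `<w>` elements turn out to carry the translation or PoS as child elements with their own text, `itertext()` would pull that in too, so the extraction step would need to be narrowed to the Thai surface form only.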

khelli07 commented 2 months ago

> the PoS is designed for English, not Thai

Hi, can you explain what you mean by this?

Also, my understanding so far of what I need to do is:

1. Instead of using the Kaggle source, use the original source.
2. The task is still language identification/modelling (same as the original issue), since a) the translation is not valid for a translation task; and b) because of a), the PoS is also not valid for a PoS tagging task.

Am I correct?

mrpeerat commented 2 months ago

I looked at the PoS and found that some of them were annotated for the translation word, not for Thai. For instance, given the word "กับ", the PoS annotated that it is possibly a noun, preposition, or conjunction. However, the dataset said "preposition". But, in Thai, the word should be "prepositional phrase". In this case, it looks like the translation and the PoS of the translation are correct. But the PoS of Thai is incorrect.

1. Correct.
2a. It's a word-to-word translation (source => target translation, but done one word at a time and ignoring the semantics of the sentence). So I don't know whether it is useful in 2024, since we have a lot of bilingual corpora.
2b. Correct.

holylovenia commented 2 months ago

Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

khelli07 commented 2 months ago

Okay, doing it today or tomorrow!

khelli07 commented 2 months ago

The download is soooo slow :') (It's not my internet; I think it's the server's upload speed. This has happened to me before.)

holylovenia commented 2 months ago

Hi @khelli07, did you manage to download it?

holylovenia commented 1 month ago

Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @yongzx @ljvmiranda921

holylovenia commented 1 week ago

Hi @khelli07, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @ljvmiranda921 @yongzx

khelli07 commented 1 week ago

Hi, I ran into a problem downloading the data (the server's upload speed is too low), so I decided to download it once and re-upload it with Git LFS here (https://github.com/khelli07/hse-thai-for-seacrowd). It seems that the data can be redistributed (could you please double-check?). I'll try to finish this if this option works.

SEACrowd / seacrowd-datahub

Closes #113 | Create dataset loader for HSE Thai #557
