Open khelli07 opened 3 months ago
I've run the test and it works. I agree with @ljvmiranda921's suggestions, and I have left further comments on the license.
Friendly reminder for @khelli07 to address @yongzx and @ljvmiranda921's suggestions.
Hmm, I downloaded the original data from the original source (http://web-corpora.net/ThaiCorpus/search/). But unfortunately I have no idea at all what it contains, since I do not understand Thai. Here are some screenshots.
List of folders:
Files inside folders: (most are XML)
I can guess that `<se>` is a sentence and `<w>` is a word. For language identification, I guess I just make 1 sentence = 1 data row, where 1 sentence is constructed by merging all the black strings inside the `<w>` tags. But yeah, for translation and part-of-speech tagging, I might have to pass this task to someone else :)
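The merging step described above could be sketched roughly like this, assuming the `<se>`/`<w>` structure guessed from the screenshots (the real files may carry extra attributes or markup, e.g. the word-level translations):

```python
import xml.etree.ElementTree as ET

# Toy XML mimicking the assumed structure: <se> = sentence, <w> = word.
# The actual corpus files are likely richer than this.
sample = """
<text>
  <se><w>hello</w><w>world</w></se>
  <se><w>foo</w><w>bar</w><w>baz</w></se>
</text>
"""

def sentences_from_xml(xml_string):
    """Merge the text of every <w> under each <se> into one data row."""
    root = ET.fromstring(xml_string)
    rows = []
    for se in root.iter("se"):
        words = [w.text for w in se.iter("w") if w.text]
        # Thai is written without spaces between words, so join with "".
        rows.append("".join(words))
    return rows

print(sentences_from_xml(sample))  # one merged string per <se>
```

This is only a sketch; the real dataloader would iterate over the downloaded XML files instead of an inline string.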
Hmmm, let me ask our Thai contributor. Hello @mrpeerat, could you please help @khelli07 understand this dataset? 🙏
Hi, to construct the sentence, you can merge all the words `<w>` (the black words) in the same sentence (`<se>`), as you mentioned. For the translation, all the blue words are word-to-word translations and should not be used as a sentence translation (you can see that the meaning is usually incorrect if you concatenate all the translated words of a sentence together). For the PoS, the PoS is designed for English, not Thai. Feel free to ask more if you have any questions :)
> the PoS is designed for English, not Thai

Hi, can you explain what you mean by this?
Also, so far, what I understand of what I need to do is:
- Instead of using the Kaggle source, use the original source.
- The task is still language identification/modelling (same as the original issue) since a) the translation is not valid for a translation task; and b) because of a), the PoS is also not valid for a PoS tagging task.

Am I correct?
I looked at the PoS and found that some of the tags were annotated for the translated word, not for the Thai word. For instance, for the word "กับ", the annotation says it could be a noun, preposition, or conjunction, and the dataset chose "preposition". But in Thai, the word should be a "prepositional phrase". In this case, the translation and the PoS of the translation are correct, but the PoS of the Thai word is incorrect.
Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.
Okay, doing it today or tomorrow!
The download is soooo slow :') *It's not my internet; I think it's the server's upload capacity (this has happened to me before).
Hi @khelli07, did you manage to download it?
Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
cc: @yongzx @ljvmiranda921
Hi @khelli07, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️
Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪
Thanks again!
cc: @ljvmiranda921 @yongzx
Hi, I got a problem while downloading the data (the server's upload capacity is too low), so I decided to download it first and upload it with Git LFS here (https://github.com/khelli07/hse-thai-for-seacrowd). It seems that the data can be redistributed (please double-check?). I'll try to finish this if this option works.
Closes #113
Checklist
- [ ] Name the dataloader `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [ ] Implement the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- [ ] Implement `_info()`, `_split_generators()`, and `_generate_examples()` in the dataloader script.
- [ ] Make sure the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- [ ] Confirm the dataloader script works with the `datasets.load_dataset` function.
- [ ] Confirm the dataloader passes the unit tests by running `python -m tests.test_seacrowd seacrowd/sea_datasets/hse_thai/hse_thai.py`.