SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for ETOS #354

Closed: SamuelCahyawijaya closed this issue 6 months ago

SamuelCahyawijaya commented 8 months ago

Dataloader name: etos/etos.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?etos

| Field | Value |
|---|---|
| Dataset | etos |
| Description | ETOS (Ejaan oTOmatiS) is a dataset for automatic spelling correction of formal Indonesian text. It consists of 200 sentences, each containing at least one typo. It has 4,323 tokens, 288 of which are non-word errors. |
| Subsets | - |
| Languages | ind |
| Tasks | POS Tagging |
| License | GNU Affero General Public License v3.0 (agpl-3.0) |
| Homepage | https://github.com/ir-nlp-csui/etos |
| HF URL | - |
| Paper URL | https://ieeexplore.ieee.org/document/10053062 |
raileymontalan commented 8 months ago

self-assign

raileymontalan commented 8 months ago

Hi @holylovenia @SamuelCahyawijaya, so this dataset's task is supposed to be Error Spelling Correction, which we currently don't have as an existing Task. Another concern: the data provided (1) doesn't specify which tokens are incorrectly spelled, and (2) doesn't contain any corrections for these typos. What would you recommend we do? Thanks!

holylovenia commented 8 months ago

https://github.com/ir-nlp-csui/etos

I tried manually checking the data and arrived at the same conclusion as @raileymontalan, i.e., the data neither specify nor fix the typos.

In its current state, without sufficient annotations, I doubt the dataset would be useful. Should we consider (1) removing this dataloader issue and the datasheet, or (2) just implementing the source schema (no seacrowd schema needed)?

cc: @SamuelCahyawijaya @sabilmakbar

raileymontalan commented 8 months ago

Hi @SamuelCahyawijaya @sabilmakbar, any thoughts on @holylovenia's suggestions? It's looking like option 2 (implementing only the source schema) is preferable here.

holylovenia commented 8 months ago

> Hi @SamuelCahyawijaya @sabilmakbar, any thoughts on @holylovenia's suggestions? It's looking like option 2 (implementing only the source schema) is preferable here.

Let's wait for some time for their reply. 🙏 All SEACrowd reviewers are full-time employees/students, so they are typically more available on weekends or during their time off.

SamuelCahyawijaya commented 8 months ago

@raileymontalan @holylovenia, sorry for the delay, I missed the notification for this issue.

Yes, I agree with both of you. I think for this kind of dataset with no clear downstream task, we can just implement the source schema. I will label the issue as source-only.
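
As a side note for readers, a source-only loader boils down to exposing the raw columns without any task-specific view. Below is a hedged sketch using the Hugging Face datasets builder API that SEACrowd dataloaders build on; the class, config, and feature names are illustrative assumptions, not the actual etos/etos.py.

```python
import datasets

# Illustrative sketch only (not the actual etos.py): a builder exposing a
# single "source" config that mirrors the raw CoNLL-U columns, with no
# task-specific seacrowd_* schema attached.

class EtosSourceOnly(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            name="etos_source",
            version=datasets.Version("1.0.0"),
            description="ETOS source schema (raw CoNLL-U columns)",
        )
    ]
    DEFAULT_CONFIG_NAME = "etos_source"

    def _info(self):
        # Expose the data as-is rather than forcing a task schema onto it.
        return datasets.DatasetInfo(
            description="ETOS, source schema only.",
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "tokens": datasets.Sequence(datasets.Value("string")),
                    "upos": datasets.Sequence(datasets.Value("string")),
                }
            ),
            homepage="https://github.com/ir-nlp-csui/etos",
        )

    # _split_generators() and _generate_examples() are omitted in this sketch;
    # the real loader would fetch the .conllu file(s) from the repository above.
```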

holylovenia commented 8 months ago

> https://github.com/ir-nlp-csui/etos
>
> I tried manually checking the data and arrived at the same conclusion as @raileymontalan, i.e., the data neither specify nor fix the typos.
>
> In its current state, without sufficient annotations, I doubt the dataset would be useful. Should we consider (1) removing this dataloader issue and the datasheet, or (2) just implementing the source schema (no seacrowd schema needed)?

@SamuelCahyawijaya It's not an unclear downstream task; the dataset doesn't have the appropriate annotations, hence my suggestion (1).

SamuelCahyawijaya commented 8 months ago

@holylovenia: Let me first clarify what I mean by an unclear downstream task.

In my opinion, the dataset is well-annotated, since it uses the standard CoNLL-U format with annotated POS tags. But the paper says the dataset is for automatic spelling correction, for which the existing annotation is not helpful (I don't have access to the paper, so I cannot confirm what kind of experiment was conducted), and it is rather confusing why the POS annotation is there. Hence, I called it an unclear downstream task.

In my opinion, the data look good: they come from a reputable source and have an accompanying publication, so I am not sure why we would need to remove the dataloader. If we can confirm which task is performed in the paper, we can perhaps follow their experiment (e.g., if they do POS tagging, we can simply change the task to POS tagging).
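
For context, CoNLL-U is line-oriented: ten tab-separated columns per token (FORM is column 2, UPOS column 4), `#` lines carry sentence metadata, and blank lines separate sentences. A minimal parsing sketch under those assumptions (the file name below is a placeholder):

```python
# Minimal sketch of reading a CoNLL-U file into (tokens, UPOS tags) pairs,
# assuming the standard 10-column layout. "etos.conllu" is a placeholder name.

def read_conllu(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):      # sentence-level metadata
                continue
            if not line:                  # blank line ends a sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty nodes
                continue
            tokens.append(cols[1])        # FORM
            tags.append(cols[3])          # UPOS
    if tokens:                            # flush the final sentence
        sentences.append((tokens, tags))
    return sentences

# e.g. sents = read_conllu("etos.conllu"); print(sents[0])
```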

holylovenia commented 8 months ago

> @holylovenia: Let me first clarify what I mean by an unclear downstream task.
>
> In my opinion, the dataset is well-annotated, since it uses the standard CoNLL-U format with annotated POS tags. But the paper says the dataset is for automatic spelling correction, for which the existing annotation is not helpful (I don't have access to the paper, so I cannot confirm what kind of experiment was conducted), and it is rather confusing why the POS annotation is there. Hence, I called it an unclear downstream task.
>
> In my opinion, the data look good: they come from a reputable source and have an accompanying publication, so I am not sure why we would need to remove the dataloader. If we can confirm which task is performed in the paper, we can perhaps follow their experiment (e.g., if they do POS tagging, we can simply change the task to POS tagging).

Yeah, sure. I agree that converting the task to POS tagging is much better than not using the dataset at all. What do you think, @raileymontalan?

raileymontalan commented 8 months ago

@holylovenia Yup, we can convert this to a POS tagging task instead. Thanks!
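
Since the thread settles on POS tagging, the task-specific view would be a standard sequence-labeling layout. A hedged sketch in plain datasets features follows; the real dataloader would go through SEACrowd's own schema helpers, and the field names here are assumptions, not the project's actual schema.

```python
import datasets

# Assumed sequence-labeling layout for the POS tagging view; the actual
# seacrowd schema may name these fields differently.
pos_features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "tokens": datasets.Sequence(datasets.Value("string")),
        "labels": datasets.Sequence(datasets.Value("string")),  # UPOS tags
    }
)
```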

github-actions[bot] commented 7 months ago

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

raileymontalan commented 7 months ago

Pushed some changes for this dataloader to include the missing `__init__.py` file. This should be good for PR review. Thanks! @sabilmakbar

akhdanfadh commented 6 months ago

Coming from reviewing the dataloader, I suppose this datasheet will be updated for the POS tagging task, right? Just a reminder @SamuelCahyawijaya

holylovenia commented 6 months ago

> Coming from reviewing the dataloader, I suppose this datasheet will be updated for the POS tagging task, right? Just a reminder @SamuelCahyawijaya

Yes, @akhdanfadh, because the dataset does not have the annotations needed for spelling error correction. Let me modify the datasheet and the issue ticket.