SamuelCahyawijaya closed this issue 6 months ago
Hi @holylovenia @SamuelCahyawijaya, this dataset's task is supposed to be Error Spelling Correction, which we currently don't have as an existing Task. Another concern: the provided data (1) does not specify which tokens are misspelled, and (2) does not contain any corrections for these typos. What would you recommend we do here? Thanks!
I tried manually checking the data and arrived at the same conclusion as @raileymontalan, i.e., the data neither specify nor fix the typos.
In this state, as it doesn't have sufficient annotations, I doubt the dataset would be useful. Should we consider 1) removing this dataloader issue and the datasheet, or 2) just implementing the source schema (no need for the seacrowd schema)?
cc: @SamuelCahyawijaya @sabilmakbar
Hi @SamuelCahyawijaya @sabilmakbar, any thoughts on @holylovenia's suggestions? It's looking like option 2 (remove the seacrowd schema) is preferable here.
Let's wait for some time for their reply. 🙏 All SEACrowd reviewers are full-time employees/students, so they are typically more available on weekends or during their time off.
@raileymontalan @holylovenia, sorry for the delay, I missed the notification for this issue.
Yes, I agree with both of you. For this kind of dataset with no clear downstream task, we can just implement the source schema. I will label the issue as source-only.
@SamuelCahyawijaya It's not an unclear downstream task. It doesn't have the appropriate annotations, hence my suggestion no. 1.
@holylovenia: Let me clear up the definition of "unclear downstream task" first.
In my opinion, the dataset is well annotated, since it uses the standard CoNLL-U format with annotated POS tags. But because the paper says the dataset is for automatic spelling correction, the existing annotation is not helpful for that task (I don't have access to the paper, so I cannot confirm what kind of experiment is conducted there), and it's rather confusing why the POS annotation is there; hence I called it an "unclear downstream task".
In my opinion, the data looks good, it comes from a reputable source, and it also has a publication, so in this case, I am not sure why we need to remove the dataloader. If we can confirm the task that is done in the paper, we can perhaps follow their experiment (e.g., if they do POS tagging then we can just change the task into POS tagging).
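If we do repurpose the dataset for POS tagging, the CoNLL-U annotation mentioned above already carries everything a tagging loader needs. As a rough sketch (not the actual `etos/etos.py` implementation; the sample sentence and function name below are invented for illustration), extracting (token, UPOS) pairs from CoNLL-U text could look like this:

```python
# Hypothetical sketch of pulling (FORM, UPOS) pairs out of CoNLL-U data,
# i.e., the annotation a POS-tagging dataloader would expose.
# The sample sentence below is invented for illustration.

def parse_conllu(text):
    """Yield one list of (form, upos) pairs per sentence in a CoNLL-U string."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            # Blank lines end a sentence; '#' lines are sentence-level metadata.
            if not line and sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append((cols[1], cols[3]))
    if sentence:
        yield sentence

sample = (
    "# text = Saya makan\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
)
sentences = list(parse_conllu(sample))
# One sentence with two (token, UPOS) pairs
```

The POS labels live in column 4 (UPOS), so no extra annotation work is needed to serve the dataset as a tagging task.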
Yeah sure, I agree that converting the task to POS tagging is much better than not using the dataset at all. What do you think, @raileymontalan?
@holylovenia Yup, we can convert this to a POS tagging task instead. Thanks!
Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.
Pushed some changes for this dataloader to include the missing __init__.py file. This should be good for PR review. Thanks! @sabilmakbar
Coming from reviewing the dataloader, I suppose this datasheet will be updated for the POS tagging task, right? Just a reminder @SamuelCahyawijaya
Yes, because the dataset does not contain the annotations needed for error spelling correction, @akhdanfadh. Let me modify the datasheet and the issue ticket.
Dataloader name: etos/etos.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?etos