SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
68 stars · 57 forks

Create dataset loader for SREDFM #48

Closed: SamuelCahyawijaya closed this 6 months ago

SamuelCahyawijaya commented 1 year ago

Dataloader name: sredfm/sredfm.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?sredfm

Dataset sredfm
Description SREDFM is an automatically annotated dataset for the relation extraction task covering 18 languages, 400 relation types, and 13 entity types, totaling more than 40 million triplet instances. SREDFM includes Vietnamese.
Subsets SREDFM_vi
Languages vie
Tasks Relation Extraction
License Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage https://github.com/babelscape/rebel
HF URL https://huggingface.co/datasets/Babelscape/SREDFM
Paper URL https://aclanthology.org/2023.acl-long.237/
sabilmakbar commented 12 months ago

This is interesting. I'll take this first and see whether it can be done within a week; otherwise I might release the task to others.

Btw, @SamuelCahyawijaya, do we have an increasing bonus system for issues that haven't been picked up by anyone after some time?

sabilmakbar commented 12 months ago

self-assign

sabilmakbar commented 11 months ago

I'll start on this once #62 has been reviewed. Should get to it by next week.

github-actions[bot] commented 11 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar commented 11 months ago

Will try to get this done by EoW (since my other dataloader has been merged recently)

github-actions[bot] commented 10 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar commented 10 months ago

Ugh, I didn't have the chance to do this. Will release this and see if anyone else can take it instead.

khelli07 commented 10 months ago

self-assign

github-actions[bot] commented 10 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

khelli07 commented 10 months ago

Hi, I did work on this. The dataloader works, but the tests are hard to pass, especially the seacrowd schema one. IIRC, I was having issues with duplicate IDs. I currently don't have time to fix it.

github-actions[bot] commented 9 months ago

Hi @khelli07, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

khelli07 commented 9 months ago

Currently discussing it with @SamuelCahyawijaya

khelli07 commented 9 months ago

Basically I got two problems:

  1. Spacing problem (the text and the text offsets assert as not equal; the content is the same, but the spacing differs) [screenshot]
  2. ID uniqueness problem [screenshot]

For the second problem, I am still not sure why, because:

  • For the source schema, I assume it only checks the yielded id (the `yield ..., { ... }` at the example loop)
  • So for ID uniqueness, it is most likely coming from the seacrowd schema, BUT assuming these are the IDs the test checks:
    "id": example["docid"] --> skip the same doc id
    "passages": passages --> use custom id (counter)
    "entities": entities --> counter
    "relations": relations --> counter

Can look at the current full code here

holylovenia commented 8 months ago

Basically I got two problems:

  1. Spacing problem (the text and the text offsets assert as not equal; the content is the same, but the spacing differs) [screenshot]
  2. ID uniqueness problem [screenshot]

For the second problem, I am still not sure why, because:

  • For the source schema, I assume it only checks the yielded id (the `yield ..., { ... }` at the example loop)
  • So for ID uniqueness, it is most likely coming from the seacrowd schema, BUT assuming these are the IDs the test checks:
    "id": example["docid"] --> skip the same doc id
    "passages": passages --> use custom id (counter)
    "entities": entities --> counter
    "relations": relations --> counter

Can look at the current full code here

Hi @khelli07, are you still discussing it with @SamuelCahyawijaya or do you guys need another pair of eyes?

khelli07 commented 8 months ago

I think @SamuelCahyawijaya is currently busy. We might need some extra help.

sabilmakbar commented 8 months ago

For 1, I believe the text_offset was generated from the text field, but I saw in the offset text that a newline isn't present in the original text. Is that expected?
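To make the spacing failure concrete, here is a minimal self-contained sketch (the passage, entity, and offsets are made up for illustration, not taken from SREDFM): a newline inside the passage makes the offset slice differ from the "tidy" entity string even though the content matches, and collapsing whitespace on both sides makes the comparison content-based.

```python
import re

# Hypothetical passage containing a raw newline, and a "tidy" entity string.
passage = "Hà Nội is the capital\nof Vietnam."
entity = "capital of Vietnam"
start, end = 14, 32  # character offsets of the entity span in the passage

sliced = passage[start:end]
# Strict equality fails: the slice contains "\n" where the entity has " ".
assert sliced != entity

def squash_ws(s: str) -> str:
    """Collapse any whitespace run (including newlines) to a single space."""
    return re.sub(r"\s+", " ", s)

# After normalizing whitespace, the two strings agree.
assert squash_ws(sliced) == squash_ws(entity)
```

This is only a sketch of the comparison logic; whether the dataloader should normalize the text or adjust the offsets is a separate design decision.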

For 2, when I rechecked the schema, it checks the IDs defined at the example level to ensure they are unique. In your implementation, you define the entity and relation IDs only by an iteration counter, which leads to duplicates in the check. Perhaps the workaround is similar to what you did for the passage ID: append some unique identifier to distinguish the entity and relation IDs.
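A tiny sketch of why a bare counter collides (the document names and entity strings are hypothetical): each document restarts its counter at 0, so two documents both emit entity id "0"; prefixing with the doc id keeps ids globally unique.

```python
# Hypothetical documents, each with its own list of entity mentions.
docs = {
    "doc-a": ["Hanoi", "Vietnam"],
    "doc-b": ["Jakarta", "Indonesia"],
}

bare_ids, namespaced_ids = [], []
for docid, entities in docs.items():
    for i, ent in enumerate(entities):
        bare_ids.append(str(i))                    # "0", "1", "0", "1" -> collisions
        namespaced_ids.append(f"{docid}-ent-{i}")  # "doc-a-ent-0", ... -> unique

assert len(set(bare_ids)) < len(bare_ids)                # counter-only ids collide
assert len(set(namespaced_ids)) == len(namespaced_ids)   # namespaced ids are unique
```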

khelli07 commented 8 months ago

OK, will take another look at this soon.

khelli07 commented 8 months ago

For number 2), it is resolved now. I think I misunderstood the ID concept at first.

Now for problem number 1): yes, the problem is in the newlines. Content-wise, the text is the same.
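For the record, the resolved ID scheme might look roughly like the following hedged sketch. The example-level field names ("id", "passages", "entities", "relations") come from the earlier comment; everything else (`docid`, `texts`, the row layout, and the sample data) is hypothetical, not the actual loader code.

```python
def generate_examples(rows):
    """Yield (key, example) pairs, skipping repeated doc ids and
    namespacing sub-element ids with the doc id (illustrative sketch)."""
    seen = set()
    for row in rows:
        docid = row["docid"]
        if docid in seen:  # skip the same doc id
            continue
        seen.add(docid)
        yield docid, {
            "id": docid,
            "passages": [{"id": f"{docid}-passage-{i}", "text": t}
                         for i, t in enumerate(row["texts"])],
            "entities": [{"id": f"{docid}-ent-{i}", "text": e}
                         for i, e in enumerate(row["entities"])],
            "relations": [],
        }

rows = [
    {"docid": "d1", "texts": ["Hà Nội là thủ đô của Việt Nam."],
     "entities": ["Hà Nội", "Việt Nam"]},
    {"docid": "d1", "texts": ["Hà Nội là thủ đô của Việt Nam."],
     "entities": ["Hà Nội", "Việt Nam"]},  # duplicate docid, skipped
]
examples = dict(generate_examples(rows))  # only one example, all ids unique
```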

khelli07 commented 8 months ago

[screenshot] Omg, I passed the tests 😂

The problem actually lies in the real dataset. The entities in the source dataset have "tidy" whitespace (they don't contain newlines). If you think about it, if the real passages are messy, the entities taken from them should also be messy.

That being said, I suspect the source dataset does not actually take the entities from the passages, because the passages are a bit chaotic (in terms of whitespace).

holylovenia commented 8 months ago

[screenshot] Omg, I passed the tests 😂

The problem actually lies in the real dataset. The entities in the source dataset have "tidy" whitespace (they don't contain newlines). If you think about it, if the real passages are messy, the entities taken from them should also be messy.

That being said, I suspect the source dataset does not actually take the entities from the passages, because the passages are a bit chaotic (in terms of whitespace).

Awesome, hahaha. Glad to hear that you've resolved the issue! 👍