Closed SamuelCahyawijaya closed 6 months ago
This is interesting, I'll take this first and see whether this can be done under a week, else I might release the task to others.
Btw, @SamuelCahyawijaya do we have a increasing bonus system after some time if the issue hasn't been picked up by anyone?
I'll start on this once #62 has been reviewed. Will start it by next week.
Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.
Will try to get this done by EoW (since my other dataloader has been merged recently)
Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.
Ugh, didn't have the chance to do this. will release this and see if anyone else can take this instead.
Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.
Hi, I did worked on this. The dataloader works, but apparently the test are hard to get passed, especially the seacrowd schema one. IIRC, I was having issues with IDs (duplicate). Currently have no time yet to fix it.
Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.
Currently discussing it with @SamuelCahyawijaya
Basically I got two problems 1) Spacing problem (the text and text offset asserted not equal. The content is the same, but the spacing is different)
2) Id uniqueness problem
For second problem, I am still not sure why because:
yielded id
(the yield ..., { ... }
at the example loop)So for ID uniqueness, it is most likely coming from the seacrowd shema, BUT assuming that this is the ids the test check "id": example["docid"], --> skip the same doc id "passages": passages, --> use custom id (counter) "entities": entities, --> counter "relations": relations, --> counter
Can look at the current full code here
Basically I got two problems
- Spacing problem (the text and text offset asserted not equal. The content is the same, but the spacing is different)
- Id uniqueness problem
For second problem, I am still not sure why because:
- For source schema, I assume it only checks the
yielded id
(theyield ..., { ... }
at the example loop)- So for ID uniqueness, it is most likely coming from the seacrowd shema, BUT assuming that this is the ids the test check "id": example["docid"], --> skip the same doc id "passages": passages, --> use custom id (counter) "entities": entities, --> counter "relations": relations, --> counter
Can look at the current full code here
Hi @khelli07, are you still discussing it with @SamuelCahyawijaya or do you guys need another pair of eyes?
I think @SamuelCahyawijaya is currently busy. Might need another help.
For 1, I believe the text_offset was generated from the text field, but I saw on the offset text that a new line isn't present in the original text. Is it expected?
For 2, when I rechecked the schema, it checked the IDs defined on the example level to ensure they were unique. In your implementation, you're defining the entity & relation ID only by specifying an iteration counter, which leads to duplication on the check. Perhaps the workaround will be similar to what you did on the passage ID, appending it to some unique identifier for distinguishing entity ID and relation ID.
OK, will take a look into this again in near time.
For number 2) it is resolved now. I think I misunderstood the concept id
at first.
Now, for problem number 1) -> yes, the problem is in the newline. Content-wise is the same
omg, I passed the tests 😂
The problem actually lies in the real dataset. The entities from the source datasets have "tidy" whitespaces (they don't have newlines). If you think about it, if the real passages are messy, the entity taken from it should also be messy.
That being said, I suspect the source dataset does not actually take the entity from the passage cause the passage is a bit chaotic (in terms of whitespaces).
omg, I passed the tests 😂
The problem actually lies in the real dataset. The entities from the source datasets have "tidy" whitespaces (they don't have newlines). If you think about it, if the real passages are messy, the entity taken from it should also be messy.
That being said, I suspect the source dataset does not actually take the entity from the passage cause the passage is a bit chaotic (in terms of whitespaces).
Awesome, hahaha. Glad to hear that you've resolved the issue! 👍
Dataloader name:
sredfm/sredfm.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?sredfm