Closed sg-wbi closed 1 year ago
Great! Thank you. I randomly checked several documents of the data set and they were all valid.
Running the test scripts highlights some offset issues. There are very few of them, but for the BigBio schema we should strive for the best possible quality (which is achievable with reasonable effort) and fix these simple mistakes. Nearly all of the incorrect annotations just include an extra whitespace, or a period plus whitespace, at the end of the span.
WARNING:__main__:
Example:49 - entity:51 text:`lens capsule` != text_by_offset:`lens capsule.`
Example:32873 - entity:32885 text:`hippocampal neurons` != text_by_offset:`hippocampal neurons.`
Example:32873 - entity:32888 text:`hippocampal neurons` != text_by_offset:`hippocampal neurons `
Example:99560 - entity:99562 text:`cortical ER` != text_by_offset:`cortical ER `
[...]
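A minimal sketch of how such trailing characters could be trimmed from the offsets, assuming each entity carries its annotated text plus a `(start, end)` span into the passage. The helper name `realign_end` and the sample passage are hypothetical and not taken from the dataset script:

```python
def realign_end(passage, start, end, entity_text):
    """Shrink the end offset when the span only adds trailing whitespace/periods.

    Hypothetical helper: the span is changed only if the covered text starts
    with the annotated entity text and everything after it is whitespace or '.'.
    """
    covered = passage[start:end]
    if covered != entity_text and covered.startswith(entity_text):
        extra = covered[len(entity_text):]
        if all(ch.isspace() or ch == "." for ch in extra):
            return start, start + len(entity_text)
    # Anything else is left untouched so real mismatches stay visible.
    return start, end


# Made-up passage for illustration only.
passage = "Damage to the lens capsule. Further text."
start = passage.index("lens capsule")
end = start + len("lens capsule.")
new_start, new_end = realign_end(passage, start, end, "lens capsule")
print(repr(passage[new_start:new_end]))
```

Comparing against the annotated text, rather than blindly stripping punctuation, avoids cutting entities that legitimately end in a period.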
@mariosaenger thanks for checking this out. I added the fixes to the offsets in the bigbio view.
This is great @mariosaenger @sg-wbi, thanks for updating!
This fixes loading data from `annotations.csv`. Due to an error in creating a nested dictionary, only the annotations for the last figure caption were loaded. @davidkartchner since you expressed interest in this dataset, please make sure to use this updated version when it gets merged!
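For illustration, here is one common way a nested-dictionary error like the one described above can arise, together with a possible fix. This is a sketch under assumptions, not the actual loader code: the column names `pmcid` and `figure` come from this thread, while the `text` column, the sample rows, the figure name `Figure_1-A`, and both function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration; not real dataset content.
SAMPLE = """pmcid,figure,text
4772957,Figure_1-A,alpha
4772957,Figure_1-G,beta
4772957,Figure_1-G,gamma
"""


def load_overwriting(handle):
    """Buggy pattern: the inner dict is recreated for every row."""
    annotations = {}
    for row in csv.DictReader(handle):
        # Assigning a fresh dict discards all figures seen earlier for this pmcid.
        annotations[row["pmcid"]] = {row["figure"]: [row]}
    return annotations


def load_accumulating(handle):
    """Possible fix: create the nested levels once and append to them."""
    annotations = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle):
        annotations[row["pmcid"]][row["figure"]].append(row)
    return annotations


buggy = load_overwriting(io.StringIO(SAMPLE))
fixed = load_accumulating(io.StringIO(SAMPLE))
print(sorted(buggy["4772957"]))
print(sorted(fixed["4772957"]))
```

With the overwriting version only the last figure for a given `pmcid` survives, which matches the symptom reported here; the accumulating version keeps every figure and every row.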
Hats off to @mariosaenger for catching this!
I tested correctness against `pmcid` 4772957 and `figure` Figure_1-G.