bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

[FIX]: bioid: load annotations #875

Closed sg-wbi closed 1 year ago

sg-wbi commented 1 year ago

This fixes loading data from annotations.csv. Due to an error in building a nested dictionary, only the annotations for the last figure caption were loaded.
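
For illustration only, here is a minimal sketch of how a nested-dictionary bug of this kind can drop everything but the last figure. The row layout and the names `group_buggy` and `group_fixed` are hypothetical and are not taken from the loader script; the actual bug in bioid.py may have looked different.

```python
from collections import defaultdict

# Hypothetical rows: (pmcid, figure, annotation text). Not the real CSV schema.
rows = [
    ("4772957", "Figure_1-F", "VEGF-A"),
    ("4772957", "Figure_1-G", "RPE"),
    ("4772957", "Figure_1-G", "retina"),
]


def group_buggy(rows):
    """Replaces the whole inner dict whenever a new figure appears."""
    grouped = {}
    for pmcid, figure, text in rows:
        if figure not in grouped.get(pmcid, {}):
            # Bug: assigning a fresh dict discards all earlier figures.
            grouped[pmcid] = {figure: []}
        grouped[pmcid][figure].append(text)
    return grouped


def group_fixed(rows):
    """Uses a nested defaultdict so existing figures are never overwritten."""
    grouped = defaultdict(lambda: defaultdict(list))
    for pmcid, figure, text in rows:
        grouped[pmcid][figure].append(text)
    return grouped


buggy = group_buggy(rows)
fixed = group_fixed(rows)
print(sorted(buggy["4772957"]))  # only the last figure survives
print(sorted(fixed["4772957"]))  # both figures are kept
```

In the buggy variant only `Figure_1-G` remains for the document, which mirrors the symptom described above; the fixed variant keeps one list of annotations per figure.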

@davidkartchner since you expressed interest in this dataset, please make sure to use this updated version once it gets merged!

Hats off to @mariosaenger for catching this!

I tested correctness against pmcid 4772957 and figure Figure_1-G.

$ cat > test.txt <<EOL
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,5,51,138,143,5,6,,β-gal,,GO:0009341,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,11,62,428,431,3,3,,RPE,,Uberon:UBERON:0001782,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,2,45,43,49,6,6,,VEGF-A,,Uniprot:Q00731,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,12,63,433,459,26,26,,retinal pigment epithelium,,Uberon:UBERON:0001782,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,10,61,420,426,6,6,,retina,,Uberon:UBERON:0000966,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,9,57,224,229,5,5,,F4/80,,Uniprot:Q61549,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,6,52,151,157,6,6,,retina,,Uberon:UBERON:0000966,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,16,68,357,370,13,13,,retinal cells,,CL:0009004,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,3,47,51,56,5,6,,β-gal,,GO:0009341,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,14,65,492,511,19,19,,inner nuclear layer,,Uberon:UBERON:0001791,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,8,55,188,193,5,5,Ahyper,mouse,,NCBI taxon:10090,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,1,44,3,6,3,3,,RPE,,Uberon:UBERON:0001782,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,7,53,177,183,6,6,,VEGF-A,hypermouse,NCBI gene:22339,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,4,48,86,92,6,6,,retina,,Uberon:UBERON:0000966,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,15,67,378,397,19,19,,inner nuclear layer,,Uberon:UBERON:0001791,,,
sdPaper1440,10.15252/emmm.201505613,4772957,Figure_1-G,13,64,466,485,19,19,,outer nuclear layer,,Uberon:UBERON:0001789,,,
EOL
$ wc -l test.txt
16 test.txt

from datasets import load_dataset

# Load the BigBio KB view from the local loader script
ds = load_dataset('./bigbio/hub/hub_repos/bioid/bioid.py', 'bioid_bigbio_kb')
# Keep only the figure captions of the test document
figures_captions = [d for d in ds["train"] if d["document_id"] == "4772957"]
test_case = [f for f in figures_captions if f["passages"][0]["text"][0].startswith("G. RPE cells")][0]
# All 16 rows of test.txt should be loaded as entities
assert len(test_case["entities"]) == 16

mariosaenger commented 1 year ago

Great! Thank you. I spot-checked several documents in the dataset at random and they were all valid.

Running the test scripts highlights some offset issues. There are very few of them, but for the BigBio schema we should strive for the best quality achievable with reasonable effort and fix these simple mistakes. Here, nearly all of the incorrect annotations simply include a trailing whitespace, or a period plus whitespace, at the end of the span.

WARNING:__main__:
Example:49 - entity:51  text:`lens capsule` != text_by_offset:`lens capsule.`
Example:32873 - entity:32885  text:`hippocampal neurons` != text_by_offset:`hippocampal neurons.`
Example:32873 - entity:32888  text:`hippocampal neurons` != text_by_offset:`hippocampal neurons `
Example:99560 - entity:99562  text:`cortical ER` != text_by_offset:`cortical ER `

[...]

sg-wbi commented 1 year ago

@mariosaenger thanks for checking this out. I added the fixes to the offsets in the bigbio view.
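
As a rough sketch of what such an offset fix can look like (not necessarily the code that was merged), the helper below shrinks a span until it no longer ends in whitespace or a period. The function name `trim_span`, the `strip_chars` parameter, and the example caption are all hypothetical.

```python
def trim_span(text, start, end, strip_chars=" ."):
    """Shrink the half-open span [start, end) while it ends in a stripped character."""
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end


# Hypothetical caption; the annotated span wrongly includes the trailing period.
caption = "Staining of the lens capsule. Scale bar: 10 um."
start = caption.index("lens capsule")
end = start + len("lens capsule.")

new_start, new_end = trim_span(caption, start, end)
trimmed = caption[new_start:new_end]

# After trimming, the text recovered from the offsets matches the annotation text.
assert trimmed == "lens capsule"
```

One caveat with this approach: stripping periods unconditionally would also clip mentions that legitimately end in one (for example an abbreviation), so comparing the trimmed text against the annotated text before accepting the new offsets is a sensible safeguard.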

davidkartchner commented 1 year ago

This is great @mariosaenger @sg-wbi, thanks for updating!