clulab / reach

Reach Biomedical Information Extraction

Mixed-up fields in CMU output #556

Open adarshp opened 6 years ago

adarshp commented 6 years ago

While looking through some REACH output exported in the CMU format, I came across entries that have their fields mixed up. I’ve attached a (non-exhaustive) list of them to this post. They do not make up a large fraction of the data, but I thought it would be good to bring them to your attention nonetheless. The file ‘MalformedEntries.txt’ contains the entries that have wrong values for the fields “Database Name”, “PosReg Type”, and “NegReg Type”. I am not sure whether the bug is in the reader or the exporter part of the codebase, but I'll try to take a crack at fixing it.

MalformedEntries.txt

MihaiSurdeanu commented 6 years ago

Thank you!

@hickst: when you're back, can you please take a look?

adarshp commented 6 years ago

No problem. It seems that for the first two documents in the list, the malformed entries come from a section called 'List of abbreviations used':

https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:1198222&metadataPrefix=pmc

https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:1382219&metadataPrefix=pmc

Also, I realized I didn't include the header row in the MalformedEntries.txt file earlier. Here is an updated version:

MalformedEntries.txt
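For anyone who wants to screen other exported files for the same kind of problem, here is a minimal sketch of one way to flag suspect rows in a tab-separated file that has a header row. It is not part of REACH. The function name `find_suspect_rows` and the `allowed_values` mapping are hypothetical, and the set of valid values for each column is an assumption the caller has to supply, since this thread does not list them.

```python
import csv
import io


def find_suspect_rows(tsv_text, allowed_values):
    """Return (line_number, column, value) for each field outside its allowed set.

    tsv_text is the full text of a tab-separated file whose first line is
    the header row. allowed_values maps a column name to the set of values
    the caller considers valid for that column (an assumption, not taken
    from REACH itself).
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    suspects = []
    for row in reader:
        for column, allowed in allowed_values.items():
            # A row that is too short yields None for the missing columns.
            value = row.get(column)
            if value is None or value not in allowed:
                suspects.append((reader.line_num, column, value))
    return suspects


if __name__ == "__main__":
    # Placeholder data: the column names come from this thread, the values do not.
    sample = (
        "Database Name\tPosReg Type\n"
        "db_a\ttype_x\n"
        "type_x\tdb_a\n"
    )
    print(find_suspect_rows(sample, {"Database Name": {"db_a"}}))
```

The check only compares each value against a caller-supplied set, so it can surface swapped fields like the ones in the attached list, but it cannot tell whether the reader or the exporter introduced them.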

adarshp commented 6 years ago

Here is the tsv output from the REACH web service (http://agathon.sista.arizona.edu:8080/odinweb/uploader) processing the paper with PMID 1198222.

PMID 1198222 REACH tsv output