biothings / semmeddb


invalid values in `OBJECT_NAME` and `OBJECT_NOVELTY` #10

Closed erikyao closed 1 year ago

erikyao commented 1 year ago

The parsed data currently contains invalid values, for example:

| PREDICATION_ID | PMID | PREDICATE | SUBJECT_CUI | SUBJECT_NAME | SUBJECT_SEMTYPE | SUBJECT_NOVELTY | OBJECT_CUI | OBJECT_NAME | OBJECT_SEMTYPE | OBJECT_NOVELTY |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 102593865 | 9860854 | MEASURES | C0006779 | Calorimetry | lbpr | 1 | C0450304 | `0.018\",qnco"` | 1 | `<NA>` |
| 124903879 | 15767012 | MEASURES | C0427978 | Minimum Inhibitory Concentration measurement | lbpr | 1 | C0450303 | `0.016\",qnco"` | 1 | `<NA>` |
| 135664493 | 21134812 | MEASURES | C0013786 | Electric Stimulation | lbpr | 1 | C0450300 | `0.010\",qnco"` | 1 | `<NA>` |
| 157862853 | 18701151 | MEASURES | C0441633 | Scanning | diap | 1 | C0450305 | `0.022\",qnco"` | 1 | `<NA>` |
| 162061499 | 25698650 | MEASURES | C0849974 | Pulmonary Function Test/Forced Expiratory Volume 1 | diap | 1 | C0450305 | `0.022\",qnco"` | 1 | `<NA>` |
| 180419236 | 31049562 | MEASURES | C0596927 | microcalorimetry | lbpr | 1 | C0450303 | `0.016\",qnco"` | 1 | `<NA>` |

So the parser needs 2 more filters, one on OBJECT_NAME and one on OBJECT_NOVELTY.

erikyao commented 1 year ago

It turns out that the error is not in the source data file, but in the Pandas `read_csv` options we use.

The concept name of C0450304 is 0.018 inch, written as "0.018\"" in the SemMedDB file, representing the quoted, escaped string of 0.018".

Together, OBJECT_NAME, OBJECT_SEMTYPE, and OBJECT_NOVELTY form the text input `"0.018\"","qnco","1"`. Pandas should split this into 3 strings, `0.018"`, `qnco`, and `1`, but it can only do so if it knows that `\` is the escape character.

However, `read_csv` defaults to `escapechar=None` rather than `escapechar="\\"`, so it (weirdly) splits the text input into 2 strings, `0.018\",qnco"` and `1`. (I still cannot figure out how Pandas arrives at this result...)
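One possible explanation, offered as a guess rather than something confirmed in this thread: without an escape character, `\"` is not special, so the `""` that follows the backslash is read as a doubled (literal) quote and the quoted field stays open across the comma. Python's standard-library `csv` module, used here only as a stand-in for the Pandas parser, produces the same 2-field split. The sample line is a hand-written fragment, not a line copied from the SemMedDB file.

```python
import csv
import io

# Hand-written fragment mimicking OBJECT_NAME, OBJECT_SEMTYPE, OBJECT_NOVELTY.
# The actual text is: "0.018\"","qnco","1"
line = '"0.018\\"","qnco","1"\n'

# No escape character: the "" after the backslash counts as a doubled quote,
# so the first field swallows the comma and the next opening quote.
default_fields = next(csv.reader(io.StringIO(line)))

# With backslash as the escape character, \" is a literal quote inside the field.
escaped_fields = next(csv.reader(io.StringIO(line), escapechar="\\"))

print(default_fields)  # ['0.018\\",qnco"', '1']
print(escaped_fields)  # ['0.018"', 'qnco', '1']
```

If the Pandas C parser follows the same quoting rules here, that would account for the `0.018\",qnco"` value and for the trailing `<NA>`, since the affected row ends up one field short.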

Adding the option escapechar="\\" should be the easy fix.
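A minimal sketch of the fix, assuming a comma-separated, double-quoted input. The one-row sample and the variable names are made up for illustration; the real parser reads the SemMedDB file with its own set of options.

```python
import io

import pandas as pd

# Hypothetical one-row sample: OBJECT_CUI, OBJECT_NAME, OBJECT_SEMTYPE, OBJECT_NOVELTY.
# The actual text is: "C0450304","0.018\"","qnco","1"
raw = '"C0450304","0.018\\"","qnco","1"\n'

# Default options: the backslash is not treated as an escape character.
df_default = pd.read_csv(io.StringIO(raw), header=None, dtype=str)

# With escapechar set, \" is read as a literal quote inside the field.
df_escaped = pd.read_csv(io.StringIO(raw), header=None, dtype=str, escapechar="\\")

print(df_default.shape)               # (1, 3): name and semtype were merged
print(df_escaped.iloc[0].tolist())    # ['C0450304', '0.018"', 'qnco', '1']
```

Note that with `escapechar="\\"` the backslash itself is dropped, so OBJECT_NAME comes out as `0.018"`.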