Closed by erikyao 1 year ago
It turns out that this is not an error in the source data file, but in the Pandas `read_csv` options we use.
The concept name of `C0450304` is `0.018 inch`, written as `"0.018\""` in the SemMedDB file, representing the quoted, escaped string of `0.018"`.
The `OBJECT_NAME`, `OBJECT_SEMTYPE`, and `OBJECT_NOVELTY` columns form the text input `"0.018\"","qnco","1"`, which Pandas should have split into 3 strings, `0.018\"`, `qnco`, and `1`. It can do so only if it knows that `\` is the escape character.
However, Pandas `read_csv` does not set `escapechar="\\"` by default, so somehow (weirdly) it splits the text input into 2 strings, `0.018\",qnco"` and `1`. (I still cannot figure out how Pandas arrives at this result...)
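One possible explanation for the 2-string result, offered as a sketch rather than a confirmed diagnosis: without an escape character the backslash is an ordinary character, so the `""` that follows it is read as a doubled (escaped) quote and the quoted field does not close where intended. The snippet below reproduces the same split with Python's standard `csv` module, used here as a stand-in. That the Pandas parser applies the same rules is an assumption, although the output matches the strings reported above.

```python
import csv

# The three columns exactly as they appear in the file.
# A raw string keeps the backslash as a literal character.
raw = r'"0.018\"","qnco","1"'

# Default dialect: no escapechar, doublequote=True.
# The backslash is ordinary text, so the "" after it counts as one escaped quote.
default_fields = next(csv.reader([raw]))

# Same line, with the backslash declared as the escape character.
escaped_fields = next(csv.reader([raw], escapechar="\\"))

print(default_fields)   # 2 fields
print(escaped_fields)   # 3 fields
```

With the default dialect the first field swallows the comma and `qnco`; with the escape character declared, the line splits into three fields and the first one unescapes to `0.018"`.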
Adding the option `escapechar="\\"` should be the easy fix.
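A minimal sketch of that fix, assuming a one-line input that holds only the three columns discussed here (the real SemMedDB file has more columns, and the actual parser's other `read_csv` arguments are not shown in this thread):

```python
import io

import pandas as pd

# One line containing only the three columns discussed above (illustrative input).
raw = r'"0.018\"","qnco","1"' + "\n"

# Column names taken from the discussion; the full file has additional columns.
names = ["OBJECT_NAME", "OBJECT_SEMTYPE", "OBJECT_NOVELTY"]

fixed = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=names,
    dtype=str,          # keep "1" as a string so the row is easy to compare
    escapechar="\\",    # treat the backslash as the escape character
)

row = fixed.iloc[0].tolist()
print(row)
```

With `escapechar` set, the row should come back as three values, with `OBJECT_NAME` unescaped to `0.018"`.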
The existing invalid values are as below:
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
So 2 more filters, on `OBJECT_NAME` and `OBJECT_NOVELTY`, are needed in the parser.