invalid values in `OBJECT_NAME` and `OBJECT_NOVELTY`

| PREDICATION_ID | PMID | PREDICATE | SUBJECT_CUI | SUBJECT_NAME | SUBJECT_SEMTYPE | SUBJECT_NOVELTY | OBJECT_CUI | OBJECT_NAME | OBJECT_SEMTYPE | OBJECT_NOVELTY |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 102593865 | 9860854 | MEASURES | C0006779 | Calorimetry | lbpr | 1 | C0450304 | `0.018\",qnco"` | 1 | `<NA>` |
| 124903879 | 15767012 | MEASURES | C0427978 | Minimum Inhibitory Concentration measurement | lbpr | 1 | C0450303 | `0.016\",qnco"` | 1 | `<NA>` |
| 135664493 | 21134812 | MEASURES | C0013786 | Electric Stimulation | lbpr | 1 | C0450300 | `0.010\",qnco"` | 1 | `<NA>` |
| 157862853 | 18701151 | MEASURES | C0441633 | Scanning | diap | 1 | C0450305 | `0.022\",qnco"` | 1 | `<NA>` |
| 162061499 | 25698650 | MEASURES | C0849974 | Pulmonary Function Test/Forced Expiratory Volume 1 | diap | 1 | C0450305 | `0.022\",qnco"` | 1 | `<NA>` |
| 180419236 | 31049562 | MEASURES | C0596927 | microcalorimetry | lbpr | 1 | C0450303 | `0.016\",qnco"` | 1 | `<NA>` |

It turns out that this is not an error in the source data file but in the Pandas `read_csv` options we use.

The concept name of C0450304 is "0.018 inch". The SemMedDB file writes it as `"0.018\""`, that is, the quoted string `0.018"` with its inner double quote escaped by a backslash.

Together, OBJECT_NAME, OBJECT_SEMTYPE, and OBJECT_NOVELTY appear in the file as the text `"0.018\"","qnco","1"`. Pandas would split this into 3 strings, `0.018"`, `qnco`, and `1`, but only if it knew that `\` is the escape character.

However, `read_csv` does not default to `escapechar="\\"`, so it instead splits the text into 2 strings, `0.018\",qnco"` and `1`, which leaves OBJECT_NOVELTY empty. (I still cannot figure out how Pandas arrives at this result.)
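One plausible explanation, offered as a guess rather than a confirmed reading of the Pandas parser: without an escape character, the `\` is an ordinary character, and the `""` that follows it is read as a doubled quote, meaning one literal `"` inside a field that is still open. The field then swallows the comma and runs on until the next delimiter. The sketch below reproduces that behaviour with Python's standard `csv` module, on the assumption that Pandas follows the same quoting rules (its `read_csv` exposes the same `doublequote` and `escapechar` options). The variable names are illustrative only.

```python
import csv
import io

# The three trailing fields exactly as they appear in the SemMedDB file.
raw = r'"0.018\"","qnco","1"' + "\n"

# Default dialect: no escape character, and "" inside a quoted field
# counts as one literal quote (doublequote=True).
default_fields = next(csv.reader(io.StringIO(raw)))
print(default_fields)

# With the backslash declared as the escape character, \" is a literal
# quote and the following " closes the field as intended.
escaped_fields = next(csv.reader(io.StringIO(raw), escapechar="\\"))
print(escaped_fields)
```

With the default dialect the first field comes back as `0.018\",qnco"`, which matches the invalid OBJECT_NAME values in the table above.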

Passing the option `escapechar="\\"` to `read_csv` should be the easy fix.
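A minimal sketch of the fix on a cut-down row follows. The four-column layout, the `names=` list, and `dtype=str` are assumptions made for illustration; the real parser call in this repository may pass different options.

```python
import io

import pandas as pd

# Hypothetical cut-down row: only the last four columns of a record.
columns = ["OBJECT_CUI", "OBJECT_NAME", "OBJECT_SEMTYPE", "OBJECT_NOVELTY"]
raw = r'"C0450304","0.018\"","qnco","1"' + "\n"

# Default options: the escaped quote is misread, two fields merge,
# and the last column is left empty.
broken = pd.read_csv(io.StringIO(raw), names=columns, dtype=str)

# Same call with the backslash declared as the escape character.
fixed = pd.read_csv(io.StringIO(raw), names=columns, dtype=str, escapechar="\\")

print(broken.iloc[0].tolist())
print(fixed.iloc[0].tolist())
```

Note that with `escapechar` set, the parsed OBJECT_NAME is `0.018"` without the backslash, since the backslash is consumed as the escape.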
