OHDSI / Athena

Web application for distributing and browsing the Standardized Vocabularies for all instances of an OMOP CDM

CPT4 UMLS API process causing insertion of carriage returns in "CONCEPT.csv" #333

Open odikia opened 1 year ago

odikia commented 1 year ago

I currently have to clean the final concept.csv before loading it into a PostgreSQL database, after adding the CPT4 codes via the cpt.bat process described in the instructions that come with the vocabulary download from Athena.

PostgreSQL (run in the psql CLI):

\copy omop.concept FROM '\path\to\modified\concept.csv' WITH (FORMAT CSV, DELIMITER E'\t', QUOTE E'\b', ENCODING 'UTF8', HEADER TRUE)

Query returns:

ERROR: unquoted carriage return found in data
HINT: Use quoted CSV field to represent carriage return.
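One way to confirm which rows trigger this error is to look for lines where a carriage return precedes the line feed. The following is a minimal Python sketch, assuming the file fits in memory; the function name `find_crlf_lines` and the sample bytes are illustrative and not part of Athena or the CPT4 tooling.

```python
from typing import List


def find_crlf_lines(data: bytes) -> List[int]:
    """Return 1-based numbers of lines that end with CRLF rather than LF."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if line.endswith(b"\r")  # a CR left over right before the LF
    ]


# Illustrative tab-delimited sample: only the second row ends with CRLF.
sample = b"concept_id\tconcept_name\n1\tfoo\r\n2\tbar\n"
crlf_rows = find_crlf_lines(sample)
```

For a real file, the bytes could be read with `open(path, "rb").read()`, where `path` points at the reconstituted concept.csv.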

System and File information

Included data file with 4 error examples: see attached. Note that pulling down the UMLS CPT4 codes requires a license. I have redacted the concept name and concept code in the 4 error examples so that sharing this document does not infringe the license. The OMOP information provided by Odysseus, including the concept IDs, remains.

OMOP Vocabulary version: v5.0 23-JAN-23

Java info: Version 8 Update 361 (build 1.8.0_361-b09)

Target Database version: PostgreSQL 14.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit

System info: Processor: 12th Gen Intel(R) Core(TM) i7-1270P 2.20 GHz; Installed RAM: 32.0 GB (31.4 GB usable); System type: 64-bit operating system, x64-based processor

Windows info: Edition: Windows 10 Enterprise; Version: 21H2; Installed on: 7/20/2022; OS build: 19044.2846; Experience: Windows Feature Experience Pack 120.2212.4190.0

CONCEPT_first_4_cpt4_errors.csv

mik-ohdsi commented 1 year ago

@odikia - Daniel, this is odd... let me double check if we have changed anything recently about the cpt4.jar.

mik-ohdsi commented 1 year ago

@odikia - it looks as if we have been doing this for a while now. I can confirm that, after reconstitution, all CPT4 rows in concept.csv appear to end with CRLF instead of just LF. Have you always updated your vocabularies the same way, and if so, when was the last time you were able to do so without an error?
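Until the cause is resolved, a possible workaround is to normalize the line endings before running `\copy`. This is a minimal Python sketch, not the project's official fix. It assumes every carriage return to be removed is part of a CRLF pair, and the helper names and file paths are hypothetical.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF; other bytes are untouched."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as raw bytes and write an LF-only copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(normalize_line_endings(data))


# Illustrative check on in-memory bytes with mixed line endings.
mixed = b"1\tfoo\r\n2\tbar\n"
cleaned = normalize_line_endings(mixed)

# Hypothetical usage on disk (both paths are placeholders):
# convert_file("CONCEPT.csv", "CONCEPT_lf.csv")
```

Working on raw bytes avoids any re-encoding of the UTF-8 content, so only the line terminators change.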