MIT-LCP / mimic-code

MIMIC Code Repository: Code shared by the research community for the MIMIC family of databases
https://mimic.mit.edu
MIT License

MonetDB import issues - chartevents.value - backslashes in the text #65

Closed: parisni closed this issue 7 years ago

parisni commented 8 years ago

Hello,

Some "chartevents.value" values contains backslashes. (eg: rowid = 43077585; itemid= 807; value= "\113"; I haven't count all occurences) Some database (at least monetDB) must double backslashes in order to bulk load data. Moreover this seems to disturb "chartevents.valuenum" to be computed at least for itemid 807 that need numerical values.

For now my workaround is to remove the backslashes from the CSV with the following command:

sed -i 's/\\//g' CHARTEVENTS.csv

Thanks,

alistairewj commented 8 years ago

I had seen some of these rows. When I checked, it looked like sometimes "\20" meant "Over 20". That's why we didn't convert them to VALUENUM across the rows.

We could easily make an export where we replace all "\" with "\\". Would that fix the import for MonetDB, and maintain the data as it was originally?
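
Something like the following sed pass could produce that export (just a sketch; the output file name is only for illustration):

# Double every backslash, so a loader that treats "\" as an escape
# character reads "\\113" back as the original "\113".
sed 's/\\/\\\\/g' CHARTEVENTS.csv > CHARTEVENTS_escaped.csv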

parisni commented 8 years ago

Note that GitHub also behaves strangely with backslashes (in your comment, the doubled backslash rendered as a single "\"). You mean escaping the backslashes? That would fix the problem for MonetDB. However, I am not sure of the implications for all the other tables; maybe you will find "\20" in some of them. At least PostgreSQL will treat them as a single backslash too. Anyway, escaping backslashes is good practice, and I would go for that. Thanks.

alistairewj commented 8 years ago

Right now we only escape quotes with double quotes. I think PostgreSQL expects only quotes to be escaped with double quotes, and ignores everything else. But for other database systems we may need to escape other characters. I guess a relevant question is: are there any other characters that need to be escaped for MonetDB / Apache Drill?
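
For reference, a made-up CSV fragment showing that convention (columns trimmed to row_id, itemid, value; the first row mirrors the example reported above, the second is invented to show quote-doubling):

row_id,itemid,value
43077585,807,"\113"
99999999,807,"pt stated ""over 20"""

The embedded quotes are escaped by doubling them inside the quoted field, while the backslash is written as-is.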

parisni commented 8 years ago

While I can see the reasons for escaping quotes (CSV format) and backslashes (octal numbers, newlines, tabs, ...), I can't think of any other characters that would need to be escaped for MonetDB, Drill, or any other database system.

I'll come back in a couple of hours to confirm that escaping backslashes actually works for NOTEEVENTS (for now I have just removed them).

parisni commented 8 years ago

It appears there are only 4 TEXT values containing lone backslashes (row_id = 637741, 38125, 21886, 8168). Escaping them fixes the problem for MonetDB. Drill has other problems with the notes, not related to this; still working on it.
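
For reference, a rough way to locate them (assuming the uncompressed NOTEEVENTS.csv; since notes span several physical lines, grep reports line numbers rather than row_id):

# List the physical lines of NOTEEVENTS.csv that contain a backslash.
grep -n '\\' NOTEEVENTS.csv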

parisni commented 8 years ago

A word on that: further investigation leads me to conclude that MonetDB does not like \xxx where x is a digit. In CHARTEVENTS, the following row_id contain such a pattern: 6841, 8168, 10462, 21886, 26958, 35247, 38125, 384509, 637741. But only 4 of them make the MonetDB parser fail (637741, 38125, 21886, 8168); these contain \325, \320, \250, \300, \234. The other row_id (6841, 10462, 26958, 35247, 384509) work, and contain \104, \106, \175, \539, \120.

So it looks like \2xx or \3xx is the pattern that fails. I guess MonetDB understands them as octal character escapes (see http://geophysics.eas.gatech.edu/classes/Intro_GMT/gmt_www/gmt/doc/html/GMT_Docs/node160.html), i.e. it converts the octal code to its character equivalent and fails when the resulting byte is not a valid character, or something along those lines. I have to conclude that all this is not very interesting, but backslashes should be doubled for MonetDB (not for Drill).
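
A quick sanity check of that octal reading (just a sketch; the shell's printf also interprets \nnn as an octal byte):

# \104 is octal for 0x44, plain ASCII 'D', so such a value decodes cleanly:
printf '\104\n'
# \325 is octal for 0xD5, which is not ASCII (and not a valid standalone
# UTF-8 byte), which would explain why those rows break the parser:
printf '\325\n' | od -An -c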

alistairewj commented 7 years ago

Hey Nicolas - hope all is well. Would be interested in your experience building MIMIC-III v1.4 on monetDB. In particular, do you know if monetDB can copy from stdin? For example, with the postgres scripts, we allow loading from compressed files as follows:

\copy CHARTEVENTS FROM PROGRAM 'gzip -dc CHARTEVENTS.csv.gz' DELIMITER ',' CSV HEADER NULL ''

I notice that in monetDB you use a very similar command to PostgreSQL's copy:

COPY 263201375 OFFSET 2 RECORDS INTO MIMICIII.CHARTEVENTS FROM '/path/to/CHARTEVENTS.csv' USING DELIMITERS ',','\n','"' NULL AS '';

The docs have this to say:

When the data is not directly accessible to the server but is accessible to the client, the 'file' in the above queries can be replaced by STDIN:

COPY INTO table FROM STDIN;

When using STDIN, the contents of the CSV file should follow the query. This form is not directly usable in ODBC or JDBC, but see SQLcopyinto.java for an example of how to use this from Java. If using the mclient tool and the CSV data is in a file by itself, you can use the command (also see the mclient manual):

mclient -d database -s "COPY INTO table FROM STDIN USING DELIMITERS ',','\\n','\"'" - < file

I'm not entirely sure how to translate this into the query that we have right now, but here is my guess:

COPY 263201375 OFFSET 2 RECORDS INTO MIMICIII.CHARTEVENTS FROM STDIN USING DELIMITERS ',','\n','"' NULL AS '' < "sed 's/\\/\\\\/g' /path/to/CHARTEVENTS.csv"

What do you think?

parisni commented 7 years ago

Hi Alistair,

Postgres has a "FROM STDIN" syntax too. However, it is not possible to use it from a client, i.e. by just sending a SQL query. I have, however, used this syntax from within a Java program (Talend). "FROM PROGRAM" behaves differently and can run inside a SQL query.
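
(For what it's worth, psql can feed its own standard input to a server-side COPY ... FROM STDIN, so something along these lines should also work on the Postgres side; a sketch, with the database name assumed:)

# Stream the decompressed CSV into a server-side COPY ... FROM STDIN via psql.
gzip -dc CHARTEVENTS.csv.gz | psql -d mimic -c "COPY CHARTEVENTS FROM STDIN WITH (FORMAT csv, HEADER, NULL '')"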

That said, I guess the MonetDB equivalent using "FROM STDIN" would be:

mclient -d mimic -s "COPY INTO MIMICIII.CHARTEVENTS FROM STDIN USING DELIMITERS ',','\n','\"' NULL AS ''" - < <(sed 's/\\/\\\\/g' /path/to/CHARTEVENTS.csv)

Maybe this is better than the queries I wrote in the past, because it includes the "sed".

I will give it a try ASAP!

PS1: I am testing a Hadoop/Hive database these days with the ORC format (https://orc.apache.org/) - a very interesting format for data warehousing, with its ACID compliance - in order to integrate our French ICU data.

PS2: I am also testing the PostgreSQL Foreign Data Wrapper these days, and have been able to link Hadoop/Hive into PostgreSQL. This looks promising as a hybrid platform.

parisni commented 7 years ago

Hi Alistair,

Here is the code that worked for me:

gzip -dck /your/path/to/NOTEEVENTS.csv.gz | sed 's/\\/\\\\/g' | sed 1d | mclient -d mimic -s "COPY INTO MIMICIII.NOTEEVENTS FROM STDIN USING DELIMITERS ',','\n','\"' NULL AS ''" -
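
The same pattern should work for the other tables, e.g. CHARTEVENTS (a sketch, I have not re-run this one):

gzip -dck /your/path/to/CHARTEVENTS.csv.gz | sed 's/\\/\\\\/g' | sed 1d | mclient -d mimic -s "COPY INTO MIMICIII.CHARTEVENTS FROM STDIN USING DELIMITERS ',','\n','\"' NULL AS ''" -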

notice that :

Do you want a pull request with the modified statements?

alistairewj commented 7 years ago

Thanks for pull request #131. It's definitely better, especially because the row specification just "stopped" at that number of rows in MonetDB, which could easily result in a silently corrupted database. Going to close this!