MIT-LCP / mimic-code

MIMIC Code Repository: Code shared by the research community for the MIMIC family of databases
https://mimic.mit.edu
MIT License

SQLite import for mimic3 gives mixed column type warning #1237

Open armando-fandango opened 2 years ago

armando-fandango commented 2 years ago

Description

While trying to import mimic3 into SQLite with import.py, I get the following error:

Starting processing DATETIMEEVENTS.csv.gz
mimic-code/mimic-iii/buildmimic/sqlite/import.py:25: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(f, index_col="ROW_ID", chunksize=CHUNKSIZE):
...
Starting processing INPUTEVENTS_CV.csv.gz
/home/armando/projects/mimic-code/mimic-iii/buildmimic/sqlite/import.py:25: DtypeWarning: Columns (20,21) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(f, index_col="ROW_ID", chunksize=CHUNKSIZE):
...
Starting processing NOTEEVENTS.csv.gz
/home/armando/projects/mimic-code/mimic-iii/buildmimic/sqlite/import.py:25: DtypeWarning: Columns (4,5) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(f, index_col="ROW_ID", chunksize=CHUNKSIZE):
...
Starting processing CHARTEVENTS.csv.gz
/home/armando/projects/mimic-code/mimic-iii/buildmimic/sqlite/import.py:25: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(f, index_col="ROW_ID", chunksize=CHUNKSIZE):
...
pshuwei commented 1 year ago

Hi, I am also running import.py and ran into the same problem...

Did you manage to figure it out or find an alternative solution?

alistairewj commented 1 year ago

It's not strictly an error but it may result in an inconsistent data load (I haven't checked). Essentially the load uses pandas as a convenience. pandas tries a low memory load, fails, and reverts to a high memory load. It can be fixed by specifying the known data types for each table in the read_csv call.
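
As a rough sketch of what specifying the types could look like (not the actual import.py change): the `TABLE_DTYPES` mapping, the `read_chunks` helper, and the table and column names below are all hypothetical placeholders, and the real names and types would have to come from the MIMIC-III schema. The idea is only to pass a per-table `dtype` dict to `pd.read_csv` so pandas does not have to infer a type for columns that would otherwise trigger the warning.

```python
import io

import pandas as pd

# Hypothetical mapping: file name -> {column name: dtype}.
# These names and types are placeholders for illustration only; they are
# NOT taken from the MIMIC-III schema.
TABLE_DTYPES = {
    "EXAMPLE.csv.gz": {"VALUE": str, "WARNING": "float64"},
}


def read_chunks(f, table_name, chunksize=10000):
    """Yield DataFrame chunks, using explicit dtypes when the table is known."""
    dtypes = TABLE_DTYPES.get(table_name)  # None means pandas infers the types
    yield from pd.read_csv(f, index_col="ROW_ID", chunksize=chunksize, dtype=dtypes)


# Small in-memory demo: VALUE mixes numbers and text, WARNING has a gap.
csv_text = "ROW_ID,VALUE,WARNING\n1,1,0\n2,abc,\n3,2.5,1\n"
chunks = list(read_chunks(io.StringIO(csv_text), "EXAMPLE.csv.gz", chunksize=2))
for chunk in chunks:
    print(chunk.dtypes)
```

Because VALUE is declared as `str`, every chunk keeps it as text regardless of what the first rows look like, so the loaded column is consistent across chunks.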

armando-fandango commented 1 year ago

Since the column types are known in advance and will not change (it's a frozen snapshot dataset), would it be good to add the column types to the import script? I can send a pull request if this solution is acceptable.

alistairewj commented 1 year ago

Yes it would for sure, and yes we would love a PR!