Be explicit about the datatypes of each column in csv files

ablack3 commented 1 week ago

We have Eunomia CDM datasets stored in csv files. Currently, the datatype of each column is not explicitly specified when the data is read from csv, which is causing #65.

In this PR I'm using the specification in the CommonDataModel package to set the datatypes explicitly when we read the csv files, which should fix the issue. However, this does mean that the column order matters.
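To illustrate why explicit types can make column order matter, here is a minimal Python sketch. It is not the code in this PR (Eunomia is an R package). The spec excerpt, the function name `read_typed_by_position`, and the sample data are all hypothetical. It assumes types are assigned by position, which is one way an order dependence can arise.

```python
import csv
import io
from datetime import date

# Hypothetical excerpt of a column specification, in spec order:
# (column name, converter). Not read from the CommonDataModel package.
OBSERVATION_PERIOD_SPEC = [
    ("observation_period_id", int),
    ("person_id", int),
    ("observation_period_start_date", date.fromisoformat),
    ("observation_period_end_date", date.fromisoformat),
]


def read_typed_by_position(csv_text, spec):
    """Parse csv text, casting the i-th field with the i-th converter in spec."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header; types are assigned purely by position
    rows = []
    for record in reader:
        rows.append(tuple(convert(value) for (_, convert), value in zip(spec, record)))
    return rows


# Made-up sample data whose column order matches the spec.
SAMPLE = (
    "observation_period_id,person_id,"
    "observation_period_start_date,observation_period_end_date\n"
    "1,42,2000-01-01,2010-12-31\n"
)

rows = read_typed_by_position(SAMPLE, OBSERVATION_PERIOD_SPEC)
```

If the file's columns were in a different order, the date converter would be applied to an integer column (or the reverse), and the read would fail or produce wrongly typed values.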

I'm not sure whether we consider column order (first, second, etc.) part of the CDM specification, but I noticed that in the GiBleed dataset the column order does not match the order in the CommonDataModel specification csv. We can work around it and/or fix the file. Allowing columns in any order is a bit trickier, but possible.
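As a rough sketch of the "any order" option, types could be looked up by column name from the header instead of by position. This is an illustration in Python rather than the package's R code. The `CONVERTERS` mapping and `read_typed_by_name` are hypothetical names, and the sample data is made up.

```python
import csv
import io
from datetime import date

# Hypothetical name-to-converter mapping (illustrative, not the real spec file).
CONVERTERS = {
    "observation_period_id": int,
    "person_id": int,
    "observation_period_start_date": date.fromisoformat,
    "observation_period_end_date": date.fromisoformat,
}


def read_typed_by_name(csv_text, converters):
    """Parse csv text, choosing each field's converter from its header name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    unknown = [name for name in reader.fieldnames if name not in converters]
    if unknown:
        raise ValueError(f"columns missing from the specification: {unknown}")
    return [
        {name: converters[name](value) for name, value in record.items()}
        for record in reader
    ]


# The columns are deliberately out of spec order here.
SHUFFLED = (
    "person_id,observation_period_end_date,"
    "observation_period_id,observation_period_start_date\n"
    "42,2010-12-31,1,2000-01-01\n"
)

typed_rows = read_typed_by_name(SHUFFLED, CONVERTERS)
```

The trade-off is that this relies on the header names matching the specification exactly, so a misspelled or differently cased column name becomes an error instead of a silently mistyped column.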

ablack3 commented 1 week ago

I need to investigate and fix the failing tests.

fdefalco commented 1 week ago

Thanks for looking into this. It's another reason the duckdb-based data examples are a nice direction to go in.

fdefalco commented 1 week ago

For the column order, I would suggest that the data files should match the order of the columns defined by the CDM specification. Would we rather update the data files to follow that column order as a fix?

ablack3 commented 1 week ago

> For the column order, I would suggest that the data files should match the order of the columns defined by the CDM specification. Would we rather update the data files to follow that column order as a fix?

That would be my preference as well. So we would require csv files to have their columns in the same order as the CommonDataModel specification.
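If column order becomes a requirement, a check along these lines could report a mismatch before any data is loaded. This is a Python sketch under assumptions, not a proposal for the actual R implementation. `check_column_order` and the expected column list are hypothetical.

```python
import csv
import io

# Hypothetical expected order for one table (illustrative excerpt only).
EXPECTED_COLUMNS = [
    "observation_period_id",
    "person_id",
    "observation_period_start_date",
    "observation_period_end_date",
]


def check_column_order(csv_text, expected):
    """Return the header if it matches the expected order, else raise ValueError."""
    header = next(csv.reader(io.StringIO(csv_text)))
    if header != expected:
        raise ValueError(
            f"column order mismatch: expected {expected}, found {header}"
        )
    return header


# A file whose header follows the expected order passes the check.
GOOD_FILE = ",".join(EXPECTED_COLUMNS) + "\n1,42,2000-01-01,2010-12-31\n"
header = check_column_order(GOOD_FILE, EXPECTED_COLUMNS)
```

A file like the GiBleed one described above, with the same columns in a different order, would raise an error that names both orders, which makes the data file easy to fix.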

OHDSI / Eunomia

Be explicit about the datatypes of each column in csv files #68