OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel
883 stars 450 forks source link

Time storage in CDM v5 #6

Closed wistephens closed 7 years ago

wistephens commented 9 years ago

I'm ETLing data from a MSSQL data warehouse into CDM v5 on Postgres. When I cast the time portion of a timestamp into a TIME, I'm getting strings that are greater than the varchar(10) limit on several CDM v5 tables. The max string length for time appears to be 11 characters.

CAST('11/13/2013 12:00:00 AM' AS TIME) => '12:00:00 AM' (11 characters)

This caused errors when attempting to load time data.

If we're using the ANSI SQL time formats, as specified in the documentation, the max length of the time fields should be altered to account for the following specified ANSI format: "HH:MM.SS.MMMM", which assumes that we're all using 24 hour time.

Perhaps the time fields should use the "time" data type rather than strings?

cgreich commented 9 years ago

Bill: Do you have the correct DDLs?

wistephens commented 9 years ago

Yes, the current DDL for Postgres specifies varchar(10) for all time columns.

Specimen example: specimen_time VARCHAR(10)

cgreich commented 9 years ago

Oh. I meant to say, do you have DDLs the way you want them, with the type "time"?

pbr6cornell commented 9 years ago

George had proposed that we consider making the time fields to have datetime type . The downside is the date portion would be redundant with the date field, but the good part is this would allow a recognized data format that could work cross platform. It seems a reasonable proposal to me and probably better than having a varchar field that isn't strictly regulated. Bill, would that meet your needs if adopted in a future cdm release? We haven't yet seen anyone with a real analytical use case for the time fields yet. On Feb 10, 2015 11:05 PM, "Christian Reich" notifications@github.com wrote:

Oh. I meant to say, do you have DDLs the way you want them, with the type "time"?

— Reply to this email directly or view it on GitHub https://github.com/OHDSI/CommonDataModel/issues/6#issuecomment-73831287.

wistephens commented 9 years ago

We've updated our local version of the Postgres DDL to use time data type. This is working with our current ETL process.

Patrick, the specification of separate time field on some tables seems inconsistent. I think using datetime for instead of date on the tables (visit, observation, specimen, measurement, note) allows us to support our scenarios without the need to perform a bunch of cast/converts in the extract.

gracebrownecodes commented 9 years ago

+1 for datetime values across the model. That's what we've decided (after long conversation) to go with for PEDSnet.

schuemie commented 9 years ago

Just a cautionary note on using datetime everywhere: a lot of our applications rely on the fact that if you subtract two dates you get an integer (representing the number of days). For datetimes the result is an interval that needs further processing.

I'm sure we can come up with the right SQL and translation rules to make it work, but we're not there yet.

pbr6cornell commented 9 years ago

To clarify my comment: I believe we have to maintain all current date fields as date type. not only do all our analytics expect it, but many data sources only contain dates (without times). Many of our dates (without time) are required fields. I was only suggesting that the time fields in cdmv5 could be considered in future releases to be typecast as datetime type. This would not interfere with exisiting data or analytics, since the time fields are all optional, but may enable those with the information to have a standard representation. On Feb 11, 2015 4:37 PM, "Martijn Schuemie" notifications@github.com wrote:

Just a cautionary note on using datetime everywhere: a lot of our applications rely on the fact that if you subtract two dates you get an integer (representing the number of days). For datetimes the result is an interval that needs further processing.

I'm sure we can come up with the right SQL and translation rules to make it work, but we're not there yet.

— Reply to this email directly or view it on GitHub https://github.com/OHDSI/CommonDataModel/issues/6#issuecomment-73970001.

gracebrownecodes commented 9 years ago

I'm under the impression that PEDSnet members ARE interested in doing time queries. The PCORnet CDM also has fields for storing time data, so I'm assuming they are expecting queries on that information.

We are certainly hoping to take advantage of OHDSI developed applications for the PEDSnet data, but I'm not sure we can give up the datetime data type, either.

What has to be done to allow for this extension of the OMOP CDM without breaking compatibility with existing tools? I believe I ran into this before with Achilles and adding a line in the SqlRender replacement patterns fixed it.

pbr6cornell commented 9 years ago

If you want to store time, you should put it in the time field, that's why it's there (and in fact was added specifically for pedsnet and the other pcori cdrns who asked for it). I would not recommend changing the date field. If you change from the cdm standard to make the date fields to datetime, you will have to modify all analyses that expect dates only.

Cheers,

Patrick On Feb 12, 2015 2:29 PM, "Aaron Browne" notifications@github.com wrote:

I'm under the impression that PEDSnet members ARE interested in doing time queries. The PCORnet CDM also has fields for storing time data, so I'm assuming they are expecting queries on that information.

We are certainly hoping to take advantage of OHDSI developed applications for the PEDSnet data, but I'm not sure we can give up the datetime data type, either.

What has to be done to allow for this extension of the OMOP CDM without breaking compatibility with existing tools. I believe I ran into this before with Achilles and adding a line https://github.com/OHDSI/SqlRender/blob/master/inst/csv/replacementPatterns.csv#L56 in the SqlRender replacement patterns fixed it.

— Reply to this email directly or view it on GitHub https://github.com/OHDSI/CommonDataModel/issues/6#issuecomment-74135791.

cgreich commented 9 years ago

Aaron: The CDM has the type "time" for time-containing fields in the specs (e.g. http://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:observation). You are totally fine changing this from char() to something that can contain a proper time. Just don't touch the dates. And give us the DDL once you are done, so we can spread the wisdom.

gracebrownecodes commented 9 years ago

So no time until V5 then (except for observation). I have already seen many identical rows come out of our ETL in visit_occurrence, and possibly elsewhere, that represent the same event done repeatably within a day. Specifically, I saw multiple imaging visits in the same day for an inpatient that were indistinguishable without time information.

The problem with splitting date and time is with converting between timezones, or to UTC. There's no way to increment or decrement the date field when the time field moves across midnight. So we could store full datetime in the time field and date only in the date field, but that seems excessively redundant.

On the other hand, when you store dates without times in a datetime field, they are usually automatically assigned a time of midnight, which causes other problems for analysis (and again, for timezone shifting).

My preference would be to store everything as datetime (no timezone), store timezone names (olson, which allow for reliable conversion using the tz database) on location records which can be linked to the clinical data, and then each analysis decides how it wants to cast the data for the intended purpose. But, I don't maintain the analysis packages ;)

cgreich commented 9 years ago

Aaron: Wait: You want that for Version 4? We can't fix that one without a time machine. If you want to implement a local version with datetime, and then make sure that you fix it in all tools and applications that assume a date only, you should go ahead. But why not go to V5?

The time zones: Does that really happen within a patient? When does a patient move time zones and is in the midde of an acute treatment where time is of the essence?

As Patrick said: A diff between two dates is expected to be a number of days. And those clauses are all over the place in tons of different places. You don't want to go there. :)

gracebrownecodes commented 9 years ago

We have a local version of V4 with datetime, so I'm very interested in evaluating what we would need to change. So far, the only OHDSI tool we are using is Achilles.

Not that I know of so far, although one of our sites does straddle a timezone boundary and have care sites on either side. But certainly across patients if you want to execute any sort of "instant in time" query like, I don't know, ER visits after the Super Bowl. I thought it was generally accepted that date and time should be stored together (see "epoch", e.g.)?

It seems to me like this is one place where the "type safety" of databases falls down (certainly between the "same" schema in different DBMSs), so the analysis applications should make sure to cast to what they want.

When I get a chance, I'll take a look at the Achilles code and see how much work it might take to do the casting. I'd be happy to contribute the casting modifications to the community, if you would accept.

gracebrownecodes commented 9 years ago

Cloning the Achilles code base and using git grep 'end_date\|start_date\|procedure_date\|observation_date', I get 304 lines, all in Achilles_v4.sql and Achilles_v5.sql.

Most of those look like they wouldn't be affected by a datetime type because they are comparisons, order bys, or part (year, month) extractions.

Doing git grep 'DATEDIFF' I get only 28 results, where I would have to add CAST(@a AS DATE) to each of two arguments.

Does that sound right?

vojtechhuser commented 8 years ago

Can we make this official in the DDL? (if many sites are doing it anyway?)

Replace in DDL the varchar10 with DATETIME

For example here

CREATE TABLE measurement 
    ( 
     measurement_id                 INTEGER         NOT NULL , 
     person_id                      INTEGER         NOT NULL , 
     measurement_concept_id         INTEGER         NOT NULL , 
     measurement_date               DATE            NOT NULL , 
     measurement_time               VARCHAR(10)     NULL ,
clairblacketer commented 7 years ago

CDM v5.1 release will have both date and datetime fields and there is ongoing discussion in the workgroup meetings when to officially switch to datetime only.

clairblacketer commented 7 years ago

This is related to issues #61 and #60