CDCgov / prime-reportstream

ReportStream is a public intermediary tool for delivery of data between different parts of the healthcare ecosystem.
https://reportstream.cdc.gov
Creative Commons Zero v1.0 Universal
72 stars 40 forks source link

SPIKE -- Evaluate How we Validate Dates in ReportStream & Make Recommendations #3056

Closed loripusey closed 2 years ago

loripusey commented 3 years ago

Problem Statement

We created a custom mapper to handle dates that come from a Sender submission as a string: https://app.zenhub.com/workspaces/experience-607d9d5e68b95200150fec37/issues/cdcgov/prime-reportstream/2246

However, what we didn't know was that the CSV Serializer was doing date validation based on the field being typed as a date (often from the base COVID schema) before the custom mapper would run, thereby, nullifying the effects of the new mapper.

Acceptance Criteria

aurora-a-k-a-lightning commented 2 years ago

Some initial notes:

  1. CSVSerializer.kt is validating elements in the mapRow method on line 343 when when it calls element.checkForError. As far as I can see, HLSerializer.kt is not utilizing this function in its equivalent (or anywhere in this file). Also, to answer Lori's question about inheritance, they are not inheriting from a parent Serializer class that share functions.

  2. The checkForError method is in the Element.kt, which means that the validation is being done on the Element level, it is only being utilized on the Serializer level, which means we can possibly use this function in the HL7Serializer in the same way

  3. DATE type is in Element.kt from lines 406 - 426. It has 3 ways of trying to parse out a date. Each block is a try/catch block. In other words, it trys to parse a date, returns null if it does, and if it doesn't, it goes into the catch block and moves on to the next try block.

    a. The first try block does LocalDate.parse(formattedValue), where formattedValue is the parameter sent in to checkForError, which we are assuming is just a String. The LocaDate.parse method is a Java method that uses parse with the DateTimeFormatter.ISO_LOCAL_DATE which isyyyy-MM-dd ex: 2021-11-15

    b. The second try block sees if there is a format given in the parameters of the checkForError method and then uses that to parse the date if format is null, it uses a standard datePattern, which is defined as "yyyyMMdd" on line 881

    c. The third try block uses a "variableDateTimePattern" which is this format: "[yyyyMMddHHmmssZ][yyyyMMddHHmmZ][yyyyMMddHHmmss]", and then created a DateTimeFormatter using that pattern. It then uses the parseBest method of DateTimeFomatter, using the offset variableDateTimePattern and attempts to parse out a date from it.

    d. If none of these parse out a date, the method sends back an InvalidDateMessage, which can be found in ElementErrors.kt lines 23 - 46.

4.DATETIME type is in Element.kt from 428 - 466. It has 5 ways of trying to parse out a date. Each block is a try/catch block, just like the DATE example previously.

a. The first try block does OffsetDateTime.parse(formattedValue), which uses the ISO format to parse out a date from a string i.e.: yyyy-MM-dd'T'HH:mm:ssZZZZZ

b. The second try block sees if there is a format given in the parameters of the checkForError method and then uses that to parse the datetime if the format is null, it uses a standard datetimePattern, which is defined as "[yyyyMMddHHmmssZ][yyyyMMddHHmmZ][yyyyMMddHHmmss]" on line 883

c. The third try block tries to parse out a datetime using a LocaDate pattern in the ReporStream canonical dateFormatter, which looks like this DateTimeFormatter.ofPattern(datePattern, Locale.ENGLISH) - this means it is using this date format "yyyyMMdd" to try to parse out a datetime

d. The fourth try block tries to parse using the variableDateTimePattern (akin to its DATE type counterpart) with the parseBest method within a timerange

e. The fifth try block tries to parse using the LocalDate, instead of the OffsetDateTime, with a defined format, or if format is null, then the datetimePattern ("[yyyyMMddHHmmssZ][yyyyMMddHHmmZ][yyyyMMddHHmmss]")

  1. Right underneath where mapRow is calling checkForError, we are calling the toNormalized method of Element.kt line 545. This method has almost the exact same code as checkForError as far as "normalizing" dates go. The difference here is that toNormalized returns null in the catch blocks, otherwise, if the date parses correctly, it returns normalDate.format(dateFormatter) for DATE type, and normalDateTime.format(datetimeFormatter) for DATETIME type. Also, in the final catch blocks, it is not using the same InvalidDateMessage object as checkForError, instead, it is writing an error: error("Invalid date: '$formattedValue' for element $fieldMapping")

  2. Something to keep in mind: if the Schema definition for an Element defines the Element type as being DATE and the sender sends a "date" with a "time" on it, it may not be a valid date and the value will be replaced with "**^^validationFail**".

loripusey commented 2 years ago

@ahay-agile6 Is it your recommendation then that we focus on doing all date validation in the checkForError function that the CsvSerializer uses and abandon use of the mapper that we just built? Also, I would think that we should handle as many date format scenarios as we can think of and handle scenarios where the element is DATE type but user sends a date/time and a DATETIME type where the user sends only a date. I don't think we should throw out or error out on data where there is a legitimate value and all we would have to do is to strip off the time value (for DATE type) or add a midnight time value (for DATETIME type). Thoughts? cc: @MikeC-A6 @rhood23699

aurora-a-k-a-lightning commented 2 years ago

@loripusey after speaking with the onboarding team, we came up with these recommendations:

Short Term:

  1. Yes! Remove our mapper that we made because the checkForError/toNormalized overwrite bad dates in the first place.
  2. Use the checkForError/toNormalized methods to handle the DATE Element type normalization. Add the scenarios of what we know/can predict how senders will manually enter in DATEs, including with datetimes, and add these senarios to the checkForError/toNormalized methods.

The two recommendations above will solve the data quality issue for dates being check and/or being set to blank when it is a "bad date".

Long Term: The next recommendation is for resolving a few inconsistencies and complexities in the code base that may help bring these types of issues to light

  1. Create another ticket to investigate why checkForError exists because a) it is not used in the HL7Serializer and b) it does not have the same code as toNormalized, but they are extremely similar (DRY principle)
  2. From the investigation, re-design/consolidate/refactor checkForError/toNormalized in the CSVSerializer and/or add the checkForError to the HL7Serializer if needed
aurora-a-k-a-lightning commented 2 years ago

https://app.zenhub.com/files/304423150/8c349b9a-6d7e-4046-89d9-8aa1d68e8340/download - Has examples of the manual entered date formats that we will need to add, written in Kotlin.

https://app.zenhub.com/files/304423150/93f11950-1c45-45d8-a8c6-f388e1b7924d/download - Example test data file for the Experity/CUC-al-covid-19 schema.

Input the data via the prime cli

./prime data --input result_files/CUC-Experity-Test2.csv --input-schema Experity/CUC-al-covid-19 --output-dir junk --route 
loripusey commented 2 years ago

I'll add 3 new tickets for the short & long-term fixes; thanks @ahay-agile6 https://app.zenhub.com/workspace/o/cdcgov/prime-reportstream/issues/3081 https://app.zenhub.com/workspace/o/cdcgov/prime-reportstream/issues/3082 https://app.zenhub.com/workspace/o/cdcgov/prime-reportstream/issues/3083