cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License

High Priority: Collection Date - level of precision #137

Closed griffie closed 3 years ago

griffie commented 3 years ago

For incomplete collection dates (precise only to the year or to the month), we need a "Date Unit" field with the values "Year", "Month" and "Day". If a collection date specifies only a year or a month, the Date Unit field will record that at the validation step. The DH should then automatically fill in the missing date parts with "01" so that the date is accepted by downstream programs that require year-month-day (YYYY-MM-DD). In the export file for CNPHI, the Date Unit field should be called Precision. We'll map that once the DH adopts the changes above.
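The padding rule requested above can be sketched as follows. This is an illustration only, written in Python rather than in DataHarmonizer's own code; the helper name `pad_partial_date` and the choice to raise an error on a mismatch between value and unit are assumptions, not the project's actual behaviour.

```python
import re

# Hypothetical helper for illustration; not DataHarmonizer's implementation.
# Expected shape of the date string for each Date Unit value.
_PATTERNS = {
    "year": re.compile(r"^\d{4}$"),
    "month": re.compile(r"^\d{4}-\d{2}$"),
    "day": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def pad_partial_date(value: str, unit: str) -> str:
    """Pad a year- or month-precision date with "01" to reach YYYY-MM-DD."""
    key = unit.strip().lower()
    if key not in _PATTERNS:
        raise ValueError(f"unknown date unit: {unit!r}")
    if not _PATTERNS[key].match(value):
        # Assumption: a value that does not match its unit is rejected here.
        raise ValueError(f"{value!r} does not match unit {unit!r}")
    if key == "year":
        return value + "-01-01"
    if key == "month":
        return value + "-01"
    return value  # already a full date


print(pad_partial_date("2020", "Year"))      # 2020-01-01
print(pad_partial_date("2020-02", "Month"))  # 2020-02-01
```

The unit is kept alongside the padded value so that a downstream reader can tell a real "2020-01-01" from a padded "2020".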

cmrn-rhi commented 3 years ago

Testing "Sample Collection Date Unit"

Branch: data-bucket
Testing Date: 2021-02-10

I have just done some testing on sample collection date unit and haven't found any issues with importing, copy-pasting, or using the picklist to enter and validate values. All work fine. No matter how I add the date unit, the sample collection date is automatically reformatted with 01 pseudo-values (before the validation step). E.g. if I import 2020 it becomes 2020-01-01 when I select year, and if I paste 2020 it becomes 2020-01-01 even if no unit has been selected.

The only usability concern I have is that if someone adds a sample collection date unit within the DataHarmonizer after already having entered sample collection dates, they could accidentally overwrite values in their sample collection date. E.g. if I have 2020-02-18 and then accidentally select month instead of year, the date changes to 2020-02-01.
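To make the overwrite concern concrete, here is a minimal sketch of a truncate-and-pad rule of the kind that would produce the behaviour described. The functions `apply_unit` and `would_lose_information` are hypothetical names for illustration, not DataHarmonizer functions, and the sketch assumes the input is already a full YYYY-MM-DD string.

```python
# Hypothetical illustration of the overwrite hazard; not DataHarmonizer code.

def apply_unit(date_str: str, unit: str) -> str:
    """Coarsen a full YYYY-MM-DD date to the given unit, padding with "01"."""
    year, month, _day = date_str.split("-")
    if unit == "year":
        return f"{year}-01-01"
    if unit == "month":
        return f"{year}-{month}-01"
    if unit == "day":
        return date_str
    raise ValueError(f"unknown date unit: {unit!r}")


def would_lose_information(date_str: str, unit: str) -> bool:
    """True if applying the unit would change (and so overwrite) the date."""
    return apply_unit(date_str, unit) != date_str


print(apply_unit("2020-02-18", "month"))              # 2020-02-01
print(would_lose_information("2020-02-18", "month"))  # True
print(would_lose_information("2020-02-18", "day"))    # False
```

A check like `would_lose_information` could be used to ask for confirmation before a unit selection replaces an existing day or month.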

cmrn-rhi commented 3 years ago

I tested importing with all eligible file types using modified (and updated) versions of the validTestData as well as the test file provided by damion. However, when I did some tests on the modified (and updated) version of the invalidTestData I noticed 2020 wasn't automatically converting to 2020-01-01. I tried seeing what would happen if I paired 2020 with year, month, and day and the result was the following:

2020 | day
2020-01-01 | year
2020-__01 | month

I'm not certain why this is happening, but fortunately the validation process will always catch and draw attention to these occurrences.

Edit:
Input: DH1311p_collection-date-unit_test-05 (invalid data - 2020 testing).csv
Output: DH1311p_collection-date-unit_test-05-output (invalid data - 2020 testing).csv

cmrn-rhi commented 3 years ago

Test Files:

DH Test_2021-02-10 (sample collection date unit).zip

I only saved the output when there were unexpected results; in the future I will include the output regardless of the results.

Edit: "DH" stands for "DataHarmonizer"; "DH1311p" stands for "DataHarmonizer version 0.13.11 pre-release".

ddooley commented 3 years ago

So I've made a change so that when a spreadsheet is loaded, the program no longer tries to automatically correct dates into a yyyy-mm-dd format. E.g. "2020" in a date field was getting converted into 2020-01-01 on load, but now it remains 2020. That way a user can manually adjust any date, rather than the program making assumptions about what it should be converted to. Such values will trigger a validation error to highlight the ones that need correction.

The reason a "day" setting kept 2020 as-is is that I didn't want to make assumptions about the day and month components of what was only a year.
Similarly, for "month" it's prompting the user for a month when only a year is given. In that case it assumes the day is 01.

ddooley commented 3 years ago

Also, we now have it that no changes are made automatically to the year/month/day granularity. I did this by renaming the "sample collection date unit" field to "sample collection date precision", since the program still invokes the auto-update on any date + unit field pair. Instead, any given date is converted to the given date granularity only on export to a particular target database.
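The export-only conversion described above might look roughly like this. It is a sketch under stated assumptions: the function names `to_granularity` and `export_row_for_cnphi` are hypothetical, a row is modelled as a plain dict, and the only export detail taken from this thread is that the CNPHI export calls the field "Precision"; the real export mapping may differ.

```python
# Hypothetical sketch of converting dates only at export time.
# Field names come from this thread; function names and the dict-based
# row model are assumptions for illustration.

DATE_FIELD = "sample collection date"
PRECISION_FIELD = "sample collection date precision"


def to_granularity(date_str: str, precision: str) -> str:
    """Truncate a date to the stated precision and pad the rest with "01"."""
    keep = {"year": 1, "month": 2, "day": 3}.get(precision.strip().lower())
    parts = date_str.split("-")
    if keep is None or len(parts) < keep:
        # Unknown precision, or the date is coarser than the precision
        # claims: leave the value untouched so validation can flag it.
        return date_str
    return "-".join(parts[:keep] + ["01"] * (3 - keep))


def export_row_for_cnphi(row: dict) -> dict:
    """Return a converted copy of the row; the editor's row is not modified."""
    out = dict(row)
    precision = out.pop(PRECISION_FIELD, "")
    out[DATE_FIELD] = to_granularity(out.get(DATE_FIELD, ""), precision)
    out["Precision"] = precision  # CNPHI name for the unit field
    return out


row = {DATE_FIELD: "2020-02-18", PRECISION_FIELD: "month"}
print(export_row_for_cnphi(row)[DATE_FIELD])  # 2020-02-01
print(row[DATE_FIELD])                        # 2020-02-18 (unchanged)
```

Working on a copy means the value the user typed stays in the editor, and only the exported file carries the coarsened date.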

cmrn-rhi commented 3 years ago

Export Testing

I double-checked this (while testing the CanCOGen-vocabulary-fix branch), and sample collection date precision combined with sample received date behaved as you described when imported and exported.

Example

Import:

DH1315_CNPHI-Export_test-01-input (sample collection date precision)

Export (CNPHI):

DH1315_CNPHI-Export_test-01-output (sample collection date precision)

Attachments: DH-Test_2021-02-21 (CNPHI Export - date precision).zip

cmrn-rhi commented 3 years ago

Date Auto-Update Concern

The program is invoking the auto-update on any date field, not just the date + unit field pairs.

The following fields have the auto-date formatting to ensure there are values for year, month, and day, but they don't have a paired date precision column to clarify that they are not actually dated "YYYY-01-01".

Attachments: DH-Test_2021-02-21 (CanCOGeN vocabulary fix).zip

ddooley commented 3 years ago

The date auto-format function (which would be applied to all dates in a loaded spreadsheet on load) has been removed from date fields across the board, so malformed dates remain as is and are only highlighted when one presses "Validate".
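As an illustration of "leave as-is, flag on Validate", the sketch below reports malformed date cells without modifying them. The names `is_full_date` and `invalid_date_cells` are hypothetical, rows are modelled as dicts, and skipping empty cells is an assumption (whether a date is required is a separate check).

```python
import re
from datetime import date

# Hypothetical sketch of validation-only date checking; not DataHarmonizer code.
_FULL_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_full_date(value: str) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not _FULL_DATE.match(value):
        return False
    try:
        date.fromisoformat(value)  # rejects e.g. 2020-02-30
    except ValueError:
        return False
    return True


def invalid_date_cells(rows, date_fields):
    """Return (row index, field) pairs to highlight; rows are not changed."""
    flagged = []
    for index, row in enumerate(rows):
        for field in date_fields:
            value = row.get(field, "")
            if value == "":
                continue  # assumption: emptiness is checked elsewhere
            if not is_full_date(value):
                flagged.append((index, field))
    return flagged


rows = [
    {"sample collection date": "2020"},
    {"sample collection date": "2020-02-18"},
]
print(invalid_date_cells(rows, ["sample collection date"]))
# [(0, 'sample collection date')]
```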