HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Values entered as dates only should be valid in date-time typed fields #539

Closed malloryfreeberg closed 5 years ago

malloryfreeberg commented 6 years ago

For which schema is a change/update being suggested?

Affected fields are in the cell_line.json, death.json, and specimen_from_organism.json schemas.

What should the change/update be?

The following fields are currently typed as JSON date-time. Per the JSON validator, valid values must contain the hh:mm:ss part, but often the time was not recorded, or only a year or year-month may be supplied. In these cases, a user's value will appear "Invalid", which is not desirable.

  1. cell_line.date_established: change from JSON type date-time to type date, given that we ask for the "date" the cell line was established, not also the "time". If we think that the time a cell line was established is important for understanding the dataset, we should change the field name to time_established. (#821)
  2. death.time_of_death: only supplying a "date" should be valid. Also, only supplying the year-month or just the year should be valid.
  3. specimen_from_organism.collection_time: only supplying a "date" should be valid. Also, only supplying the year-month or just the year should be valid.
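
To make the failure mode concrete, here is a minimal Python sketch (not part of any HCA component) showing that a strict RFC 3339-style date-time check rejects the partial values contributors actually record:

```python
from datetime import datetime

# Strict RFC 3339-style date-time layout, as the JSON "date-time"
# format effectively requires.
RFC3339 = "%Y-%m-%dT%H:%M:%SZ"

def is_valid_datetime(value):
    """Return True only if value parses as a full date-time."""
    try:
        datetime.strptime(value, RFC3339)
        return True
    except ValueError:
        return False

print(is_valid_datetime("2016-01-21T00:00:00Z"))  # True: full date-time
print(is_valid_datetime("2016-01"))               # False: year-month only
print(is_valid_datetime("2016"))                  # False: year only
```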

What new field(s) need to be changed/added?

No new fields requested.

Why is the change requested?

For an example dataset given to us with time_of_death information, only the death year-month was recorded and thus a dummy "00T00:00:00Z" bit had to be added to the end of the date to be valid against the JSON date-time format. Expecting users to do this is not appropriate and might not fit the ethical standards for the data.

collection_time might also have some ethical restrictions in cases where a specimen was collected from a dead donor, and the specimen collection_time was recorded (thus potentially identifying the donor).

For date_established, we want to be sure that we are expecting a date if the word "date" is in the field name and a time if the word "time" is in the field name.

willrockout commented 6 years ago

@malloryfreeberg Out of curiosity, if these formats are changed to just date, will an error be thrown if a time is included?

malloryfreeberg commented 6 years ago

@malloryfreeberg Out of curiosity, if these formats are changed to just date, will an error be thrown if a time is included?

I believe so, yes.

willrockout commented 6 years ago

Then do we want to keep this format, since it will prevent people from supplying a time when they do have one?

malloryfreeberg commented 6 years ago

If a field like cell_line.date_established specifically asks for a date, then we should assert that time won't be accepted and ask data contributors to correct it (either in their spreadsheet or, later, directly in the UI) if they entered a time. If a field like death.time_of_death could potentially have a time, a user should be able to omit the time and just give a date, for personal identification reasons.

diekhans commented 6 years ago

Another approach would be to have the spreadsheet or submission program drop the time.


If a field like cell_line.date_established specifically asks for a date, then we should assert that time won't be accepted and ask data contributors to correct it (either in their spreadsheet or, later, directly in the UI) if they entered a time. If a field like death.time_of_death could potentially have a time, a user should be able to omit the time and just give a date, for personal identification reasons.


malloryfreeberg commented 5 years ago

Another approach would be to have the spreadsheet or submission program drop the time.

Sure, we could code the importer to either drop the time value if one is supplied for a date field, or add a stub time value if none is supplied for a date-time field. We'd probably need to tell contributors this happened so they don't wonder why the resulting value differs from what they submitted. If we didn't go the importer route, we'd have to hack the JavaScript validator, which sounds like more of a risk.
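
A rough sketch of what such importer code might look like (the function name and format labels are hypothetical, not the actual ingest importer API):

```python
import re

def normalize_for_schema(value, schema_format):
    """Coerce a contributor-entered value to the schema's declared format.

    Drops the time for 'date' fields; pads a stub midnight time for
    'date-time' fields. Partial inputs are assumed to be YYYY, YYYY-MM,
    or YYYY-MM-DD strings.
    """
    if schema_format == "date":
        # Keep only the date part if a time was supplied.
        return value.split("T")[0]
    if schema_format == "date-time":
        if "T" in value:
            return value  # already a full date-time
        # Pad partial dates out to a full stub timestamp.
        date = value
        if re.fullmatch(r"\d{4}", date):
            date += "-01"
        if re.fullmatch(r"\d{4}-\d{2}", date):
            date += "-01"
        return date + "T00:00:00Z"
    return value

print(normalize_for_schema("2016-01-21T15:30:00Z", "date"))  # 2016-01-21
print(normalize_for_schema("2016-01", "date-time"))  # 2016-01-01T00:00:00Z
```

As noted above, any such silent coercion would need to be reported back to the contributor.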

In summary, point 1 in this ticket is being addressed in another ticket (#821).

I suggest points 2 and 3 be addressed by some importer code, unless there is a strong desire to hack the js validator. Thoughts on this? @simonjupp @tburdett @daniwelter @diekhans @justincc

justincc commented 5 years ago

Where is type checking performed right now?

malloryfreeberg commented 5 years ago

Where is type checking performed right now?

In whichever bit of ingest validates generated JSON documents against the schema. Is that the broker?

justincc commented 5 years ago

Where is type checking performed right now?

In whichever bit of ingest validates generated JSON documents against the schema. Is that the broker?

I thought that was being done by the Javascript so 2) 3) would be in the validator. @simonjupp am I wrong?

simonjupp commented 5 years ago

I think the importer needs to handle this. It should parse the value in the sheet and write it to JSON in the format dictated by the schema. There will be date libraries to help with any conversions in case the user enters the wrong date format.

diekhans commented 5 years ago

If not already done, conversion to GMT needs to be part of this.
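
For instance, in Python the conversion to GMT/UTC is a one-liner with `datetime.astimezone` (a sketch, not the actual ingest code):

```python
from datetime import datetime, timezone, timedelta

def to_gmt(dt):
    """Convert any timezone-aware datetime to GMT/UTC for storage."""
    return dt.astimezone(timezone.utc)

# e.g. a collection time recorded in a local timezone (UTC+2):
local = datetime(2016, 1, 21, 14, 30, tzinfo=timezone(timedelta(hours=2)))
print(to_gmt(local).isoformat())  # 2016-01-21T12:30:00+00:00
```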

tburdett commented 5 years ago

I'd recommend against having the importer do this. Wherever we have the importer make changes, these changes are hard-coded against the spreadsheet format and will fall out of sync with the schema over time. Worse, such changes are essentially rendered invisible to the submitter. This makes it impossible to round-trip between spreadsheet and JSON representations and creates the requirement to store spreadsheets "forever" for traceability.

Hacking the validator to do this is also a bad choice. I'd suggest exploring the possibility of adding a simple additional component in the ingest dataflow that is purely responsible for review/repair. This can cover common error cases (e.g. date/time conversions) and future enhancements (e.g. addition of automatic ontology mappings). This way, we'll have a clear provenance trail in the ingest database of what modifications occurred, by whom, and when.

diekhans commented 5 years ago

So which comes first, validation or repair? Matching the schema seems to be necessary for repair. However, when/if we add semantic validation of the metadata, that will need to be post-repair.

tburdett commented 5 years ago

I'd see validation as coming first; automatic repair is therefore triggered by certain recognisable error conditions (e.g. specimen_from_organism.collection_time not in required format 'YYYY-MM-DD hh:mm:ss'). The UX team have a series of designs for review/repair screens that would allow the submitter an opportunity to accept/reject proposed automated modifications. To support this, we will need some formal representation of changes, which might be simple to accomplish given a clear provenance trail of changes made by a review/repair component. Given the code is likely similar in both cases (new component vs modifying the importer), I'd advocate for a solution that preserves the provenance trail.
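
A minimal sketch of the validate-then-repair loop described here, assuming a validator that emits (field, error-code) pairs; the error codes, repair table, and record shape are all hypothetical:

```python
# Table of repair functions keyed by a recognisable error condition.
# Only a single illustrative repair is shown: padding a bare date out
# to a stub date-time.
REPAIRS = {
    "partial-date-time": lambda v: v + "T00:00:00Z" if "T" not in v else v,
}

def review_and_repair(record, errors, audit_log):
    """Apply known repairs for recognisable error conditions, recording
    each change so a provenance trail survives in the ingest database."""
    for field, code in errors:
        fix = REPAIRS.get(code)
        if fix is None:
            continue  # unrecognised error: leave for human review
        before = record[field]
        after = fix(before)
        if after != before:
            record[field] = after
            audit_log.append({"field": field, "before": before,
                              "after": after, "repair": code})
    return record
```

The audit log is what would let the review/repair screens show submitters each proposed modification for accept/reject.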

malloryfreeberg commented 5 years ago

Given that we're talking about review/repair to allow partial date/date-time submission and ensure adherence to GMT, it sounds like this issue won't be solved for a while.

Wranglers will have to decide what to do for these fields in the meantime if contributors are not able to provide the full date/date-time.

lauraclarke commented 5 years ago

For the wranglers in the meantime, I would recommend being as pragmatic as possible and documenting what assumptions are made. This is something that would be easy to overthink.

My proposal would be to be as precise as possible, but if you don't have the day, give the first day of the month; if you don't have the month, give, say, the middle month of the year; and if you don't have the year, then make the date something ridiculous which clearly can't be true, if it isn't possible to make it null or n/a.

daniwelter commented 5 years ago

@lauraclarke It shouldn't be necessary to make up random ridiculous dates at this stage, as none of the possible submitter-provided date-time fields are required - the only required date-time fields are in programmatic contexts such as provenance and analysis. I agree with the overall approach of rounding to midnight if no time is provided, the first day of the month for YYYY-MM, etc.

lauraclarke commented 5 years ago

If none of the date fields are required then it is indeed much easier as they can be blank if not provided.

malloryfreeberg commented 5 years ago

As a concrete use case, one of the data contributors was willing to provide the year and month of death for a donor, but not the day or time. Since we couldn't validate just a month and year, and they weren't comfortable making up a day (in case it was the true day and thus made their donor more identifiable), they just didn't supply the metadata.

I want to avoid discouraging users from providing metadata because we are technically constrained by JSON validation.

diekhans commented 5 years ago

@tburdett I think the pipeline might be

  1. syntax validation (check JSON conforms to schema) ->
  2. correction/repair/consistency ->
  3. semantic validation (see that fields make sense together, etc)

The reason being:

malloryfreeberg commented 5 years ago
  1. syntax validation (check JSON conforms to schema) ->
  2. correction/repair/consistency ->
  3. semantic validation (see that fields make sense together, etc)

I really like this approach! #2 seems very far off, though :( and we desperately need #3, even now. Wranglers spend a lot of time manually checking that the contributors entered correct values. In a very recent, important case, Marcus noticed that some 10x experiments were marked as "strand": "unstranded", some as "strand": "first", and some as "strand": "second". 10x must be marked as first strand, IIRC. It would be good to validate this during import and suggest the change if the assay is type 10X but the strand is not "strand": "first" (many people don't know this about the library preparation step and will just guess).
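
The 10x/strand check is exactly the kind of semantic rule such a validator could encode. A hypothetical sketch (field names are illustrative, not actual HCA schema paths):

```python
def check_strand(document):
    """Cross-field semantic rule: 10x library preparations should be
    first-strand. Returns a suggested correction if the strand value is
    inconsistent with a 10x assay, else None."""
    method = document.get("library_construction_method", "")
    if "10x" in method.lower() and document.get("strand") != "first":
        return {"field": "strand",
                "found": document.get("strand"),
                "suggest": "first",
                "reason": "10x protocols are first-strand"}
    return None

# Flags the inconsistent document, passes the consistent one:
print(check_strand({"library_construction_method": "10X v2",
                    "strand": "unstranded"}))
print(check_strand({"library_construction_method": "10X v2",
                    "strand": "first"}))  # None
```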

tburdett commented 5 years ago

@diekhans agreed re: the pipeline. The ingestion service has a component for 1. (ingest-validator) but not 2. or 3. currently. 1. reports on syntax validation, and reports from our chosen JSON schema validator are quite well structured so specific categories of failures are easy to recognise (and therefore repair in known conditions) - see e.g. https://github.com/epoberezkin/ajv#error-objects.

Overall I agree that this seems like a sane design and gives us the flexibility to address a variety of needs and use cases. I don't see that item 2. is necessarily very far off though. @justincc should comment as this will obviously depend on other priorities but a simple minimal version should be fairly straightforward to set up in our current infrastructure (even if it only does some very basic fixes like date-time format conversions in the first instance).

justincc commented 5 years ago

I'm having difficulty relating this discussion to your original use cases @malloryfreeberg. Regarding your concrete use case above, are we saying that

daniwelter commented 5 years ago

I have an alternative suggestion that would take the onus of addressing this off the ingest and/or validation service. We could change the affected metadata schema properties to use the oneOf keyword and allow both a full date/date-time (as appropriate) and a reduced field matching a defined regex pattern.

Example:

"time_of_death": {
    "description": "Date and time when death was declared.",
    "type": "string",
    "user_friendly": "Time of death",
    "example": "2016-01-21T00:00:00Z",
    "guidelines": "Enter the time in date-time format: yyyy-mm-ddThh:mm:ssZ.",
    "oneOf": [
        {
            "format": "date-time"
        },
        {
            "pattern": "^(19|20)\\d\\d([- /.])(0[1-9]|1[012])$"
        }
    ]
}
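To see how the oneOf would behave, here is a pure-Python approximation (a stand-in for a real JSON Schema validator, using the year-month pattern from the example; note that in a JSON file the backslashes in the pattern need escaping as `\\d`):

```python
import re
from datetime import datetime

# Branch 2 of the oneOf: a YYYY-MM partial date.
PARTIAL = re.compile(r"^(19|20)\d\d([- /.])(0[1-9]|1[012])$")

def accepted_by_oneof(value):
    """Accept a value if exactly one oneOf branch matches, mirroring
    JSON Schema oneOf semantics."""
    matches = 0
    if PARTIAL.match(value):
        matches += 1
    try:
        # Branch 1 of the oneOf: a full date-time.
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        matches += 1
    except ValueError:
        pass
    return matches == 1

print(accepted_by_oneof("2016-01-21T00:00:00Z"))  # True: full date-time
print(accepted_by_oneof("2016-01"))               # True: year-month
print(accepted_by_oneof("2016-01-21"))            # False: matches neither
```

One caveat: whether the "format" keyword is actually asserted depends on the validator's configuration, so the date-time branch should be checked against the validator ingest uses.
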
malloryfreeberg commented 5 years ago

Solving the "entering partial date-time" issue is a requirement for ingesting a GA dataset that we have now. I am ok with moving ahead with Dani's suggestion, although I might change the pattern to be more permissive, to include all partial versions of the date-time. I would like to move this to the current sprint so it's not a blocker for ingesting this dataset.

malloryfreeberg commented 5 years ago

Replaced by https://github.com/HumanCellAtlas/metadata-schema/issues/966