Closed malloryfreeberg closed 5 years ago
@malloryfreeberg Out of curiosity, if these formats are changed to just Date, will an error be thrown if time is included?
> @malloryfreeberg Out of curiosity, if these formats are changed to just Date, will an error be thrown if time is included?
I believe so, yes.
Then do we want to keep this format since it will negate people being able to put time if they have it?
If a field like `cell_line.date_established` specifically asks for a date, then we should assert that time won't be accepted and ask data contributors to correct it (either in their spreadsheet or, later, directly in the UI) if they entered a time. If a field like `death.time_of_death` could potentially have a time, a user should be able to omit the time and just give a date, for personal identification reasons.
Another approach would have the spreadsheet or submission program drop the time.
> Another approach would have the spreadsheet or submission program drop the time.
Sure, we could encode the importer to either drop the time value if supplied for a date field or add a stub time value if not supplied for a date-time field. We'd probably need to tell the contributors this happened so they don't wonder why the resulting value is different from what they submitted. If we didn't go the importer route, we'd have to hack the javascript validator, which sounds like more of a risk.
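As a rough sketch of what that importer-side normalisation could look like (a hypothetical helper, not actual ingest code; notifying the contributor about the change is not covered here):

```python
def normalise(value: str, schema_format: str) -> str:
    # Hypothetical importer helper: coerce a submitted value to the
    # format the schema expects ("date" or "date-time").
    has_time = "T" in value
    if schema_format == "date":
        # Drop any time component the contributor supplied.
        return value.split("T")[0] if has_time else value
    if schema_format == "date-time":
        # Add a stub midnight time if only a date was supplied.
        return value if has_time else f"{value}T00:00:00Z"
    raise ValueError(f"unknown schema format: {schema_format!r}")
```

For example, `normalise("2016-01-21T14:30:00Z", "date")` would yield `"2016-01-21"`, and `normalise("2016-01-21", "date-time")` would yield `"2016-01-21T00:00:00Z"`.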
In summary, point 1 in this ticket is being addressed in another ticket (#821).
I suggest points 2 and 3 be addressed by some importer code, unless there is a strong desire to hack the js validator. Thoughts on this? @simonjupp @tburdett @daniwelter @diekhans @justincc
Where is type checking performed right now?
> Where is type checking performed right now?
In whichever bit of ingest validates generated JSON documents against the schema. Is that the broker?
> Where is type checking performed right now?

> In whichever bit of ingest validates generated JSON documents against the schema. Is that the broker?
I thought that was being done by the JavaScript, so 2) and 3) would be in the validator. @simonjupp am I wrong?
I think the importer needs to handle this. It should parse the value in the sheet and write it to JSON in the format dictated by the schema. There will be date libraries to help with any conversions in case the user enters the wrong date format.
If not already done, conversion to GMT needs to be part of this.
I'd recommend against having the importer do this. Wherever we have the importer make changes, these changes are hard-coded against the spreadsheet format and will fall out of sync with the schema over time. Worse, such changes are essentially rendered invisible to the submitter. This makes it impossible to round-trip between spreadsheet and JSON representations and creates the requirement to store spreadsheets "forever" for traceability.
Hacking the validator to do this is also a bad choice. I'd suggest exploring the possibility of adding a simple additional component in the ingest dataflow that is purely responsible for review/repair. This can cover common error cases (e.g. date/time conversions) and future enhancements (e.g. addition of automatic ontology mappings). This way, we'll have a clear provenance trail in the ingest database of what modifications occurred, by whom, and when.
So which comes first, validation or repair? Matching the schema seems to be necessary for repair. However, when/if we add semantic validation of the metadata, that will need to be post-repair.
I'd see validation as coming first; automatic repair is therefore triggered by certain recognisable error conditions (e.g. `specimen_from_organism.collection_time` not in required format 'YYYY-MM-DD hh:mm:ss'). The UX team have a series of designs for review/repair screens that would allow the submitter an opportunity to accept/reject proposed automated modifications. To support this we will need some formal representation of changes, which might be simple to accomplish given clear provenance over changes made by a review/repair component. Given the code is likely similar in both cases (new component vs modifying the importer), I'd advocate for the solution that preserves the provenance trail.
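To illustrate the kind of provenance record such a review/repair component could keep, here is a minimal sketch (the names and structure are invented for illustration, not an actual ingest design):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ProposedRepair:
    # Illustrative provenance record of one automated modification:
    # what changed, why, by which component, and when.
    field_path: str
    original: str
    proposed: str
    reason: str
    proposed_by: str = "repair-component"
    proposed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    accepted: Optional[bool] = None  # set once the submitter accepts/rejects

def propose_stub_time(field_path: str, value: str) -> ProposedRepair:
    # One recognisable, repairable error condition: a bare date was
    # supplied where the schema requires a date-time.
    return ProposedRepair(
        field_path=field_path,
        original=value,
        proposed=f"{value}T00:00:00Z",
        reason="date-time required by schema; stub midnight time proposed",
    )
```

Persisting records like this in the ingest database would give the "what, by whom, and when" trail described above, and the `accepted` flag maps onto the review screens where the submitter approves or rejects each change.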
Given that we're talking about review/repair to allow partial date/date-time submission and ensure adherence to GMT, it sounds like this issue won't be solved for a while.
Wranglers will have to decide what to do for these fields in the meantime if contributors are not able to provide the full date/date-time.
For the wranglers in the meantime, I would recommend being as pragmatic as possible and documenting whatever assumptions are made. This is something it would be easy to overthink.
My proposal would be to be as precise as possible, but if you don't have the day, give the first day of the month; if you don't have the month, give, say, the middle month of the year; and if you don't have the year, then make the date something ridiculous which clearly can't be true, if it isn't possible to make it null or n/a.
@lauraclarke It shouldn't be necessary to make up random ridiculous dates at this stage, as none of the possible submitter-provided date-time fields are required - the only required date-time fields are in programmatic contexts such as provenance and analysis. I agree with the overall approach to round to midnight if no time is provided, first day of the month for YYYY-MM, etc.
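The round-down convention discussed above could be sketched like this (an illustration of the idea, not actual ingest behaviour):

```python
import re

def pad_partial_date(value: str) -> str:
    # Pad a partial date out to a full date-time by rounding down to
    # the earliest instant it could mean: first month for a bare year,
    # first day for a year-month, midnight for a bare date.
    if re.fullmatch(r"\d{4}", value):               # YYYY
        value += "-01"                               # -> first month
    if re.fullmatch(r"\d{4}-\d{2}", value):          # YYYY-MM
        value += "-01"                               # -> first day
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):    # YYYY-MM-DD
        value += "T00:00:00Z"                        # -> midnight
    return value
```

So `pad_partial_date("2016-03")` yields `"2016-03-01T00:00:00Z"`, while a value that already carries a time passes through unchanged.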
If none of the date fields are required then it is indeed much easier as they can be blank if not provided.
As a concrete use case, one of the data contributors was willing to provide year and month of death for a donor, but not the day or time. Since we couldn't validate just a month and year, and they weren't comfortable making up a day (in case it was the true day and thus made their donor more identifiable), they just didn't supply the metadata.
I want to avoid discouraging users from providing metadata because we are technically constrained by JSON validation.
@tburdett I think the pipeline might be:

- syntax validation (check JSON conforms to schema) ->
- correction/repair/consistency ->
- semantic validation (see that fields make sense together, etc.)

The reason being: repair needs a document that already matches the schema, and semantic validation needs to run on the repaired values.
I really like this approach! #2 seems very far off, though :( and we desperately need #3, even now. Wranglers spend a lot of time manually checking that the contributors entered correct values. In a very recent, important case, Marcus noticed that some 10x experiments were marked as `"strand": "unstranded"`, some as `"strand": "first"`, and some as `"strand": "second"`. 10x must be marked as first strand, IIRC. It would be good to validate this during import and suggest the change if the assay type is 10X but the strand is not `"strand": "first"` (many people don't know this about the library preparation step and will just guess).
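A semantic rule like that 10x/strand check could be sketched as follows (the field names `library_construction_method` and `strand` are assumed for illustration, not taken from the real schema):

```python
def check_10x_strand(library_prep: dict) -> list:
    # Illustrative semantic-validation rule: 10x libraries should be
    # recorded as first-strand; flag anything else for review.
    warnings = []
    method = str(library_prep.get("library_construction_method", "")).lower()
    strand = library_prep.get("strand")
    if "10x" in method and strand != "first":
        warnings.append(
            f"10x experiments should be first strand; got strand={strand!r}, "
            "suggest changing to 'first'"
        )
    return warnings
```

A rule set like this would sit in the semantic-validation step of the pipeline above, emitting suggestions rather than hard failures so the submitter can confirm the proposed change.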
@diekhans agreed re: the pipeline. The ingestion service has a component for 1. (ingest-validator) but not 2. or 3. currently. 1. reports on syntax validation, and reports from our chosen JSON schema validator are quite well structured so specific categories of failures are easy to recognise (and therefore repair in known conditions) - see e.g. https://github.com/epoberezkin/ajv#error-objects.
Overall I agree that this seems like a sane design and gives us the flexibility to address a variety of needs and use cases. I don't see that item 2. is necessarily very far off though. @justincc should comment as this will obviously depend on other priorities but a simple minimal version should be fairly straightforward to set up in our current infrastructure (even if it only does some very basic fixes like date-time format conversions in the first instance).
I'm having difficulty relating this discussion to your original use cases @malloryfreeberg. Regarding your concrete use case above, are we saying that
I have an alternative suggestion that would take the onus of addressing this off the ingest and/or validation service. We could change the affected metadata schema properties to use the `oneOf` keyword and allow both a full date/date-time (as appropriate) and a reduced field matching a defined regex pattern.
Example:
```json
"time_of_death": {
    "description": "Date and time when death was declared.",
    "type": "string",
    "user_friendly": "Time of death",
    "example": "2016-01-21T00:00:00Z",
    "guidelines": "Enter the time in date-time format: yyyy-mm-ddThh:mm:ssZ.",
    "oneOf": [
        {
            "format": "date-time"
        },
        {
            "pattern": "^(19|20)\\d\\d([- /.])(0[1-9]|1[012])$"
        }
    ]
}
```
Solving the "entering partial date-time" issue is a requirement for ingesting a GA dataset that we have now. I am ok with moving ahead with Dani's suggestion, although I might change the `pattern` to be more permissive, to include all partial versions of the date-time. I would like to move this to the current sprint so it's not a blocker for ingesting this dataset.
**For which schema is a change/update being suggested?**

Affected fields are in the `cell_line.json`, `death.json`, and `specimen_from_organism.json` schemas.

**What should the change/update be?**

The following fields are currently typed to JSON `date-time`. As per the JSON validator, valid values must contain the hh:mm:ss part, but oftentimes the time is not recorded or only a year/year-month is allowed to be supplied. In these cases, a user's value will appear "Invalid", which is not desirable.

- `cell_line.date_established`: change from JSON type `date-time` to type `date`, given that we ask for the "date" the cell line was established, not also the "time". If we think that the time a cell line was established is important to understand the dataset, we should change the field name to `time_established`. (#821)
- `death.time_of_death`: only supplying a "date" should be valid. Also, only supplying the year-month or just the year should be valid.
- `specimen_from_organism.collection_time`: only supplying a "date" should be valid. Also, only supplying the year-month or just the year should be valid.

**What new field(s) need to be changed/added?**

No new fields requested.

**Why is the change requested?**

For an example dataset given to us with `time_of_death` information, only the death year-month was recorded, and thus a dummy "00T00:00:00Z" bit had to be added to the end of the date to be valid against the JSON `date-time` format. Expecting users to do this is not appropriate and might not fit the ethical standards for the data. `collection_time` might also have some ethical restrictions in cases where a specimen was collected from a dead donor and the specimen `collection_time` was recorded (thus potentially identifying the donor). For `date_established`, we want to be sure that we are expecting a date if the word "date" is in the field name and a time if the word "time" is in the field name.