ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
214 stars 114 forks source link

Remove created and updated timestamps from API #839

Open andrewjesaitis opened 7 years ago

andrewjesaitis commented 7 years ago

There has been a lot of discussion surrounding timestamps both on the Reference Server Task Team and within Github issues (#568, #682, #685, #690).

The key takeaway is that there is a lot of confusion surrounding what these fields represent, they lack clearly defined use cases, and have no consistent representation across the API. As discussed in other threads, do these fields represent the time the sample was taken, the time the first sequencing run was done, the time when the downstream bioinformatic pipeline was run, the time when the data was added to the database, the file created time, the possible timestamp in the file? The issue is only further muddied when a sample has undergone multiple analysis runs or the application backend has no concept of created/updated (as is the case with the file-backed reference server) or when comparing timestamps from two different API implementations. These limitations lead me to believe these fields are implementation specific and thus should not be included in the schema definitions.

I propose the removal of created and updated fields from all messages. If there is a well defined need for a timestamp on a message than it can be added with documentation clearly stating what the timestamp represents and in a format consistent with the API (ISO8601). I think this approach of including timestamps only where necessary will reduce confusion, increase consistency, and ease future support.

kozbo commented 7 years ago

@mbaudis

mbaudis commented 7 years ago

@andrewjesaitis From previous discussions and the documentation it is actually clear that created and updated refer to record specific dates; there is no indication that they have to do anything with record-external time points (like sample collection or such):

  // The :ref:`ISO 8601<metadata_date_time>` time at which this Biosample record
  // was created.
  string created = 6;

  // The :ref:`ISO 8601<metadata_date_time>` time at which this Biosample record was
  // updated.
  string updated = 7;

Also, all the values are (or should be) ISO 8601, which also has been settled before.

We know from many use case discussions and personal experience that such information (i.e. beyond server time stamps etc.) is of practical value. How these attributes are used locally (i.e. if they are linked to audit-grade time stamping) is irrelevant for the API; also, this cannot be solved on a global scale.

There are 2 points which I seem worthwhile for discussion:

So my take: keep/implement throughout the schema (not necessary everywhere - a variant or call don't need them, since they would be implemented in the variantset, callset etc.). Obviously ISO; everything else is a violation of the schema docs. I don't mind about name changes (created & updated were actually decided against before, but somehow they stuck then around).

andrewjesaitis commented 7 years ago

I agree that at a minimum they should be renamed record_created_time and record_modification_time and use ISO8601.

I'll let @kozbo, @david4096 and @ejacox chime in with their thoughts as well.

mbaudis commented 7 years ago

@andrewjesaitis No problem with me, since this is what I wrote back in https://github.com/ga4gh/ga4gh-schemas/issues/568#issuecomment-190268402

@jeromekelleher While I'm fine with created | updated, this was seen as ambiguous in var. discussions ("updated" could be a Boolean/flag) & so @diekhans switched it to the current. I had started the whole ISO8601 switch & documentation, but very explicitly called those recordCreateTime | recordUpdateTime, originally. ... But as said, all fine with me, since the documentation should make it clear.

So I'll leave the clean-up to whoever feels responsible.