ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
214 stars 114 forks source link

Improve consistency of timestamp usage #690

Closed david4096 closed 1 year ago

david4096 commented 8 years ago

Timestamps are available across the API and this PR replaces the mixture of timestamp implementations by settling on ISO 8601. Created and updated fields were added to the dataset, feature, and feature set messages. Field names in assay metadata were renamed to be consistent with the remainder of the data model. Closes #682 and closes #685 and closes #690

Found mention of timestamp in intro.rst and replaced with newer definition. Thanks @mbaudis

mbaudis commented 8 years ago

+1

Great & fast response - thanks @kozbo & @david4096!

ejacox commented 7 years ago

Are timestamps necessary for Features? How would they differ from the FeatureSet timestamps?

Similarly, do individual variants need timestamps?

What is the difference between the timestamps in Datasets and those in FeatureSets?

david4096 commented 7 years ago

No, but they are always optional.

The BRCA Exchange has recently put in some effort to exploring how to satisfy the FDA revision requirements for data used for medical purposes. The idea is that the date accessed becomes important when the annotations are updated so often. I think that this timestamp model fits the most basic use case of a client being able to determine if a variant has changed from last access.

This does not completely satisfy the needs of reliable replication, but it is a step in the right direction.

The reference server's implementation would not have variants modified without modifying a variant set, but the BRCA Exchange's implementation might. The same could be true of GENCODE feature sets, wherein I will be able to see if any new featuresets have been added newer than the ones I have already downloaded.

An implementor doesn't need to return timestamps, but it needs to be a possible field. I would assume that if a nested document didn't have a timestamp that it inherited from its ancestor. Perhaps stating this clearly in the documentation would be helpful. I can't construe a situation where the reference server ought to return created timestamps for feature messages, but I can imagine other implementations that might.

mbaudis commented 7 years ago
An implementor doesn't need to return timestamps ... needs to be a possible field

Exactly. We cannot force complex data scenarios, but have to enable their implementation.

ejacox commented 7 years ago

OK. Thanks. I would be careful with the its always optional argument. Unnecessary or poorly defined fields could become a pain to maintain and also affect usability. Adding fields can be easier than deleting them.

ejacox commented 7 years ago

The comments for timestamps need to be rewritten to define what the timestamp is for. The current comments are too vague. Does creation time refer to when the message was added to the server? Why would anyone care when the data was loaded? If not, is there existing data with timestamps or are we being wishful? Why do all the comments say record? Should this be message?

The current server uses very few, if any, of the timestamps. Some are just set to the current time. The input methods seem to be an afterthought. Are timestamps required in any cases? Has there been any discussion about enforcing those? If not, why is this a field and not just metadata?

These fields might optional in the message. In the server, they will take up disk space. Do we want to add them to the server if no one is using them?

ejacox commented 7 years ago

Record seems to apply to biosample or individual. I don't think it applies to other things, like datasets. It should be more semantically clear.

ejacox commented 7 years ago

We are proposing to put created and updated into a metadata message. That way, any changes are just in one place, rather than in every message. Additionally, the description needs to be better. I propose: created = "The time the underlying data was produced (i.e. when the experiment or analysis was run)." updated = "The time the underlying data changed. This time is independent of when the data was loaded."

These fields describe the data. Any dates associated with determining what data was available on a particular server at a particular time would be separate.

mbaudis commented 7 years ago

@ejacox The specification "underlying data" is a good change, to separate from server timestamps. Apart from that, any record should have these as optional; however, the API may be designed as to evaluate records for their overall latest change (but that wouldn't make sense for creation ... anyway confuses things, and I don't really see the problem?).

mbaudis commented 7 years ago

Modification: Created, Updated don't have to be part of Variant (but Variantset, Callset). The comments were not specific enough to make this clear.

andrewjesaitis commented 7 years ago

+1 to @ejacox that create/update should refer to the data. +1 to @mbaudis that they should be optional

My only remaining question is if there really a practical differentiation between created and updated at the data level. I mean if the analysis/alignment/etc is rerun if that really an update or is it just all new data? But, (to argue with myself) in the case of phenotype data, or sample attribute collection, I could see the need to differentiate between creation and update (eg we mislabeled all the samples and needed to correct the error). So I'd say to be consistent across the api keep both everywhere and make them optional.

I think it's important to keep in mind that the schema is only defining what should come back in the message and not how the data is stored.

david4096 commented 7 years ago

I've placed the created and updated fields in their own message in common.proto and moved that throughout the API.

The VariantSet.metadata message was renamed to VariantSet.variant_set_metadata because of the name collision.

ejacox commented 7 years ago

+1

david4096 commented 7 years ago

This PR is on hold while we review how best to approach timestamps in the API. https://github.com/ga4gh/schemas/issues/839