ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
214 stars 114 forks

GA4GH APIs need to address scientific reproducibility (propose immutable datatypes) #142

Closed. benedictpaten closed this issue 9 years ago.

benedictpaten commented 10 years ago

Consider a researcher who writes a script against the GA4GH APIs, accesses data, and publishes the results. The current APIs do not guarantee that subsequent researchers will get the same result when running the original script, so the published results are not assured to be reproducible.

If the GA4GH APIs are really going to change the way bioinformatics is done, they need to facilitate the reproducibility of results. In order for results to be reproducible, one needs to be able to obtain exactly the same data and associated metadata that were used in an experiment. For the GA4GH APIs this means that every time a given data object is returned it is always the same. In other words, the APIs must present data as immutable: data objects are never modified; instead, new derived versions are created.
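As a rough sketch of what immutable, derived-version semantics could look like on the server side (the class and method names here are illustrative only, not part of any GA4GH schema), assuming Python:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class ImmutableRecord:
    """Hypothetical immutable data object: 'updates' create new objects."""
    payload: dict
    derived_from: Optional[str] = None
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def with_changes(self, **updates) -> "ImmutableRecord":
        """Return a new record with a new id; the original is never modified."""
        return ImmutableRecord(payload={**self.payload, **updates},
                               derived_from=self.id)

v1 = ImmutableRecord(payload={"sample": "NA12878", "coverage": 30})
v2 = v1.with_changes(coverage=45)     # new object, new id
assert v1.payload["coverage"] == 30   # the published object is unchanged
assert v2.derived_from == v1.id       # provenance link back to the original
```

Under this model, an ID cited in a publication always resolves to the same data, and the derivation chain records how later versions were produced.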

Mark Diekhans, David Haussler and I think this is important to address and that it would be relatively straightforward to implement immutability into an update of the v0.5 API. What do people think?

massie commented 9 years ago

@delagoya Agree.

For the data content, we can use a position-independent hash like CRC32 where order doesn't matter (you can feed bytes in any order you like). The CRC32 will allow us to validate data integrity but, being only 32 bits, will not provide the unique identifier we need, since collisions are fairly common.
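For concreteness, here is a minimal sketch (standard-library Python only; the payload is made up) computing both a CRC32 checksum and a 256-bit cryptographic digest over the same bytes. The 32-bit CRC is cheap and fine for catching corruption, but as noted above its collision rate rules it out as a unique identifier; a SHA-256-style digest is what a content-derived identifier would more plausibly be built on:

```python
import hashlib
import zlib

def checksums(data: bytes) -> dict:
    """Cheap 32-bit integrity check plus a collision-resistant 256-bit digest."""
    return {
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

print(checksums(b"ACGTACGTTTGACC"))  # illustrative payload only
```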

diekhans commented 9 years ago

Michael Baudis notifications@github.com writes:

... and for immutable objects (that is, most likely raw data, reads ... ?) you can add a hashedID. Many of the metadata objects (e.g. GAIndividual) will change content over time.

Ah, but that is the point of making all objects immutable. Objects don't change; new ones, with new IDs, are created from the old ones. It solves version tracking and state management.

Getting the metadata corresponding to a particular experiment is as important as getting the data. For example, trying to understand why you get different results than a previous experiment might hinge on the fact that the metadata was wrong in the previous experiment.

We have spent an incredible amount of time trying to straighten out a metadata mess for only ~6200 sequencing runs because the metadata was mutable and modified with no way to track what was done.

pgrosu commented 9 years ago

One small request, if possible. Since some aligners such as SNAP use seed strings, can we have our API automatically generate/update a variety of inverted indices keyed on the genome seed strings, storing information about the reads/variants/annotations/etc. for faster searches? This concept is used in Information Retrieval and would help tremendously with a lot of the later analysis, plus the update step would be fairly fast. Here are a couple of examples:

[Image: inv_index_call, an example inverted index keyed by genome seed strings]

Or for annotations we can have it reversed:

[Image: inv_index_disease, the reversed example keyed by annotations]

These can be distributed in parallel using Parquet, as in ADAM, or we can adapt to other possibilities depending on what processing we want to perform. This can be extended to parallel updates for variant calling and annotations, though some changes would need to be implemented. Also, since genome assemblies of the same species have only minor variations relative to the whole genome, only a few small new seeds would need to be added, with their associated information updated.
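As a toy, in-memory illustration of the seed-string inverted index idea (the k-mer length, read IDs, and function name below are invented for the example; a real deployment would shard the index, e.g. over Parquet as suggested above):

```python
from collections import defaultdict

def build_seed_index(reads, k=8):
    """Toy inverted index: map each k-mer 'seed' to the IDs of reads containing it."""
    index = defaultdict(set)
    for read_id, sequence in reads.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(read_id)
    return index

reads = {
    "read_001": "ACGTACGTGGTTAACC",
    "read_002": "GGTTAACCACGTACGT",
}
index = build_seed_index(reads, k=8)
print(index["GGTTAACC"])  # -> {'read_001', 'read_002'}
```

The same structure works in the reverse direction for annotations, with disease or feature terms as keys.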

lh3 commented 9 years ago

For a more concrete proposal, I suggest we add union {string,null} stableName=null to GAVariantSet (sorry that I was saying GACallSet but I really meant GAVariantSet) and allow it to be requested. If stableName is not null, the whole VariantSet should not be updated, though the data associated with it may be completely deleted later. If the data is deleted, the stableName should not be reused in future. The stableName could be accessions or UUIDs, entirely up to the implementors to decide. Hashes and UUIDs for all objects are useful, but we may discuss these in another thread.

For other objects, we can add stableNames to objects currently accessioned by SRA.

Alternatively we can add two fields string stableName; bool released; to GAVariantSet. This may be cleaner and more flexible.
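A hypothetical enforcement sketch for either variant of the proposal (the field names follow the suggestion above; nothing here is part of the actual schema): once a VariantSet has a stableName and is released, updates are refused, deletion remains possible, and a deleted stableName is never reused.

```python
class ImmutableVariantSetError(Exception):
    pass

class VariantSetStore:
    """Toy store: released sets cannot change; deleted names are retired forever."""

    def __init__(self):
        self._records = {}  # stableName -> record dict, or None once deleted

    def update(self, stable_name, changes):
        record = self._records.get(stable_name)
        if stable_name in self._records and record is None:
            raise ImmutableVariantSetError(f"{stable_name} was deleted; the name is retired")
        if record is not None and record.get("released"):
            raise ImmutableVariantSetError(f"{stable_name} is released and cannot be updated")
        self._records[stable_name] = {**(record or {}), **changes}

    def delete(self, stable_name):
        # The data is dropped, but the stableName stays reserved so it is never reused.
        self._records[stable_name] = None
```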

pgrosu commented 9 years ago

@lh3, I know what you're trying to say, but if it's considered stable then that's a tautology, and if the data can be erased then that's a contradiction. It is either locked or it is not stable. Usually there are several levels of promotion for a dataset. Again, this can lead to a whole mess, from what I've seen in the past. You need an organizational layer of structure with oversight.

vadimzalunin commented 9 years ago

@pgrosu agreed, this should be a one-way road: DRAFT->FINAL->SUPPRESSED. There is no need to imply stability just from the name. Since this is so important, why not have a separate status for it? The API should be explicit where it matters. PS: Suppressed SRA objects still have stable names and must be accessible.
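A minimal sketch of such a one-way status field (the enum values mirror the DRAFT->FINAL->SUPPRESSED wording above; the function name and placement are hypothetical):

```python
from enum import Enum

class Status(Enum):
    DRAFT = "draft"
    FINAL = "final"
    SUPPRESSED = "suppressed"

# The only legal transitions; everything else is rejected.
_ALLOWED = {(Status.DRAFT, Status.FINAL), (Status.FINAL, Status.SUPPRESSED)}

def advance(current: Status, target: Status) -> Status:
    if (current, target) not in _ALLOWED:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target

advance(Status.DRAFT, Status.FINAL)    # ok
# advance(Status.FINAL, Status.DRAFT)  # would raise: the road is one-way
```

A suppressed object would keep its stable name and stay resolvable, as with SRA; only its status changes.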

pgrosu commented 9 years ago

@vadimzalunin, I agree with the one-way approach, and to capture it and the rest, I posted it as a new issue as #143.

adamnovak commented 9 years ago

Integrating over everyone, it looks like we need:

1) IDs we can use to retrieve data sets and be sure we got the same data as was used in a publication (if we get it at all).

2) The ability to cheaply update/replace data sets when we're testing pipelines or adding samples to experiments or adding annotations to existing data sets.

I agree with the idea of "don't give global IDs to data sets that you want to update". Whether those IDs should be hashes or not is not really clear to me. I'm not sure the cost of rehashing on updates will be high in practice, since you don't really take a freeze, give it a minor update, and declare it a new freeze. However, the finickiness of getting everything down to identical bits to be hashed is a hard problem.

fnothaft commented 9 years ago

Moving over from #135, cc @benedictpaten @cassiedoll @pgrosu @diekhans, also CC @massie who I know is interested.

Hi people,

This is something we will discuss at ASHG (Stephen marked it as an ASHG topic, thanks!). Gil and I think it would be good to get a point person for each ASHG topic. I nominate (he can disagree) Mark Diekhans as the point person for this reproducibility issue. He created some nice slides on his views (which I share) of both the issue and how we might tackle it:

https://www.dropbox.com/s/v8gu5rlo9yaeack/ga4gh-functional-objects.pdf?dl=0

From a quick glance, this looks reasonable; one concern with the pointing approach from the last slide arises with respect to deletions. E.g., if you point by reference and delete GAReadGroupSet 00100, do you then recursively try to delete GAReadGroup 00200 & 00300? If you don't, do you then need to manually reclaim blocks, do you "garbage collect" un-referenced blocks, etc? This decision will impact the API semantics/implementation.
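To make the deletion question concrete, here is a rough sketch (the IDs come from the example above; the second set and the layout are invented) of the "garbage collect un-referenced blocks" option: after a GAReadGroupSet is deleted, a sweep finds the GAReadGroups that no remaining set still points at.

```python
# Hypothetical in-memory layout; 00100/00200/00300 follow the example above.
read_group_sets = {
    "00100": {"read_groups": ["00200", "00300"]},
    "00101": {"read_groups": ["00300"]},
}
read_groups = {"00200": {}, "00300": {}}

def unreferenced_read_groups(sets, groups):
    """Return GAReadGroup IDs that no surviving GAReadGroupSet references."""
    live = {rg for s in sets.values() for rg in s["read_groups"]}
    return set(groups) - live

del read_group_sets["00100"]  # delete the set only, not its members
print(unreferenced_read_groups(read_group_sets, read_groups))  # -> {'00200'}
```

Whether such a sweep runs eagerly, lazily, or not at all is exactly the semantic choice the API would have to pin down.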

cassiedoll commented 9 years ago

(I'm removing the task team labels, as this is now covered under the ASHG topic umbrella)

benedictpaten commented 9 years ago

Thanks Frank, deleted the post from the other topic.

tetron commented 9 years ago

Content addressing (identifying data by a hash of its contents) is a very powerful technique, forming the basis for systems such as Git and Arvados Keep. Deriving the identifier from the content, as opposed to assigning a random database ID, enables third-party verification that the content and identifier match. When it is necessary to assign a human-readable name and logically update a dataset, one can use techniques like Git branches or Arvados collections, which use an updatable name record that simply points to a specific content hash.

Even if the underlying database does not support versioning, so past versions are not stored and are thus inaccessible, providing a content hash field at least makes it known that the content has changed substantively, in a way that is not captured by a simple timestamp field.

One challenge with hashing is that it is essential to define a bit-for-bit precise "normalized form" for a given data record so that different implementations will produce the same identifier given the same data. When using structured text formats such as JSON, this is tricky because differences in whitespace and object key ordering don't change the semantics of the actual record but will change the computed hash identifier.
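To make the normalization point concrete, here is a minimal Python sketch of one possible normalized form (sorted keys, fixed separators, UTF-8 encoding); a real scheme would also have to pin down details such as floating-point formatting and null handling, which this ignores:

```python
import hashlib
import json

def content_id(record: dict) -> str:
    """Hash a record in a normalized form so key order and whitespace don't matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

a = {"name": "variantSet1", "referenceSetId": "GRCh38"}
b = {"referenceSetId": "GRCh38", "name": "variantSet1"}  # same record, different key order
assert content_id(a) == content_id(b)
```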

Hash identifiers can be computed and provided alongside existing database identifiers, so it is not necessary to choose to use one or the other (although users may need to be educated when to use one or the other).

larssono commented 9 years ago

I realize that I am joining the conversation rather late, but scanning through the discussion it seems that many of the ideas overlap and are related, so I will drop my 2c. In order to record provenance (and by provenance I don't necessarily mean being able to reproduce a result identically to double precision, but to reproduce it in principle), it is necessary to store versions. And if you are storing versions, it is no longer enough to reference elements only by a globally unique identifier; the relationship between identifiers as different versions of each other is important. Furthermore, it becomes useful to publish combinations of versions as freezes, much the same way that software commits can be tagged for a release.

In Synapse we have taken the approach that every item is referenceable by three methods: an accession ID, a globally unique identifier (i.e., a hash of the data), and the provenance that generated it. A piece of data has one accession, and each of its versions has a version number. So, for example, a piece of data might have accession syn123, and version 2 of it would be accessible as syn123.2 (not specifying a version returns the latest version). These versions can also be retrieved by an md5 hash of the data, or by traversing a graph of the provenance as specified by the W3C PROV spec (http://www.w3.org/TR/2013/REC-prov-dm-20130430/).
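A generic sketch of that accession.version style of resolution (this is not the Synapse client; the store layout and names are hypothetical):

```python
versions = {
    "syn123": {1: {"data": "first release"}, 2: {"data": "updated release"}},
}

def resolve(identifier):
    """Resolve 'syn123' to the latest version, or 'syn123.2' to that exact version."""
    accession, _, version = identifier.partition(".")
    entries = versions[accession]
    return entries[int(version)] if version else entries[max(entries)]

resolve("syn123")    # latest version (2)
resolve("syn123.2")  # pinned version, suitable for citing in a publication
```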

delagoya commented 9 years ago

Closing this issue for lack of a PR or recent comments. It seems that #167 takes precedence for this issue.

awz commented 9 years ago

@delagoya is the Containers and Workflows task team working on this issue? Maybe wait until they make progress before closing this issue? It seems to be pretty important, and it has quite a lot of content that is referenced by #167. Also pinging @fnothaft and @tetron, since they may be happy with closing this.