ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

GA4GH APIs need to address scientific reproducibility (propose immutable datatypes) #142

Closed benedictpaten closed 9 years ago

benedictpaten commented 9 years ago

Consider a researcher who writes a script against the GA4GH APIs, accesses data and publishes the results. The current APIs do not guarantee that subsequent researchers will get the same result when running the original script, therefore the published results are not assured to be reproducible.

If the GA4GH APIs are really going to change the way bioinformatics is done, they need to facilitate the reproducibility of results. For results to be reproducible, one needs to be able to obtain exactly the same data and associated metadata that were used in an experiment. For the GA4GH APIs this means that every time a given data object is returned, it is always the same. The APIs must therefore present data as immutable: data objects are never modified; instead, new derived versions are created.

Mark Diekhans, David Haussler and I think this is important to address and that it would be relatively straightforward to implement immutability into an update of the v0.5 API. What do people think?

haussler commented 9 years ago

Yes, very important and fundamental to support reproducible scientific and medical analysis.


pgrosu commented 9 years ago

Absolutely agree! This is a given axiom of science, and we must have it as a requirement. I can't count how many times I had to take a paper and reconstruct the steps to try to get the same results, if possible. Needless to say, it was usually a painful process. In industry, we had a more stringent set of criteria as part of our QA/validation process, which guaranteed that the data, analysis and any processing remained consistent between versions. Any change was required to satisfy a very detailed set of written criteria and to pass an agreed-upon set of tests, accompanied by quite a lot of documentation.

cassiedoll commented 9 years ago

I do not agree that this is a good idea for Variants. Read and Reference data is fairly fixed, but Variant data should be allowed to change for at least some period of time.

One of our best use cases over here is that we will help users take a continuous stream of per-sample VCF files and merge them into one logical set of Variants - which will make population analysis much easier. (Imagine you are sequencing and calling 10 samples a week over the course of a year or something)

Eventually I agree that you might want to say "this data is all done now - go ahead and depend on it forever", but the time at which that occurs is not always == creation time.

fnothaft commented 9 years ago

@cassiedoll I get what you're saying and both agree and disagree. For variants, I think some things should be immutable. Specifically, once you've got a final (recalibrated) read set, you should be able to generate "canonical" genotype likelihoods from those reads. I agree with you that final genotype calls for a single sample will depend on the sample set that you joint call against, but fundamentally, that's not changing the genotype likelihoods, it's just changing the prior.

The correct approach IMO is to ensure immutability per program run/lineage; e.g., if I process a data set (with a specific toolchain and settings), I can't go back and reprocess part of that data with a new toolchain, or new settings, and overwrite the data. If I reprocess the data, I should wholly rewrite my dataset with new program group/lineage information.
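
A minimal Scala sketch of the per-lineage immutability described above, using a hypothetical in-memory store (none of these names are GA4GH API types): results recorded under one lineage can never be overwritten, and reprocessing with a new toolchain or new settings must register a new lineage.

import scala.collection.mutable

// Hypothetical descriptor of how a dataset was produced: toolchain, version, settings.
case class Lineage(toolchain: String, version: String, settings: Map[String, String])

// A store that is write-once per lineage: results recorded under an existing lineage
// can never be overwritten; reprocessing must register a new lineage.
class LineageScopedStore[A] {
  private val runs = mutable.Map.empty[Lineage, Vector[A]]

  def write(lineage: Lineage, records: Vector[A]): Either[String, Unit] =
    if (runs.contains(lineage))
      Left(s"lineage already recorded: $lineage; reprocess under a new lineage")
    else {
      runs(lineage) = records
      Right(())
    }

  def read(lineage: Lineage): Option[Vector[A]] = runs.get(lineage)
}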

benedictpaten commented 9 years ago

@cassiedoll: You want the flexibility behind the API to create variant sets that are transient, i.e. derived from a source dataset but not stored, and therefore having no permanent UUID? What happens when the user wants to publish and reference this dataset? While convenient, I think this is antithetical to the goals of a storage API.


mcvean commented 9 years ago

There's clearly a need for versioned immutable data for reproducibility. However, variant calls are an inference that may well change. Presumably we want to be able to support requests such as 'give me the genotype for this individual that would have been returned on this date'.

delagoya commented 9 years ago

@benedictpaten I don't think that is what @cassiedoll is getting at, but I'll let her reply.

The underlying alignments and variant calls for a given genomic sequence set will be context dependent and will change over time as it is re-analyzed. These changed result sets are new data sets. Data provenance is always an issue, but there are efforts in the use of runtime metadata to track data analysis workflows. I think that these other frameworks are sufficient for this request, and should be specified outside of the datastore API.

I am also a bit hard pressed to see how this can be easily implemented as part of the API without significant interface and use case changes. For example, how would you implement this as a formal part of the API (i.e. not just in the documentation) without requiring some time-based component in all of the API calls? Here time/date parameters are acting as a proxy for runtime metadata, so why not rely on metadata queries to get the proper result set?

benedictpaten commented 9 years ago

On Wed, Sep 10, 2014 at 2:19 PM, Gil McVean notifications@github.com wrote:

There's clearly a need for versioned immutable data for reproducibility. However, variant calls are an inference that may well change.

Yes, they are inference, but that does not stop one from wanting to refer concretely to a set of inferences, even if subsequently they are changed/improved - it helps to untangle, as Paul Grosu nicely points out, the ingredients that led to a conclusion.

Presumably we want to be able to support requests such as 'give me the genotype for this individual that would have been returned on this date'.

Yes! We could support that very easily by moving to an immutable system.


richarddurbin commented 9 years ago

I think this conversation is confusing the API and the data store.

It may well be good practice to have data stores that store immutable objects. GA4GH can encourage that and the API should definitely support it.

But of course I should be allowed to use the API over transient representations that I make locally for exploratory or other purposes. We do this sort of thing all the time. Telling me that the fact of accessing a data set through the API means that it has to be permanent and immutable is crazy. Maybe I want to transiently consider alternative alignments and make new calls from them using standard GA4GH calling software - I should not be bound to store everything I ever do for ever.

So, I think Benedict's reasonable request concerns long term data stores, not the API as such.

Richard


diekhans commented 9 years ago

We believe that immutability is essential for all data. The variant use case Cassie describes doesn't relate to mutability but to versioning, that is, when and for how long you keep a given version of a data set.

One of the main tasks when re-running an analysis is comparing against a previous result, or maybe several previous runs. Each run would create a new set of immutable objects with unique ids. Once one decides on a final version, the previous versions could be deleted. Queries for those previous versions would return an error, possibly with the unique id of the newest version.

This allows support for as many versions of data as needed, without confusion about which version one is working with.

Immutability is a computer science concept dating back to the 1950s that many of us are relearning. Its huge advantage is that it greatly simplifies data management for both the producer and the consumer of the data.

Not following the principle of all data being immutable and having a unique id is one of the major reasons behind the current bioinformatics data mess. The only way to make an experiment reproducible is to save all of the data files used and become the distributor of the data.
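
A minimal Scala sketch of the versioning behaviour described above, with hypothetical names (nothing here is a GA4GH schema type): each published version gets its own unique id, retiring an old version records which id superseded it, and a query for a dropped version returns that newer id instead of silently returning changed data.

// Hypothetical result of fetching a dataset version by its unique id: it may exist,
// it may have been deleted in favour of a newer version (whose id is returned),
// or it may be unknown entirely.
sealed trait VersionLookup[+A]
case class Found[A](data: A) extends VersionLookup[A]
case class Superseded(newestId: String) extends VersionLookup[Nothing]
case object Unknown extends VersionLookup[Nothing]

class VersionedStore[A] {
  private var live = Map.empty[String, A]            // retained versions, by unique id
  private var replacedBy = Map.empty[String, String] // deleted id -> id of newest version

  def publish(id: String, data: A): Unit = live += (id -> data)

  // Drop an old version but remember which id superseded it, so queries can redirect.
  def retire(oldId: String, newestId: String): Unit = {
    live -= oldId
    replacedBy += (oldId -> newestId)
  }

  def get(id: String): VersionLookup[A] =
    live.get(id) match {
      case Some(data) => Found(data)
      case None       => replacedBy.get(id).map(Superseded(_)).getOrElse(Unknown)
    }
}

For example, after publish("v1", callsV1), publish("v2", callsV2) and retire("v1", "v2"), a call to get("v1") returns Superseded("v2") rather than pretending the old data still exists.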

cassiedoll commented 9 years ago

I think I'm just coming from a slightly different world. Some of our customers over here don't have all of their data right now. Let's pretend that they are receiving one new BAM file a day. They might, for example, call variants on each new sample as it arrives and merge those calls into their existing logical set of Variants (the continuous per-sample merging I described above).

I think generally though, our customers should have the right to do whatever they want to the data. What if they want to delete an entire VariantSet? In a perfectly immutable world, they wouldn't be able to. It may possibly ruin one of those provenance chains.

That's not okay with us though - that choice should be in the hands of our users. If I made some VariantSet, realized I had a small bug and called everything incorrectly - I should be allowed to delete it without having to prove that there aren't any users of that data. As a user, it's my responsibility to ensure that I'm not screwing up some downstream dependency - this should not be a burden on the API provider.

Let's additionally pretend that I had some new info tag I was messing around with. I should be able to run some analysis on my Variants, come up with my snazzy info tag, and store it back into the API. I shouldn't have to have a whole new VariantSet while I'm just running a bunch of test analysis on my data - and I also shouldn't have to resort to storing that test analysis in some random text file.

I could come up with many more examples here - but basically, this is the user's responsibility and should not be the job of API implementors who do not have all the necessary context.

cassiedoll commented 9 years ago

+1 to @richarddurbin

fnothaft commented 9 years ago

I think generally though, our customers should have the right to do whatever they want to the data. What if they want to delete an entire VariantSet? In a perfectly immutable world, they wouldn't be able to. It may possibly ruin one of those provenance chains.

That's not okay with us though - that choice should be in the hands of our users. If I made some VariantSet, realized I had a small bug and called everything incorrectly - I should be allowed to delete it without having to prove that there aren't any users of that data. As a user, it's my responsibility to ensure that I'm not screwing up some downstream dependency - this should not be a burden on the API provider.

I agree here; this may have been unclear in my earlier email, but I envision the data being immutable with the exception of delete. In-place update is disallowed, but delete is OK. Practically, you can't forbid delete; that just makes your data management problems worse...

To be realistic, we're not going to solve the "reproducibility crisis" by mandating immutability. However, we will significantly reduce our implementation flexibility, and as @cassiedoll is pointing out, this enforces pretty strict limitations on how the users of our system can manage their data. If you're using the GA4GH APIs to implement an archival datastore, sure, immutability makes sense: archival implies write once, read many times. If you're using the GA4GH APIs to access a scratch space (@cassiedoll's example of n + 1 calling), immutability may not be what you want.

diekhans commented 9 years ago

Hi Richard, immutability doesn't mean keeping data forever; data can be deleted, just as immutable objects in memory can be garbage collected. It simply means that once an object is published with a unique id, it never changes. Any change results in a new logical object with a new id.

benedictpaten commented 9 years ago

+1 for @diekhans comment.

Concretely, consider adding a UUID to each of the container types, e.g. readGroup, readGroupSet, etc. The only rule is that the UUID is updated any time the container changes in any way.

For persistent storage APIs the UUID acts as a way of referencing a specific instance of a dataset. For transient stores the UUID could be NULL, if no mechanism for subsequent retrieval is provided, or it could be provided for caching purposes, with no guarantee that the instance will be retrievable for the long term.

To implement simple versioning, as with version control, a function could be provided which takes the UUID of a given container and returns any UUIDs that refer to containers directly derived from that container. An inverse of this function could also be provided. Given versioning, support for @mcvean's query would be a straightforward extension.

For sanity, we would probably want to add a query to distinguish API instances that attempt to provide persistent storage from those that are naturally transient.

If reproducibility isn't supported in our APIs, we'll either be relying on convention or, worse, leading people to download all the datasets they use for publication and host them as files(!) for later research.
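
A minimal Scala sketch of the container-level UUID and the optional derivation queries proposed above; every name here is hypothetical rather than a GA4GH schema definition.

import java.util.UUID

// Hypothetical container carrying a UUID that is regenerated on any change, plus a
// pointer to the container it was derived from (None for an original dataset).
case class CallSetContainer(
    uuid: UUID,
    derivedFrom: Option[UUID],
    info: Map[String, String]) {

  // Any modification yields a new container with a fresh UUID that remembers its
  // parent; the original container is left untouched and keeps its old UUID.
  def withInfo(key: String, value: String): CallSetContainer =
    copy(uuid = UUID.randomUUID(), derivedFrom = Some(uuid), info = info + (key -> value))
}

// The optional derivation queries: given a UUID, find its parent or its derivatives.
class DerivationIndex(containers: Seq[CallSetContainer]) {
  def parentOf(id: UUID): Option[UUID] =
    containers.find(_.uuid == id).flatMap(_.derivedFrom)

  def derivativesOf(id: UUID): Seq[UUID] =
    containers.filter(_.derivedFrom.contains(id)).map(_.uuid)
}

The point of this shape is that a change never touches the original container: it produces a sibling with a fresh UUID whose derivedFrom field is the provenance link the derivation queries walk, and a transient store remains free to discard old containers without breaking the rule.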

delagoya commented 9 years ago

I am in support of UUID meaning static data set. I am not in support of requiring stronger data versioning capabilities such as date/time parameters or requiring version tracking

cassiedoll commented 9 years ago

@benedictpaten your uuid could kinda be handled by the updated field we already have on our objects.

The updated field is always updated whenever an object is changed (just like your uuid), and an API provider could definitely provide a way to look up an object at some past time - but I'm with Angel in that we shouldn't require this functionality.

fnothaft commented 9 years ago

I am in support of UUID meaning static data set. I am not in support of requiring stronger data versioning capabilities such as date/time parameters or requiring version tracking

+1 @delagoya; we should implement a genomics API, and avoid going down the rathole of building a version control system. If the presence of a UUID means that the dataset is static, our API is fine. UUID assignment is a metadata problem and we shouldn't tackle it in the reads/variants APIs.

cassiedoll commented 9 years ago

@fnothaft - metadata objects are coming to an API near you real soon now. #136 will definitely affect us.

so while I agree with your conclusion :) I do think we can't simply call it a metadata problem - cause metadata problems are now our problems, lol

benedictpaten commented 9 years ago

On Wed, Sep 10, 2014 at 5:26 PM, Angel Pizarro notifications@github.com wrote:

I am in support of UUID meaning static data set. I am not in support of requiring stronger data versioning capabilities such as date/time parameters or requiring version tracking

Great: static = immutable. I would not require version control either - just the potential for it as an optional, simple function. Consider @cassiedoll's example, where a user wants to make a small change to a dataset. A simple derivation function could be very useful for understanding data provenance and avoiding mess. I am not arguing we should go any further, or mandate it.


fnothaft commented 9 years ago

@cassiedoll definitely. My point is just that as long as our read/variant API defines clear semantics for what it means for a UUID to be set/null (dataset is immutable/mutable), we can delegate (dataset level) UUID assignment to the metadata API.

pgrosu commented 9 years ago

I agree that the UUID approach, with a standard set of tests and data for validation/timing, would suffice for now, but I have a simple question :) What if a drug company is required to keep an audit trail for everything that went into the development of the drug over many years? Part of that would be the whole NGS processing and analysis platform. This would mean reads, variants, pipelines, analysis processes (including results) and all validation along the way. This can mean repeated reprocessing of the same data through variations of the same pipeline - with different settings - on different dates for comparison and validation purposes. I know versioning is not something we want to explore now, but many commercial products support Versioned Data Management for good reason (e.g. Google's Mesa, Oracle, etc.). Is this something to be handled at a later time with a different version of the schema, or would it be beyond the scope of what we want to deliver?

fnothaft commented 9 years ago

What if a drug company is required to keep an audit trail for everything that went into the development of the drug over many years. Part of that would be the whole NGS processing and analysis platform. This would mean reads, variants, pipelines, analysis processes (including results) and all validation along the way. This can mean repeated reprocessing of the same data through variations of the same pipeline - with different settings - on different dates for comparison and validation purposes. I know versioning is not something we want to explore now, but many commercial products support Versioned Data Management for good reason (i.e. Google's Mesa, Oracle, etc.). Is this something to be handled at a later time with a different version of the schema, or would it be beyond the scope of what we want to deliver?

Others may disagree, but IMO, way beyond the scope of what we should want to deliver. It is difficult enough to build an application-specific, end-to-end replication framework for one data and processing environment; I don't think it is tractable to build a "one-size-fits-all" end-to-end pipeline reproducibility solution that is blind to the computing environment, the toolchain, and the way users access data and scripts.

lh3 commented 9 years ago

Back to the very beginning of the thread (as I am reading it now). In the current non-API world, reproducibility is somehow achieved by data releases/freezes. Ensembl, UCSC, 1000g, hapmap and GRC, among the many other databases and projects, keep different releases of processed data for several years on FTP (PS: UCSC/Ensembl also keep complete live web interfaces to older data). In a paper, we just say what release number we are using. This approach is not ideal, but usually works well in practice.

From the discussion above, the scenario I am imagining is: a user can submit different releases of data over time and request each release to be static/readonly. We keep different releases as independent CallSets, with versioning or not, for some time and then drop old ones gradually. Each released CallSet is referenced by a UUID (or by an accession number). Some call sets may be dynamic; those would not have UUIDs. In addition, for processed data like variants, we can afford to keep multiple releases. For raw data like read alignments, we probably wouldn't want to keep more than one release.

Is this what people are thinking of?

fnothaft commented 9 years ago

From the discussion above, the scenario I am imagining is: a user can submit different releases of data over time and request each release to be static/readonly. We keep different releases as independent CallSets, with versioning or not, for some time and then drop old ones gradually. Each released CallSet is referenced by a UUID (or by an accession number). Some call sets may be dynamic; those would not have UUIDs. In addition, for processed data like variants, we can afford to keep multiple releases. For raw data like read alignments, we probably wouldn't want to keep more than one release.

+1, I generally agree.

In a paper, we just say what release number we are using. This approach is not ideal, but usually works well in practice.

Agreed, reproducibility is complex; I think documenting your setup/workflow and ensuring that you can get the correct data gets you 95% of the way there. Documenting your setup/workflow is largely human factors engineering, so that's out of the scope of our API, but I think the UUID approach suggested above will address the problem of getting the correct data.

<digression> I think a lot of people are doing good work with container based approaches, but smart folks are also making good points about the limitations of these approaches.

If you want to make it to 100% reproducibility, it is a hard but doable struggle. In a past life, I implemented an end-to-end reproducibility system for semiconductor design. Alas, it's not genomics, but there's a fair bit of cross-over between the two. We were able to build this system because we had complete control over:

  1. The computing environment (OS version/installation, environment variable setup, disk mount points and network setup, etc.)
  2. The way users accessed data and scripts
  3. Version control for: scripts to run tools, tool installations, and all of the data

The system took over a year and a half to build with about 3-4 FTEs, took several prototypes, and was very application/environment specific. It was a massive undertaking, but we could reproduce several-year-old protocols on several-year-old datasets with full concordance. Extreme reproducibility is a great goal, but it is really hard to achieve, and a reads/variants access API is the wrong place to implement it. </digression>

richarddurbin commented 9 years ago

I am sympathetic about some of these ideas (see the PS), but still think that people are thinking in terms of properties of a data store rather than an API to compute with.

I'd like to think about the consequences of this proposal. If I want to calculate a new quality control metric for each call in a call set, and add it as a new key:value attribute in the info map, what would have to change? Will I end up with two complete call sets, and if I query will I get one by default, or if not, how will I know which one to use? How far up the chain does this go - do I get a new study when I add a new call set to the study?

What will be the time and memory implementation cost of this proposal? I am a bit concerned that we are losing sight of the fact that we need to deal at scale. Real systems will need to handle 100,000 full genome sequences within a year - yesterday I was on a pair of calls where we have 32,000 full genome sequences and are planning to impute into over 100,000 in the next 6 months. We won't switch to GA4GH unless it works better at that scale than what we have. I'd like some guidance from Google.
To what extent, when thinking about computing on petabyte-scale data structures, do you think about formal desiderata like putting uuids on each object and requiring immutability, and to what extent do you think about the implementation being lean and restricted to what is required to deliver the goals?

My current position is still to think that this should be an optional add-on, not a required part of the design. Our primary goal should be to access and compute on genomic sequence data as cleanly and efficiently as possible. Other things should be optional extensions.

Richard

PS As it happens, I worked on a non-standard database system for genomic data 20 or so years ago called Acedb that also supported the ability to retrieve objects from arbitrary times in the past. It kept data in low-level (typed) tree structures and, rather than deleting old or changed branches, kept them in a shadow mode that meant they were ignored in normal operations but could be retrieved on request. Functionality a bit like Time Machine on Macs, but I presume implemented differently. Anyway, it supported complete history within objects and was lightweight in normal operation. (Interestingly, it was also used for a time by Intel for chip testing data.)


pcingola commented 9 years ago

Immutability is a sufficient condition for reproducibility, but not a necessary one. I prefer a 'data freeze' (as mentioned by @lh3) which seems leaner and faster for the scales we have in mind.

Creating immutable duplicates of all variants just because we decided to add "Allele Frequency" information seems like way too much of a burden, not to mention that we would have to either re-calculate or copy all data that depends on each variant record (such as functional annotations).

The idea of "setting UUID" = "data freeze" could be implementable, but only for some "main" objects. As @richarddurbin mentioned, adding UUIDs to all objects seems unfeasible for the scales we have in mind. Setting UUID for each variant might be doable, but setting a UUID for each call in each variant is not efficient.

Pablo

P.S.: I also had the painful experience of implementing fully reproducible (financial) systems. My advice is the same as @fnothaft: Full reproducibility is a massive undertaking, don't go there.

fnothaft commented 9 years ago

+1 to @richarddurbin

@pcingola

P.S.: I also had the painful experience of implementing fully reproducible (financial) systems. My advice is the same as @fnothaft: Full reproducibility is a massive undertaking, don't go there.

Indeed; I'd also note that in the fully reproducible system I was working on, data wasn't immutable. The reason we had a fully reproducible system was so we could easily make critical engineering changes to products that had not been modified in several years. All data was versioned and had no delete option, and we had to eat the cost of keeping lots of extra disk around. So, a single snapshot of the database in time was immutable, but it needed to be easy to branch and update from any point in the database.

lh3 commented 9 years ago

I'd like to think about the consequences of this proposal. If I want to calculate a new quality control metric for each call in a call set, and add it as a new key:value attribute in the info map, what would have to change?

Adding new key-value pairs without touching the rest of data is complicated and fairly infrequent for released variant data.

Will I end up with two complete call sets, and if I query will I get one by default, or if not, how will I know which one to use? How far up the chain does this go - do I get a new study when I add a new call set to the study?

This is a good point. Some projects/databases solve this by providing a "latest-release" symbolic link on FTP. We could mimic this behavior in GA4GH such that older releases are not retrieved unless the user asks for them by explicitly specifying UUIDs. This might need some light versioning, though perhaps there are better solutions.

To what extent, when thinking about computing on petabyte-scale data structures, do you think about formal desiderata like putting uuids on each object and requiring immutability, and to what extent do you think about the implementation being lean and restricted to what is required to deliver the goals?

Personally, I am thinking of just generating a UUID for a complete released CallSet (or equivalently a VCF), not for smaller objects. For projects with a continuous flow of new samples, the current practice is still to set a few milestones for data releases. In publications, we do not often use transient data.
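
A small Scala sketch of the "latest-release symlink" behaviour discussed above, with hypothetical names only: releases are frozen under explicit ids, a query without an id resolves to the newest freeze, and older freezes stay addressable until they are dropped.

// Hypothetical registry mimicking an FTP "latest-release" symbolic link: a query with
// no release id resolves to the newest frozen release; older releases stay addressable
// by their explicit ids until they are removed.
class ReleaseRegistry[A] {
  private var releases = Vector.empty[(String, A)] // (releaseId, frozen call set), oldest first

  def freeze(releaseId: String, data: A): Unit =
    releases = releases :+ (releaseId -> data)

  def resolve(releaseId: Option[String] = None): Option[A] = releaseId match {
    case Some(id) => releases.collectFirst { case (`id`, data) => data } // explicit release
    case None     => releases.lastOption.map(_._2)                      // "latest" behaviour
  }
}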

ekg commented 9 years ago

We should leave the versioning for the data storage layer. For instance, see the dat project, which aims to provide revision control and distributed data synchronization for arbitrarily-large files, databases, and data streams. We do not have to solve this problem. It already has a huge amount of attention in a more general context than genomics.

lh3 commented 9 years ago

No, dat is a completely different layer. I doubt ga4gh ever wants to go into this complexity. From my own point of view, all I need is something roughly equivalent to data freezes that I can reference in my papers. It would be a disaster if the data used in my manuscript were dramatically changed and led to a different conclusion during review or immediately after publication.

cassiedoll commented 9 years ago

@lh3 could the updated date work well enough for that? As long as the updated date is older than your paper, you'll know your data is still in its stale state.

(And if it isn't, you'll know someone changed something - and an API backend could choose to help you recover from this if they wanted to - in one of a hundred different ways :)

massie commented 9 years ago

Science requires reproducibility. We all agree on that.

Reproducibility is a hard problem that we don't need to tackle completely now. However, we don't want our APIs to get in the way when others (or we) decide to work on this later -- and they will.

We can also tackle this problem in pieces, focusing on data verification first. Since we all know and use git, I'll use it as an example. Git is a content-addressable filesystem. If you have a git repo and I have a git repo checked out at the same SHA-1, we know we're both looking at the same source code. That guarantee was designed into the system from the start.

While I'm not advocating that we build a full revision control system, I do think defining a standard for hashing over the (sub)content of our objects makes sense. That hash should be stored and exposed (instead of a random UUID) to make it easy (and fast) to create hashes over sets of objects (since we don't want to recalculate them when sets are created).

This design would also allow developers to create tools (similar to git-fsck) to verify the connectivity and validity of our data. It also answers @richarddurbin's question about how to handle reproducibility at scale. If your 100,000 genomes have the same GA4GH hash as mine, we know that we're operating on the same data.

@ekg, as an aside, another project similar to the dat project is git-annex, which enables you to use git to track large binary files without checking them into git (just a symlink is used).

diekhans commented 9 years ago

Richard Durbin notifications@github.com writes:

I am sympathetic about some of these ideas (see the PS), but still think that people are thinking in terms of properties of a data store rather than an API to compute with.

This has to do with the semantics of the data model presented by the API. How does the data change, and what kind of life-cycle can one expect? There doesn't need to be a single life-cycle policy for all data sets; however, the API needs to be able to implement and express the behaviors.

To me, this is something that differentiates an API from a schema.

I think we confused things a bit in our description. Reproducibility is built on both immutability and versioning. Immutability gives a functional programming view of the data where all layers of the system can assume that the data doesn't change in arbitrary ways. This greatly simplifies programming and data management tasks.

Versioning is useful for more than archives, especially in an environment where one is experimenting with algorithms. However, policies can vary in their level of persistence. For a lot of environments, keeping only the latest version makes sense.

The API defining the unit of immutability is required for implementing versioning. For instance, it would be an insane amount of overhead to version every read, and it would be of almost no value. A read group is a very logical immutable unit: normally, it never changes.

Even if a given data source only keeps one version, it's simpler to have one API model that supports 1 or N versions rather than having it diverge. Even if it needs to diverge for efficiency reasons, the semantics need to be defined as part of the API.

I'd like to think about the consequences of this proposal. If I want to calculate a new quality control metric for each call in a call set, and add it as a new key:value attribute in the info map, what would have to change?

A new version of the call set would be created. In a system that only supports one version, this just means assigning a new UUID and maybe recording that the old UUID has been replaced.

Will I end up with two complete call sets,

That depends on the underlying implementation

and if I query will I get one by default, or if not, how will I know which one to use?

There would be different types of queries. Probably the most common just returns the latest version. You can ask for specific versions via UUIDs, which might return an error if the version is not retained.

How far up the chain does this go - do I get a new study when I add a new call set to the study?

That depends on the data model, but I think it would be a bad design to have study -> callset be a strict containment relationship, rather than a relation that is queried. What happens now? If you add a new call set, does the modification time on the study change?

What will be the time and memory implementation cost of this proposal?

For a system that doesn't keep multiple versions, there should be very little difference from updating a modification time. For a system that does keep multiple versions, the immutability requirement facilitates copy-on-write operations, which makes new versions cheap.

I am a bit concerned that we are losing sight of the fact that we need to deal at scale.

I am very concerned about the scalability of the read API in general. I have seen no performance analysis of the current API design. JSON encoding, while better than XML, is not efficient. Single-thread performance still matters.

My current position is still to think that this should be an optional add-on, not a required part of the design.

The important thing is that the design needs to facilitate versioning by having immutability as part of its semantics. The implementation of versioning should be optional, but it would not require a different API.

PS As it happens, I worked on a non-standard database system for genomic data 20 or so years ago called Acedb that also supported the ability to retrieve objects

Nice story!! Glad someone was thinking of it.

lh3 commented 9 years ago

@cassiedoll Typically we take a data freeze and work on that for months before the publication. The data are often analyzed multiple times in different but related ways. If I want to use APIs to access data, as opposed to storing the data locally, I need to get the exact data every time during these months and preferably after the publication. As long as ga4gh can achieve this, I am fine. I don't really care about how it is implemented. A static CallSet is just a simple solution among the many possibilities.

pgrosu commented 9 years ago

@lh3, I understand all publications are precious to their respective authors, but if you look at the collection of data across all of them, then the publications are just blips across this gigantic, ever-growing, yet critical set of data/processed results - especially for clinical studies.

So a couple of years ago there was a publication on the comparison of 1000 Genomes with HapMap. By now this would probably be considered a small study. As @richarddurbin mentioned, what about 100,000 Genomes, what about 1 billion genomes? Will you freeze that for every variant that will be published for a specific study?

That's why I keep mentioning petascale (or larger) data-processing APIs with parallel algorithms and data-structures. Yes, we can have an API for sharing data, but will it scale? I posted #131 for a reason, referring to Google Mesa, Pregel, etc. to expand our approach. Many places either have that in-house, AWS, or some other "cloud" approach which seems to handle such throughput. Will we have this API targeted for the web, or just cloud-based data-centers where the transfer is "local"? So using this approach, will new key:values pairs - or a settings-change in the QC pipeline - propagate across all selected studies, thus generating a duplicate version of the studies within hours for comparison? Having silos of data-freezes might not always be conducive to fully integrated online, updated large-studies. What if the studies become so large that you have to duplicate variants that were made from a collection of reads across 10+ years. Which published version(s) of the silos at different sites should we select, and how should we integrate/update the data in a global variant dataset/study/project for a specific disease? I imagine, that something like the T2D (Type 2 Diabetes) studies at the Broad must be quite large. We're not talking about just a large data-store duplication, but an API that might have trouble handling the throughput to share that data, which should be ready to stream into processing/analysis pipelines.

diekhans commented 9 years ago

Yes, this is precisely the issue. Given a repository that saves versions of data, I have no idea how I could go about retrieving the data matching a given freeze using the GA4GH APIs. I don't think it's possible.

One is back to making snapshots of data and sharing them.


vadimzalunin commented 9 years ago

Let me remind everyone about the existing archives (yes, I work for one, therefore biased) that need to be compatible with the API, or indeed the other way around. If the existing SRA model is not drastically bad, then maybe it should be used as the basis. To me the problem is two-fold:

  1. Reads etc.: some (most) objects are immutable, and others are provisional, for example pre-publication data. In rare cases the archives must be able to suppress/replace/kill data. I can't remember cases of replaced data, but suppress and kill do happen. Shouldn't this propagate into the API?
  2. Calls etc.: incremental updates exposed as a separate (virtual) object linked to the origin. Implementations may choose to make copies instead, but that should be abstracted away from the API. Alternatively, some may prefer to flip a series of increments into decrements, but again this is an implementation detail.

TLDR:

  1. enum status {DRAFT, FINAL, SUPPRESSED}
  2. incremental VCF updates.
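
A tiny Scala sketch of the lifecycle states in the TLDR above (hypothetical names, not an SRA or GA4GH definition): only FINAL objects are treated as immutable and safe to cite, while SUPPRESSED objects remain addressable but withhold their content.

// Hypothetical lifecycle states following the TLDR above.
object Status extends Enumeration {
  val Draft, Final, Suppressed = Value
}

case class ArchivedObject[A](accession: String, status: Status.Value, payload: A) {
  // Only a finalized object is a stable citation target.
  def citable: Boolean = status == Status.Final
  // A suppressed object stays addressable, but its content is withheld.
  def visiblePayload: Option[A] =
    if (status == Status.Suppressed) None else Some(payload)
}
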
diekhans commented 9 years ago

Erik Garrison notifications@github.com writes:

We should leave the versioning for the data storage layer. For instance, see the dat project, which aims to provide revision control and distributed data synchronization for arbitrarily-large files, databases, and data streams. We do not have to solve this problem. It already has a huge amount of attention in a more general context than genomics.

Dat looks like an interesting system, and I completely agree that GA4GH should not solve the problem. That isn't the goal of this issue. GA4GH is defining APIs, and the APIs should specify semantics that allow implementers to solve the versioning problem.

The current API definition is defining file system-like semantics that will make this harder.

ekg commented 9 years ago

@diekhans, would you elaborate on this a little?

The current API definition is defining file system-like semantics that will make this harder.

What specifically is the problem?

richarddurbin commented 9 years ago

I still don't understand.

I agree with Heng that in practice what we need is to have data freezes or snapshots.

Currently we do this by making a fixed copy of the data at the freeze. Clearly that works, but it is inefficient. There can be more complex solutions that share unchanged objects. But I don't see how these change the user API. In either case I say right at the start when I open my connection that I want to access a named version of the data, then after that I just use the interface we have. I don't need to know additional uuids, or the semantics of the storage solution. It seems to me that all this discussion about immutable objects belongs in a layer that should be hidden from the user. I'd be equally happy for snapshots to be copies of the whole data set, or things maintained by other solutions at the level of whole objects or parts of objects. The API shouldn't care.

Richard


massie commented 9 years ago

We have different users here. We have end-users who want to access a simple "named" version of the data without any knowledge of UUIDs, hashes or implementation details. We also have developers who want to build interesting tools for data management, syncing, verifying, and sharing (which end-users will ultimately use).

We need APIs with both groups in mind: end-users and developers.

Currently we do this by making a fixed copy of the data at the freeze.

This is one problem that we want to solve since copying doesn't scale. Freezing petabytes of data isn't realistic. By having a content-addressable layer, we solve the issue of versioning/verification and minimize data movement between GA4GH teams that want to replicate data (and results).

This would be a great topic to discuss in person at our October meeting.

lh3 commented 9 years ago

I say right at the start when I open my connection that I want to access a _named version_ of the data, then after that I just use the interface we have.

@richarddurbin Currently ga4gh objects do not have stable names. They have IDs internal to each backend, but these IDs are not required to be stable by the schema. My understanding is that the UUIDs proposed by others would serve as stable names, though I prefer the accession system that nearly all biological databases use.

massie commented 9 years ago

@lh3 Correct! We want to standardize the way we generate addresses for data content (at some fixed point in time).

A UUID is not the right tool to use here. While UUIDs are unique, they have no connection to content. Hashes (like SHA-1, SHA-2) are not just unique but also provide guarantees about the content of the underlying data (of course, hash collisions are possible but extremely rare).

End users wouldn't need to worry about the SHA-1 for their data (that's an implementation detail). They could just use names from a bio accession system that are then translated into content addresses (e.g. SHA-1). It is also imperative that these content addresses are the same across GA4GH backend implementations.

benedictpaten commented 9 years ago

On Fri, Sep 12, 2014 at 11:56 AM, Matt Massie notifications@github.com wrote:

@lh3 Correct! We want to standardize the way we generate addresses for data content (at some fixed point in time).

A UUID is not the right tool (http://en.wikipedia.org/wiki/Universally_unique_identifier) to use here. While UUIDs are unique, they have no connection to content. Hashes (like SHA-1, SHA-2) are not just unique but also provide guarantees about the content of the underlying data (of course, hash collisions are possible but extremely rare).

I agree this would enforce the connection, but computing the hashes might be computationally expensive? Hence the compromise of using UUIDs and the convention that each such id maps to a unique, static version of the dataset. I am no expert here, and am not wedded to UUIDs.

End users wouldn't need to worry about the SHA-1 for their data (that's an implementation detail). They could just use names from a bio accession system that are then translated into content addresses (e.g. SHA-1). It is also imperative that these content addresses are the same across GA4GH backend implementations.

I like this idea. Quoting IDs (in whatever form, even hashes/UUIDs), however, is also a very precise, succinct way of referring to objects that does not require centralisation.


mbaudis commented 9 years ago

@lh3

the accession system nearly all biological databases are using

But this is not a database; it is a format recommendation for an API, and implementations, local naming schemas, etc. may differ hugely. At least for metadata, we account for these differences and suggest the use of a UUID (all objects), a localID, and an accession.

@massie

... and for immutable objects (that is, most likely raw data, reads ... ?) you can add a hashedID. Many of the metadata objects (e.g. GAIndividual) will change content over time.

lh3 commented 9 years ago

For the purpose of referencing a "named version", I don't mind whether the stable name is a UUID or an accession. Nonetheless, accessions do have some advantages, along with some downsides. For example, when I see GOC0123456789.3, I would know this is the 3rd freeze (3; if we keep the version) of a Google (GO) GACallSet (C). It is more informative and more user friendly than 123e4567-e89b-12d3-a456-426655440000. It may also be more flexible.

I realize that perhaps I have not understood the UUID proposal when I see @benedictpaten talking about computational cost. I thought UUIDs for GACallSets are computed once and then stored in the database as stable names. I know we cannot store UUIDs for very small objects, but I do not care whether they have stable names or not.

diekhans commented 9 years ago

Heng Li notifications@github.com writes:

For example, when I see GOC0123456789.3, I would know this is the 3rd freeze (3; if we keep the version) of a Google (GO) GACallSet (C). It is more informative and more user friendly than 123e4567-e89b-12d3-a456-426655440000. It may also be more flexible.

Human-readable ids do have value to human intuition, and this shouldn't be ignored. UUIDs have value to computer algorithms managing data; they are more akin to a pointer or foreign key than a name. It leads to a lot of complexity to try to combine the two into one value.

TCGA created a huge mess by trying to use barcodes, which encode metadata about the sample, as the primary, unique key. Barcodes are incredibly valuable for humans, who can scan lists of them quickly. However, it turned out that the metadata encoded in the barcodes was sometimes wrong and had to be changed, which you don't want to do with your unique identifier.

TCGA switched to using UUIDs as the primary key, with barcodes kept as a human-readable description. This fixed a lot of problems.

For the details of TCGA barcode: https://wiki.nci.nih.gov/display/TCGA/TCGA+Barcode

The GA4GH APIs should provide for both a GUID and a name.

I realize that perhaps I have not understood the UUID proposal when I see @benedictpaten talking about computational cost. I thought UUIDs for GACallSets are computed once and then stored in the database as stable names.

This comes from Matt's proposal to use SHA-1 hashes as a GUID instead of UUIDs. Either approach provides a unique 128-bit number. SHA-1s can be recomputed from the data and used to validate the data against the GUID. UUIDs are very easy and cheap to create. It's entirely possible to use both, depending on the type of object.

It is important that the API not impose implementation details on the data provider. One really wants creating a new version to be implementable with very fast copy-on-write style of algorithms. Needing to compute a hash may preclude this implementation.

Defining the API to have opaque 128-bit GUIDs allows the data providers to trade off UUIDs vs SHA1s as the implementation.
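
A Scala sketch of the opaque-GUID idea above, under the assumption that a GUID is just 16 opaque bytes to the caller; whether those bytes come from a random UUID or from a content digest (a SHA-1 truncated to 128 bits here) is the provider's choice. Nothing in this sketch is part of the GA4GH schemas.

import java.nio.ByteBuffer
import java.security.MessageDigest
import java.util.UUID

// An opaque 128-bit GUID: callers only ever see 16 bytes.
final case class Guid(bytes: Vector[Byte]) {
  require(bytes.length == 16, "a GUID is exactly 128 bits")
  override def toString: String = bytes.map(b => f"$b%02x").mkString
}

object Guid {
  // Implementation choice 1: cheap to create, no relationship to the data.
  def random(): Guid = {
    val u = UUID.randomUUID()
    val buf = ByteBuffer.allocate(16)
    buf.putLong(u.getMostSignificantBits).putLong(u.getLeastSignificantBits)
    Guid(buf.array().toVector)
  }

  // Implementation choice 2: content-derived and verifiable, but costs a pass over the data.
  def ofContent(data: Array[Byte]): Guid =
    Guid(MessageDigest.getInstance("SHA-1").digest(data).take(16).toVector)
}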

massie commented 9 years ago

One thing we need to keep in mind: composability.

UUIDs are not composable whereas hashes are. In addition, hashes help prevent duplication of data (the same data has the same hash) whereas UUIDs do not (the same data could be stored under different UUIDs).

For example, let's say that we have a set object that contains 10 "foo" objects and each "foo" has a calculated hash field. To create the hash for the set only requires a quick merge of the 10 hashes (instead of rerunning the hash over all the 10 "foo" objects data). This is the power of composability. The hash that is calculated would be the same across all GA4GH repositories.

Performance is something we need to consider, of course.

Here's a code snippet for people to play with if you like (of course, you'll want to change the FileInputStream path)...

package example

import java.io.FileInputStream
import java.security.MessageDigest

object Sha1Example {

  def main(args: Array[String]): Unit = {
    val start = System.currentTimeMillis()
    val sha1 = MessageDigest.getInstance("SHA1")
    val bytes = new Array[Byte](1024 * 1024)
    val fis = new FileInputStream("/workspace/data/ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz")
    // stream the file through the digest in 1 MB chunks
    Stream.continually(fis.read(bytes)).takeWhile(-1 !=).foreach(sha1.update(bytes, 0, _))
    val hashBytes = sha1.digest()
    fis.close()

    // convert the hash bytes into a more human-readable form...
    val sb = new StringBuffer()
    hashBytes.foreach { a => sb.append(Integer.toString((a & 0xff) + 0x100, 16).substring(1)) }
    val end = System.currentTimeMillis()
    println(sb.toString)
    println("%d ms".format(end - start))
  }
}

Output on my MacBook:

1a50d065799c4d32637dbe11eb66e5f1e8b35b89
9570 ms

On my MacBook Pro, I was able to hash a ~2GB file at about ~180MB/s (single-threaded, single flash disk). This is just a very rough example and shouldn't be seen as a real benchmark. I just wanted to explain with working code since it's a language we all understand. Note: I also confirmed the hash using the shasum commandline utility.

Since hashes are composable, it's very easy to distribute the processing for performance too. Keep in mind, we will never have to recalculate a hash of data. It is calculated once, stored and composed for sets of objects.

delagoya commented 9 years ago

Caveat emptor: hashes composed of other hashes are highly dependent on the order of supplied component hashes. There will be no guarantee that a particular data store will implement the hash ordering in exactly the same way as others.

This may seem trivial, but I've met enough sorting problems in bioinformatics in my time that it should not be treated as a trivial concern.
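
One common way around this, sketched below in Scala under the assumption that each member object already carries a hex SHA-1 digest: define the set hash over the sorted member hashes, so the composition is canonical regardless of the order a particular data store supplies them in. This is only an illustration, not a GA4GH-specified scheme.

import java.security.MessageDigest

object SetHash {
  def compose(memberHashes: Seq[String]): String = {
    val sha1 = MessageDigest.getInstance("SHA-1")
    // Sorting first makes the result independent of the order members were supplied in.
    memberHashes.sorted.foreach(h => sha1.update(h.getBytes("UTF-8")))
    sha1.digest().map(b => f"$b%02x").mkString
  }
}

// compose(Seq(hashA, hashB)) and compose(Seq(hashB, hashA)) yield the same set hash.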
