chop-dbhi / varify

Clinical DNA Sequencing Analysis and Data Warehouse
BSD 2-Clause "Simplified" License

Implement Versioning of Annotation Data #161

Open mitalia opened 10 years ago

mitalia commented 10 years ago

In a clinical lab, one of the challenges is keeping track of when and how annotation data changes. The worst scenario is running a query one day and not knowing why the same query returns different results the next day. In order to balance the need to update annotation data with the need for some level of stability, I propose the following approach (elements of which we already have):

1) Public annotation data should be rolled up into a single, easily-referenced Varify annotation version. In this way, analyses can simply reference the annotation dataset that was used with a simple version number and not have to track each and every data source independently. This version and what it contains should be easily available from within the Varify application.

2) Public data sources should be refreshed on a schedule that we can communicate to end-users (e.g. the last week of the month, or quarterly). Not every annotation source needs to change from version to version, but the master Varify annotation version should increment regardless.

3) One very real scenario that crops up is the following: a patient is tested and no variants of significance are found. Several years later, the patient gets another genetic test that identifies a mutation causative for disease. The question then arises as to why this variant was "missed" the first time. In order to rule out error, it is often necessary to view the public data to determine whether or not the information on the variant at the original time of analysis was sufficient. For example, perhaps the allele frequency in a public resource like 1000 Genomes was different and did not meet a filter criterion. While this scenario is not common, it does happen. To allow for such an audit, I propose that when Varify updates the master database of annotations, the Variant table be serialized to a data structure (JSON?) that can be compressed and archived. Basically, each unique variant would be the key to a data structure representing all public data associated with it. This serialized version could be processed independently, outside the application, to support the audit.
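As a rough sketch of what points 1 and 3 could look like, assuming a simple manifest of per-source versions and an iterable of (variant key, annotations) pairs — the names, fields, and source versions below are illustrative only, not existing Varify code:

```python
import gzip
import json
from datetime import date

# Hypothetical manifest for one master Varify annotation version (point 1).
# Source names and versions are placeholders.
manifest = {
    "varify_annotation_version": 7,
    "released": date.today().isoformat(),
    "sources": {
        "1000g": "phase3-v5",
        "clinvar": "2014-06",
        "dbsnp": "141",
    },
}

def archive_variants(variants, path):
    """Serialize public annotation data keyed by unique variant and gzip it (point 3).

    `variants` is assumed to be an iterable of (variant_key, annotations) pairs,
    e.g. ("chr1:12345:A>G", {"af_1000g": 0.012, ...}).
    """
    snapshot = {
        "manifest": manifest,
        "variants": {key: annotations for key, annotations in variants},
    }
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(snapshot, fh)

def load_archive(path):
    """Read an archive back, outside the application, to support an audit."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

The archive ties the per-variant snapshot to the manifest that produced it, so an auditor can see both what the public sources reported and which versions were in play at the time.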

Note that there is one source of change over time that is not addressed by the scenarios outlined above: internal cohort frequencies. Cohort frequencies will be addressed in another ticket.

davecap commented 10 years ago

We'd love to work on this problem with you. We're aiming to have our datasets updated at a frequency that matches the source data (so sometimes daily, weekly, monthly, etc.). Each of our datasets is versioned following the semantic versioning system (0.0.1, 0.0.2, ...), and the data within a given version never changes. The idea is that you can easily swap different versions of the same dataset in and out.

We've also been thinking about a way to "roll up" a set of datasets into a package that can be used by different applications (kind of like point 1 above, but only for data in SolveBio). Of course, we won't be storing your queries for privacy reasons, so keeping audit trails is something you will want to do internally.

mitalia commented 10 years ago

@davecap Thanks, I was thinking of your versioning when writing the ticket. I have to think about what we'd want from data we get from you guys. Part of me wants the "nuclear bunker" option of having it all archived in a tarball locally in case you pivot into a social media company or something :). On the other hand, it sure would be nice not to have to maintain all these local archives long-term.

davecap commented 10 years ago

A local copy is probably the way to go for now (i.e. full query result logging), but we are open to working with you guys on that as well. One of the features on our roadmap is an encrypted query audit trail for cases exactly like this. The idea is that we would encrypt and store all or some queries and their result sets and allow you to export them whenever you want.
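A minimal sketch of what local "full query result logging" might look like, assuming queries and results are JSON-serializable; none of these names are an existing Varify or SolveBio API:

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone

def log_query(query, results, annotation_version, log_path):
    """Append one gzipped JSON line per query, tied to the annotation version in use."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "annotation_version": annotation_version,
        "query": query,
        "results": results,
        # Checksum makes later corruption or tampering detectable during an audit.
        "results_sha256": hashlib.sha256(
            json.dumps(results, sort_keys=True).encode("utf-8")
        ).hexdigest(),
    }
    with gzip.open(log_path, "at", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```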

davecap commented 10 years ago

We're going to make a simple "django-solvebio" package that will allow you to control local aliases (stored in your DB) of versioned datasets on SolveBio. For example, you can say that the local alias for "ClinVar" will hit our 1.0.0 version. That should let you make changes without re-deploying Varify.
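Since the django-solvebio package described above did not exist yet, here is only a sketch of what such a local alias table might look like on the Varify side; the model and field names are assumptions:

```python
from django.db import models

class DatasetAlias(models.Model):
    # Name the application refers to, e.g. "ClinVar".
    alias = models.CharField(max_length=100, unique=True)
    # Fully qualified, versioned dataset on SolveBio, e.g. "ClinVar/1.0.0".
    remote_dataset = models.CharField(max_length=200)

    def __str__(self):
        return "%s -> %s" % (self.alias, self.remote_dataset)
```

Moving "ClinVar" from 1.0.0 to a newer version would then be a one-row update in the database rather than a redeploy of Varify.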