VariantAnnotationSet dataset scope not defined

diekhans commented 8 years ago

VariantAnnotationSet objects don't explicitly have a dataset id and their relationship to datasets is not documented. This results in the assumption that a VariantAnnotationSet is in the same dataset as the VariantSet it is annotating.

If this restriction is intended, it needs to be documented.

However, there are use cases for VariantAnnotationSets being in different datasets than the VariantSets they annotate. If the group doing the annotation is independent of the group producing the variants, they belong in separate datasets.

For example, if I were doing a new annotation of 1K Genomes, I would have to copy the 1K Genomes data into my dataset to publish it via GA4GH. This is a big disadvantage when the dataset is the unit of billing for storage.
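To make the use case concrete, here is a minimal sketch of an annotation set that carries its own dataset id and refers to a variant set owned by another dataset. The record types, field names, and ids are hypothetical and simplified; they are not taken from the GA4GH schemas.

```python
from dataclasses import dataclass


# Hypothetical, simplified records; field names are illustrative only
# and are not the GA4GH schema definitions.
@dataclass(frozen=True)
class VariantSet:
    id: str
    dataset_id: str


@dataclass(frozen=True)
class VariantAnnotationSet:
    id: str
    dataset_id: str      # explicit owner: the annotator's dataset
    variant_set_id: str  # reference to the annotated VariantSet


def is_cross_dataset(annotation_set, variant_sets_by_id):
    """Return True when the annotation set and the variant set it
    annotates are owned by different datasets."""
    target = variant_sets_by_id[annotation_set.variant_set_id]
    return annotation_set.dataset_id != target.dataset_id


# The variants stay in their original dataset; only the annotations
# live in the annotator's dataset, so nothing has to be copied.
kg = VariantSet(id="vs-1kg", dataset_id="ds-1kg")
mine = VariantAnnotationSet(id="vas-1", dataset_id="ds-mine", variant_set_id=kg.id)
cross = is_cross_dataset(mine, {kg.id: kg})
```

With an explicit `dataset_id` on the annotation set, the storage owner of the annotations is unambiguous even when the annotated variants belong to someone else.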

jeromekelleher commented 8 years ago

Unless you're thinking of a multi-tenant system @diekhans, I don't think this makes any difference. The variant set must be locally hosted, or the VariantSetId is out of scope and meaningless. If the VariantSet is locally hosted, then a shallow copy is trivial to do.

I would have thought that an "AnnotationSet is within a given VariantSet" relationship is a perfectly reasonable way to model it. In your example of making annotations of a remote copy of 1000 Genomes, why not create a shallow local version of the 1000 Genomes VariantSet, and put your annotations in that? In the reference server, we could create a remote backing URL for the VariantSet that is a GA4GH endpoint. That's essentially free bandwidth-wise unless you want to pull over the variants (in which case, you'll be paying anyway).

There are definitely advantages to having a tree of object ownerships/containment rather than a graph. I think there should be an overwhelmingly good and practical reason for breaking the tree, or we'll be kicking ourselves later when we're trying to do access permissions.
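As a rough illustration of why a containment tree keeps access permissions simple, the sketch below resolves the owning dataset of any object by walking up a single parent chain. The ids and the `PARENT` mapping are invented for the example; this is not how the reference server implements permissions.

```python
# Hypothetical containment tree: child id -> parent id. Ids are invented.
PARENT = {
    "vas-1": "vs-1kg",   # annotation set contained in a variant set
    "vs-1kg": "ds-1kg",  # variant set contained in a dataset
    "ds-1kg": None,      # a dataset is the root of its tree
}


def owning_dataset(object_id, parent=PARENT):
    """Walk up the tree until the root, which is the owning dataset."""
    while parent[object_id] is not None:
        object_id = parent[object_id]
    return object_id


def can_read(user_datasets, object_id):
    """A user may read an object if they may read its owning dataset."""
    return owning_dataset(object_id) in user_datasets
```

Because every object has exactly one parent, one dataset-level check covers it. With a graph of cross-dataset references, a check on the annotation set alone would not say whether the user may also read the variant set it points to.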

diekhans commented 8 years ago

My impression is that variant annotations have a reference relationship to variants, not a containment relationship. It's a hugely restrictive system if you can only reference objects within a single dataset. The API already relies on references outside of a dataset, such as reference genomes and feature annotations.

I just don't see how we can have a system that has explicit object linkage and not have cross-dataset references.

We don't have shallow copy semantics in datasets, even if the underlying storage allows it. A copy becomes a different object with a different id, breaking other references to those objects.
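A small sketch of the id problem, assuming a hypothetical in-memory store in which every copy is registered as a new object (the store layout, ids, and `copy_variant_set` helper are invented for illustration):

```python
import uuid

# Hypothetical in-memory store; ids and layout are illustrative only.
variant_sets = {"vs-1kg": {"dataset_id": "ds-1kg"}}
# Annotation set id -> id of the variant set it annotates.
annotation_refs = {"vas-other-group": "vs-1kg"}


def copy_variant_set(source_id, target_dataset_id):
    """Copy a variant set into another dataset. The copy is a new
    object, so it receives a fresh id."""
    if source_id not in variant_sets:
        raise KeyError(source_id)
    new_id = "vs-" + uuid.uuid4().hex[:8]
    variant_sets[new_id] = {"dataset_id": target_dataset_id}
    return new_id


copy_id = copy_variant_set("vs-1kg", "ds-mine")

# Existing references still name the original id, so none of them
# resolve to the copy.
refs_to_copy = [a for a, vs in annotation_refs.items() if vs == copy_id]
```

Anything that already referenced `vs-1kg` keeps pointing at the original, so annotations published against the copy are not visible to clients that follow the original id, and vice versa.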

jeromekelleher commented 8 years ago

Fair enough @diekhans. Let's see what the VA folks think.

ga4gh / ga4gh-schemas

VariantAnnotationSet dataset scope not defined #615