ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Can a callset be in multiple variant sets? #583

Open david4096 opened 8 years ago

david4096 commented 8 years ago

According to the schema, a callset can be a member of multiple variant sets, yet the reference server is currently underspecified for this case. Is there an example of when a callset is in multiple variant sets?

  /** The IDs of the variant sets this call set has calls in. */
  array<string> variantSetIds = [];

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variants.avdl#L90

SearchCallSetsRequest requires a single variant set ID to be specified, which makes the above semantics even stranger. If a callset can be a member of multiple variant sets, why do we specify a single variant set ID when performing a search?

record SearchCallSetsRequest {
  /**
  The VariantSet to search.
  */
  string variantSetId;

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variantmethods.avdl#L158

mbaudis commented 8 years ago

@david4096 Logically, a Callset should refer to only one Variantset, since it can be thought of as an ordered list whose length equals the number of variants described. For several Variantsets to be referenced from the same Callset, they would have to be identical.

jacmarjorie commented 8 years ago

Our group has this use case. Without starting a conversation on what the correct definition of a VariantSet is/should be, the CallSet variantSetId list makes sense if you follow exactly what the definitions suggest:

VariantSet definition:

A VariantSet is a collection of variants and variant calls intended to be analyzed together.

CallSet definition:

A CallSet is a collection of calls that were generated by the same analysis of the same sample.

Use case: compare CallSetX to other CallSets belonging to VariantSetA, and compare CallSetX to other CallSets belonging to VariantSetB, but not to all CallSets from both VariantSetA and VariantSetB, since by definition VariantSetA and VariantSetB are not meant to be analyzed together. The CallSet can belong to both VariantSetA and VariantSetB in order to avoid duplicating it.

By this reasoning, I would say that restricting a CallSet to a single VariantSet in the reference server is a bug.
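
To make the use case concrete, here is a minimal Python sketch. It is not the reference server's implementation: the CallSet class and the search_call_sets helper are hypothetical, and only the variantSetIds field name comes from the schema. It shows one call set shared by two variant sets without being duplicated, while a search by a single variant set ID still finds it in each.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CallSet:
    """Hypothetical in-memory stand-in for the schema's CallSet record."""
    id: str
    # Mirrors the schema field: IDs of the variant sets this call set has calls in.
    variantSetIds: List[str] = field(default_factory=list)


def search_call_sets(call_sets, variant_set_id):
    """Return IDs of call sets that list variant_set_id (hypothetical helper)."""
    return [cs.id for cs in call_sets if variant_set_id in cs.variantSetIds]


call_sets = [
    CallSet("CallSetX", ["VariantSetA", "VariantSetB"]),  # shared, stored once
    CallSet("CallSetY", ["VariantSetA"]),
    CallSet("CallSetZ", ["VariantSetB"]),
]

in_a = search_call_sets(call_sets, "VariantSetA")
in_b = search_call_sets(call_sets, "VariantSetB")
print(in_a)  # ['CallSetX', 'CallSetY']
print(in_b)  # ['CallSetX', 'CallSetZ']
```

CallSetX appears in both results, but CallSetY and CallSetZ are never returned together, which matches the requirement that VariantSetA and VariantSetB are not analyzed as one group.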

jacmarjorie commented 8 years ago

Also,

record SearchCallSetsRequest {
  /**
  The VariantSet to search.
  */
  string variantSetId;

If the above is the agreed-upon definition, then variantSetId in SearchCallSetsRequest should be a variantSetIds list, not a string.
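
A minimal Python sketch of what a list-valued search could look like. Everything here is hypothetical except the variantSetIds name. In particular, the thread does not say whether a list should match call sets in any or in all of the given variant sets, so this sketch assumes "any".

```python
def search_call_sets(call_sets, variant_set_ids):
    """Hypothetical search: return IDs of call sets in ANY requested variant set."""
    wanted = set(variant_set_ids)
    # A call set matches if it shares at least one variant set ID with the request.
    return [cs["id"] for cs in call_sets if wanted & set(cs["variantSetIds"])]


call_sets = [
    {"id": "CallSetX", "variantSetIds": ["VariantSetA", "VariantSetB"]},
    {"id": "CallSetY", "variantSetIds": ["VariantSetA"]},
    {"id": "CallSetZ", "variantSetIds": ["VariantSetC"]},
]

single = search_call_sets(call_sets, ["VariantSetB"])
both = search_call_sets(call_sets, ["VariantSetA", "VariantSetB"])
print(single)  # ['CallSetX']
print(both)    # ['CallSetX', 'CallSetY']
```

A one-element list reproduces the current single-ID behaviour, so the change would be backward compatible in spirit under this assumption.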

diekhans commented 8 years ago

@jacmarjorie

I believe your use case is what was imagined. However, there was a multi-month discussion, never resolved, on whether this should be supported:

https://github.com/ga4gh/schemas/pull/395

We would love contributions to the documentation on variants, including documenting use cases justifying the design:

https://github.com/ga4gh/schemas/issues/408 https://github.com/ga4gh/schemas/issues/379

The variants API is suffering because no one with a deep understanding of variants and VCF analysis has taken ownership of finishing the work.

Mark

diekhans commented 8 years ago

related to https://github.com/ga4gh/schemas/pull/395 and https://github.com/ga4gh/schemas/pull/412

david4096 commented 7 years ago

https://github.com/ga4gh/ga4gh-schemas/blob/master/src/main/proto/ga4gh/variants.proto#L75

Callsets are still allowed to be in multiple variant sets. We should remove this. The biosample ID tag on callsets is what allows you to compare calls in multiple variant sets.
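
A minimal Python sketch of that alternative, under stated assumptions: each call set belongs to exactly one variant set, and the record layout and field names (variant_set_id, biosample_id) are illustrative rather than taken from the proto file. Call sets from different variant sets are matched up by their shared biosample ID.

```python
from collections import defaultdict

# Hypothetical records: one variant set per call set, plus a biosample ID.
call_sets = [
    {"id": "cs1", "variant_set_id": "VariantSetA", "biosample_id": "sample1"},
    {"id": "cs2", "variant_set_id": "VariantSetB", "biosample_id": "sample1"},
    {"id": "cs3", "variant_set_id": "VariantSetB", "biosample_id": "sample2"},
]


def group_by_biosample(call_sets):
    """Map each biosample ID to its (variant set ID, call set ID) pairs."""
    groups = defaultdict(list)
    for cs in call_sets:
        groups[cs["biosample_id"]].append((cs["variant_set_id"], cs["id"]))
    return dict(groups)


by_sample = group_by_biosample(call_sets)
print(by_sample["sample1"])  # [('VariantSetA', 'cs1'), ('VariantSetB', 'cs2')]
```

The trade-off is that the same sample's calls are stored once per variant set, which is the duplication the multi-membership design was meant to avoid.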