ga4gh-beacon / specification

GA4GH Beacon specification.
Apache License 2.0

Add support to ask for more types of variants (more complex InDels and duplications) #20

Closed mfiume closed 7 years ago

mfiume commented 8 years ago

Proposal by Michael Baudis, please elaborate if insufficient.

For example: INS[ATGC]+ DEL[0-9]* DUP

Discussion on interpretation and use cases was already started in this document: https://docs.google.com/document/d/1PfSt0o0m59BRs92PtyDcP31fUl8QgMYSTiclHAXCG0s/edit?usp=sharing

mcupak commented 8 years ago

Comments from the Google Doc for reference:

Miro Cupak (4:03 PM Mar 28): The description [of alternateBases] refers to the VCF spec. Is there ambiguity?
Michael Baudis (4:11 PM Mar 28): DEL (or <DEL>); DUP (or <DUP>) ...
Heinz Stockinger (8:02 AM Mar 29): ALT field looks ok. We might even consider to add the INFO field such as "ALT;INFO" (i.e use ";" to separate ALT and INFO). Then we can have examples such as: "<DUP>;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500"
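
A minimal sketch of how a client might split the combined "ALT;INFO" string suggested above. The function name `parse_alt_info` is hypothetical and not part of any Beacon specification; flag-style INFO keys without "=" are stored with an empty value, which is an assumption.

```python
from typing import Dict, Tuple


def parse_alt_info(value: str) -> Tuple[str, Dict[str, str]]:
    """Split an "ALT;INFO" string into the ALT token and an INFO mapping."""
    alt, *info_fields = value.split(";")
    info: Dict[str, str] = {}
    for field in info_fields:
        if not field:
            continue  # tolerate a trailing ";"
        key, sep, val = field.partition("=")
        # Assumption: flag-style keys (no "=") map to an empty string.
        info[key] = val if sep else ""
    return alt, info


alt, info = parse_alt_info(
    "<DUP>;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500"
)
```

Values are kept as strings here; converting END or SVLEN to integers, or CIPOS to a pair, would be a separate step.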

heinzstockinger commented 8 years ago

+1 to include that in version 0.4

antbro commented 8 years ago

+1

jrambla commented 8 years ago

+1

ddtxra commented 8 years ago

+1

sdelatorrep commented 8 years ago

+1

mbaudis commented 8 years ago

Obviously +1 on this. Additional elaboration:

The CNV/CNA space (basic description of regional copy number imbalances vs. standard reference genomes) is a supremely suitable first extension of the current variant representation schema:

I am not overly concerned about specific privacy issues. Any additional data point can, in principle, provide a point of attack for re-identification attempts. However, the number of rare CNVs per sample is comparatively low; it is not trivial to query base-specific CNV boundaries (and those may frequently be approximate); and somatic CNV/CNA (e.g., cancer) are currently not considered critical (see e.g. ICGC, where computed copy number is fully open).

Anyway, the evaluation of possible re-identification issues is deferred to the implementer of the Beacon resource.

The only open issues right now are IMHO specifics, e.g. how overlap queries & imprecise boundaries are defined & implemented, as well as how to query/return CN levels (i.e. granular, integer options beyond DUP/DEL).

heinzstockinger commented 8 years ago

Hello Michael, Could you please provide an example of what you mean by "... how to query/return CN levels (i.e. granular, integer options beyond DUP/DEL)." Thanks, Heinz

mbaudis commented 8 years ago

@heinzstockinger Copy number variations can have different quantitative levels. Based on a 2n allele count, deletions can lead to 1n or 0n (homozygous). For duplications there is no upper limit; in cancer genomes, amplicons with hundreds of repeats of the same sequence can be found (sometimes including one or more complete CDRs; an example here is MYCN).

There are reasons to query specific copy number levels, e.g. to find only homozygous deletions.

The VCF file format allows this information to be provided through FORMAT => CN ("Copy number genotype for imprecise events"); see pp. 13/14 of VCF 4.3.

Calling numerically correct copy numbers is difficult (especially in cancer w/ mixed cellularity etc.), and frequently data contains just DUP/DEL information instead of integer count values, with the possible addition of HOMODEL (i.e. 0n) and AMP (i.e. passing an arbitrary threshold, e.g. ≥ 4).
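
To illustrate the levels described above, here is a rough sketch that maps an integer copy number to the qualitative labels. It assumes a 2n baseline, reads the AMP threshold as an absolute copy number of 4 or more, and uses a hypothetical "REF" label for the unchanged state; none of these choices is fixed by the discussion.

```python
def classify_copy_number(cn: int, baseline: int = 2, amp_threshold: int = 4) -> str:
    """Map an integer copy number to a qualitative call (illustrative only)."""
    if cn < 0:
        raise ValueError("copy number cannot be negative")
    if cn == 0:
        return "HOMODEL"  # 0n, homozygous deletion
    if cn < baseline:
        return "DEL"      # e.g. 1n on a 2n baseline
    if cn >= amp_threshold:
        return "AMP"      # passes the (arbitrary) amplification threshold
    if cn > baseline:
        return "DUP"
    return "REF"          # hypothetical label for "no change"
```

A resource that only stores DUP/DEL could collapse HOMODEL into DEL and AMP into DUP when answering qualitative queries.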

While there are clearly use cases for this kind of granularity, implementation adds some complexity which only makes sense when there are repositories actually providing this type of data, and not just the theoretical urge to do so (e.g. while we work on this for arraymap.org, integer CN calls are not implemented yet).

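Conclusion:

* At least for 0.4 implement qualitative DUP/DEL calls.
* Keep in mind a future extensibility towards integer CN thresholding.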

sduvaud commented 8 years ago

In order to be consistent, we should have:

INS[ATGC]+ DEL[0-9]* DUP[0-9]*

heinzstockinger commented 8 years ago

+1

antbro commented 8 years ago

I don't understand the issue. If I'm not the only one, perhaps chat it through in a Beacon TC? Tony

heinzstockinger commented 8 years ago

It's just a small change with respect to the proposal at the top of the page, i.e., we had:

INS[ATGC]+ DEL[0-9]* DUP

we now propose to update it to:

INS[ATGC]+ DEL[0-9]* DUP[0-9]*

i.e. only adding [0-9]* to DUP, so both DUP and DEL have the possibility for integer values (as is already the case in the current v0.3 and earlier specifications).
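
As a sketch of what the updated proposal would accept, the three expressions can be read as alternatives of a single regular expression. The names `ALT_PATTERN` and `is_valid_alt` are hypothetical, and plain base strings (which the existing field also allows) are deliberately left out.

```python
import re

# Assumption: the three expressions in the proposal are alternatives.
ALT_PATTERN = re.compile(r"INS[ATGC]+|DEL[0-9]*|DUP[0-9]*")


def is_valid_alt(value: str) -> bool:
    """Return True if the whole string matches one of the proposed forms."""
    return ALT_PATTERN.fullmatch(value) is not None
```

With this reading, "DEL", "DEL12", "DUP" and "DUP3" are accepted, while a bare "INS" is rejected because at least one inserted base is required.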

antbro commented 8 years ago

Aha - thanks!

+1

T

antbro commented 8 years ago

Hi All

I'd like to reflect on some basics about 'what is a Beacon', and hence thereafter decide on what to turn it into...

Currently Beacon asks about 'a single base allele'. Originally this meant (1) "any subject-specific record where the query allele is present" (whether heterozygous or homozygous)

But we now allow people to use Beacon to ask (2) "any database record referring to the query allele" (could be population frequency data, protein structure or pathogenicity consequences, animal model correlates, etc)

We decided in Hinxton last week that both (1) and (2) are acceptable

Regarding queries that focus on records about the properties of human subjects (as opposed to the properties of variants): we have never tried to let queries distinguish between standard genotypes (homozygous or heterozygous presence of the query allele). If we did, this could quickly expand into asking about zygosity generally (e.g., hemizygous, Y chm markers or X markers in females, polyploidy, etc.). That would open a way into a general solution for genome counts of an allele (which could be fractional, ranges, >, <, etc.), which then, if opened up to variants other than single base changes, provides a way to handle copy number variation.

In parallel we'd also want a way to specify a query on a local haplotype (i.e., one chm rather than one genome level)

SO I PROPOSE WE ENABLE QUERIES THAT SPECIFY CLEANLY AND SIMPLY

Beyond this, I'd like to see the Beacon query language able to ask whether data/records exist that relate to a genome region defined by a start and stop base (which could be one and the same), and how those data/annotations match the target region (exact|exceed|begin_between|end_between|begin_and_end_between|only_begin_between|only_end_between|begin_at_start|end_at_stop)
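
One possible reading of a few of these match modes, as a sketch. The mode names come from the list above, but their exact semantics are not defined in this thread; the inclusive-coordinate interpretation below and the names `Region` and `matches` are assumptions, and the remaining modes are left unimplemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int  # assumed inclusive
    stop: int   # assumed inclusive


def matches(annotation: Region, query: Region, mode: str) -> bool:
    """Check how an annotated region relates to the query region."""
    begin_inside = query.start <= annotation.start <= query.stop
    end_inside = query.start <= annotation.stop <= query.stop
    if mode == "exact":
        return annotation == query
    if mode == "begin_between":
        return begin_inside
    if mode == "end_between":
        return end_inside
    if mode == "begin_and_end_between":
        return begin_inside and end_inside
    raise ValueError(f"unsupported match mode: {mode}")
```

Modes such as only_begin_between or exceed would need an agreed definition before they could be added in the same style.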

Plus a search option for specific sequence strings

Then one could easily imagine combinations of the above, eg:

'SNP allele' located 'exact' at 'Chm:2/start_base:6543' and 'Chm:2/stop_base:6543'

'SNP allele' located 'exact' at 'Chm:2/start_base:6543' and 'Chm:2/stop_base:6543' with 'Count_In_Genome > 0.5'

'TTAGGAGG' located 'begin_between' 'Chm:2/start_base:6543' and 'Chm:2/stop_base:6553'

'Copy number variant allele X' with 'Count_In_Genome > 4'

'Copy number variant allele X' with 'Count_In_haplotype > 2' where haplotype at 'Chm:2/start_base:5,000' and 'Chm:2/stop_base:100,000'
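
The count conditions in these examples could be parsed into a simple predicate, roughly as follows. This is only an illustration: the function name, the supported operators and the treatment of thresholds as floats are assumptions, not a proposed query syntax.

```python
import operator
import re
from typing import Callable, Tuple

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}
_PREDICATE = re.compile(r"\s*(\w+)\s*(>=|<=|>|<|=)\s*([0-9]+(?:\.[0-9]+)?)\s*")


def parse_count_predicate(text: str) -> Tuple[str, Callable[[float], bool]]:
    """Turn e.g. 'Count_In_Genome > 4' into (field name, test function)."""
    m = _PREDICATE.fullmatch(text)
    if m is None:
        raise ValueError(f"cannot parse predicate: {text!r}")
    compare = _OPS[m.group(2)]
    threshold = float(m.group(3))
    return m.group(1), lambda count: compare(count, threshold)
```

Fractional thresholds such as 'Count_In_Genome > 0.5' are handled by the same pattern.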

Thoughts...? Tony

Michael Baudis wrote:

Conclusion:

* At least for 0.4 implement qualitative DUP/DEL calls.
* Keep in mind a future extensibility towards integer CN thresholding.

mbaudis commented 8 years ago

@antbro Maybe you move this to a separate doc which can be edited/commented on? I think it would be best to have the specific use cases listed & discussed, which is tricky with this format here on Github.

(overall your examples are in my line of thinking)

mbaudis commented 7 years ago

For info linked here, a write-up of options for range queries, using a VCF:INFO approach (but pointing to alternative use of other attributes):

https://docs.google.com/document/d/1uePLlLMl0FzxZxDrsF9IxsC2nYvZ84029fzUD1ULNWI/edit#

jrambla commented 7 years ago

I was under the impression that in version 0.4 we would implement complex variants like the ones in the document https://docs.google.com/document/d/1uePLlLMl0FzxZxDrsF9IxsC2nYvZ84029fzUD1ULNWI Am I wrong?

mcupak commented 7 years ago

I believe the decision made on today's call was to implement the first step as described in https://github.com/ga4gh/beacon-team/issues/20#issuecomment-230438376 and put off the changes proposed in the document above to a later time.

heinzstockinger commented 7 years ago

There would just be the additional field to add: alternateBasesInfo. It's an optional parameter so it is completely backwards compatible.

Details are in the following pull request: https://github.com/ga4gh/beacon-team/pull/65

mcupak commented 7 years ago

To summarize related decisions made during the workshop yesterday:

We're going with https://github.com/ga4gh/beacon-team/pull/94 over https://github.com/ga4gh/beacon-team/pull/95 as the base for implementation.

mbaudis commented 7 years ago

@mcupak I've added a comment to the DUP,DEL... PR https://github.com/ga4gh/beacon-team/blob/develop-proto-structural_and_ranges/src/main/proto/ga4gh/beacon.proto#L50

Actually, on re-reading VCF the reference value can stay "required", since values of A,C,G,T,N are permitted (this is conceptually slightly different from a "." as recommended for a missing value, but practically the same).

Is this sufficiently verbose?

  // Reference bases for this variant (starting from `start`).
  //
  // Accepted values: see the REF field in VCF 4.2 specification
  // (https://samtools.github.io/hts-specs/VCFv4.2.pdf).
  // When querying for variants without specific base alterations (e.g.
  // imprecise structural variants with separate variant_type as well as
  // start_min & end_min ... parameters), the use of a single "N" value is
  // recommended.
  string reference_bases = 8;
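
For illustration, a client-side check along the lines of this comment might look like the sketch below. The helper `normalize_reference_bases` is hypothetical; falling back to a single "N" when no bases are given follows the recommendation in the comment, and the accepted alphabet follows the VCF REF field (A, C, G, T, N, case insensitive).

```python
import re
from typing import Optional

_REF_BASES = re.compile(r"[ACGTN]+", re.IGNORECASE)


def normalize_reference_bases(value: Optional[str]) -> str:
    """Validate reference bases; use a single "N" when none are specified."""
    if not value:
        # Assumption: an imprecise structural-variant query without bases.
        return "N"
    if _REF_BASES.fullmatch(value) is None:
        raise ValueError(f"invalid reference bases: {value!r}")
    return value.upper()
```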

mbaudis commented 7 years ago

Closing since implemented in develop-proto branch.