Closed mfiume closed 7 years ago
Comments from the Google Doc for reference:
Miro Cupak (4:03 PM Mar 28): The description [of alternateBases] refers to the VCF spec. Is there ambiguity?
Michael Baudis (4:11 PM Mar 28): DEL (or <DEL>); DUP (or <DUP>) ...
Heinz Stockinger (8:02 AM Mar 29): ALT field looks ok. We might even consider adding the INFO field, such as "ALT;INFO" (i.e. use ";" to separate ALT and INFO). Then we can have examples such as: "<DUP>;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500"
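To make the "ALT;INFO" idea concrete, here is a minimal sketch of how such a combined string could be split. The function name `split_alt_info` is hypothetical and not part of any Beacon specification; values are kept as strings, and comma-separated values (as in CIPOS) are assumed to become lists.

```python
from typing import Dict, Tuple


def split_alt_info(value: str) -> Tuple[str, Dict[str, object]]:
    """Split a combined 'ALT;INFO' string into the ALT part and an INFO dict.

    Hypothetical helper: the first ';' is assumed to separate ALT from INFO,
    and further ';' characters separate the individual INFO entries.
    """
    alt, _, info = value.partition(";")
    fields: Dict[str, object] = {}
    for item in info.split(";"):
        if not item:
            continue  # no INFO part, or a stray ';'
        key, sep, val = item.partition("=")
        if not sep:
            fields[key] = True            # flag-style key without a value
        elif "," in val:
            fields[key] = val.split(",")  # e.g. CIPOS=-500,500
        else:
            fields[key] = val
    return alt, fields


alt, info = split_alt_info(
    "<DUP>;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500"
)
```

A plain ALT value without any ";" yields an empty INFO dict, so existing queries would not need to change under this reading.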
+1 to include that in version 0.4
+1
+1
+1
+1
Obviously +1 on this. Additional elaboration:
The CNV/CNA space (basic description of regional copy number imbalances vs. standard reference genomes) is a supremely suitable first extension of the current variant representation schema:
I am not overly concerned about specific privacy issues. Obviously, any additional data point can in principle provide a point of attack for re-identification attempts. However, the number of rare CNVs per sample is comparatively low; it is not trivial to query base-specific CNV boundaries (and those may frequently be approximate); and somatic CNVs/CNAs (e.g., in cancer) are currently not considered critical (see e.g. ICGC, where computed copy number is fully open).
Anyway, the evaluation of possible re-identification issues is deferred to the implementer of the Beacon resource.
The only open issues right now are IMHO specifics, e.g. how overlap queries & imprecise boundaries are defined & implemented, as well as how to query/return CN levels (i.e. granular, integer options beyond DUP/DEL).
Hello Michael, Could you please provide an example of what you mean by "... how to query/return CN levels (i.e. granular, integer options beyond DUP/DEL)." Thanks, Heinz
@heinzstockinger Copy number variations can have different quantitative levels. Based on a 2n allele count, deletions can lead to 1n or 0n (homozygous). For duplications there is no upper limit; in cancer genomes, amplicons with hundreds of repeats of the same sequence can be found (sometimes including one or more complete CDRs; an example here is MYCN).
There are reasons to query specific copy number levels, e.g. to find only homozygous deletions.
The VCF file format allows this information to be provided through FORMAT => CN ("Copy number genotype for imprecise events"); see pp. 13/14 of VCF 4.3.
Calling numerically correct copy numbers is difficult (especially in cancer, with mixed cellularity etc.), and data frequently contains just DUP/DEL information instead of integer count values, with the possible addition of HOMODEL (i.e. 0n) and AMP (i.e. passing an arbitrary threshold, e.g. ≥ 4).
While there are clearly use cases for this kind of granularity, implementation adds some complexity, which only makes sense when there are repositories actually providing this type of data and not only the theoretical urge to do so (e.g. while we work on this for arraymap.org, integer CN calls are not implemented yet).
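As an illustration of the levels described above, the following sketch maps an integer copy number to qualitative labels. The function name and the label list are assumptions for illustration only; the 2n baseline and the AMP threshold of 4 are taken from the examples in this comment, and the threshold is explicitly arbitrary.

```python
from typing import List


def classify_copy_number(cn: int, baseline: int = 2, amp_threshold: int = 4) -> List[str]:
    """Return qualitative labels for an integer copy number.

    Assumptions (not a Beacon definition): `baseline` is the normal
    allele count (2n), HOMODEL means 0 copies, and AMP means the copy
    number reaches an arbitrary threshold.
    """
    if cn < 0:
        raise ValueError("copy number cannot be negative")
    labels: List[str] = []
    if cn < baseline:
        labels.append("DEL")
        if cn == 0:
            labels.append("HOMODEL")  # homozygous deletion, 0n
    elif cn > baseline:
        labels.append("DUP")
        if cn >= amp_threshold:
            labels.append("AMP")      # high-level gain
    return labels


# A query for "only homozygous deletions" would then filter on HOMODEL.
homozygous_only = [cn for cn in (0, 1, 2, 3, 6) if "HOMODEL" in classify_copy_number(cn)]
```

A resource that stores only DUP/DEL calls could still answer the qualitative part of such a query, which is why the integer levels can be treated as an optional refinement.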
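Conclusion:
* At least for 0.4, implement qualitative DUP/DEL calls.
* Keep in mind future extensibility towards integer CN thresholding.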
In order to be consistent, we should have:
INS[ATGC]+ DEL[0-9]* DUP[0-9]*
+1
I don't understand the issue. If I'm not the only one, perhaps we could chat it through in a Beacon TC? Tony
It's just a small change with respect to the proposal at the top of the page, i.e., we had:
INS[ATGC]+ DEL[0-9]* DUP
we now propose to update it to:
INS[ATGC]+ DEL[0-9]* DUP[0-9]*
i.e. only adding [0-9]* to DUP, so that both DUP and DEL can take integer values (as is already the case in the current v0.3 and earlier specifications).
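For reference, the updated proposal can be read as three alternative patterns. The sketch below checks a value against them; the names `ALT_PATTERN` and `is_valid_alt` are hypothetical, and the meaning of the optional trailing integer is not defined in this thread, so the code only validates the syntax.

```python
import re

# The three alternatives from the proposal: INS[ATGC]+  DEL[0-9]*  DUP[0-9]*
ALT_PATTERN = re.compile(r"INS[ATGC]+|DEL[0-9]*|DUP[0-9]*")


def is_valid_alt(value: str) -> bool:
    """Return True if the whole string matches one of the proposed forms."""
    return ALT_PATTERN.fullmatch(value) is not None


examples = ["INSATG", "DEL", "DEL12", "DUP", "DUP3", "INS", "DUPA"]
results = {value: is_valid_alt(value) for value in examples}
```

Under this reading, "INS" alone is rejected because an insertion needs at least one base, while "DEL" and "DUP" are accepted with or without an integer.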
Aha - thanks!
+1
T
Hi All
I'd like to reflect on some basics about 'what is a Beacon', and hence thereafter decide on what to turn it into...
Currently Beacon asks about 'a single base allele'. Originally this meant (1) "any subject-specific record where the query allele is present" (whether heterozygous or homozygous)
But we now allow people to use Beacon to ask (2) "any database record referring to the query allele" (could be population frequency data, protein structure or pathogenicity consequences, animal model correlates, etc)
We decided in Hinxton last week that both (1) and (2) are acceptable
Regarding queries that focus on records about the properties of human subjects (as opposed to the properties of variants): we have never yet tried to let queries distinguish between standard genotypes (homozygous or heterozygous presence of the query allele). If we did, this could quickly expand into asking about zygosity generally (e.g., hemizygous, Y chm markers or X markers in females, polyploidy, etc.). That would open a way into a general solution for genome counts of an allele (which could be fractional, ranges, >, <, etc.), which then, if opened up to variants other than single base changes, provides a way to handle copy number variation.
In parallel we'd also want a way to specify a query on a local haplotype (i.e., one chm rather than one genome level)
SO I PROPOSE WE ENABLE QUERIES THAT SPECIFY CLEANLY AND SIMPLY
Beyond this, I'd like to see the Beacon query language able to ask whether data/records exist that relate to a genome region defined by a start and stop base (which could be one and the same), and how those data/annotations match the target region (exact|exceed|begin_between|end_between|begin_and_end_between|only_begin_between|only_end_between|begin_at_start|end_at_stop)
Plus a search option for specific sequence strings
Then one could easily imagine combinations of the above, eg:
'SNP allele' located 'exact' at 'Chm:2/start_base:6543' and 'Chm:2/stop_base:6543'
'SNP allele' located 'exact' at 'Chm:2/start_base:6543' and 'Chm:2/stop_base:6543' with 'Count_In_Genome > 0.5'
'TTAGGAGG' located 'begin_between' 'Chm:2/start_base:6543' and 'Chm:2/stop_base:6553'
'Copy number variant allele X' with 'Count_In_Genome > 4'
'Copy number variant allele X' with 'Count_In_haplotype > 2' where haplotype at 'Chm:2/start_base:5,000' and 'Chm:2/stop_base:100,000'
Thoughts...? Tony
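To help discuss the match types listed above, here is a minimal sketch of one possible interpretation. The thread does not define their semantics, so every rule below is an assumption: coordinates are treated as inclusive, `begin`/`end` describe the stored record, `start`/`stop` describe the query region, and the function name `match_modes` is hypothetical.

```python
from typing import Set


def match_modes(begin: int, end: int, start: int, stop: int) -> Set[str]:
    """Return every match type that holds for a record versus a query region.

    All definitions are assumed for illustration, not taken from a spec.
    """
    begin_in = start <= begin <= stop  # record begins inside the region
    end_in = start <= end <= stop      # record ends inside the region
    checks = {
        "exact": begin == start and end == stop,
        "exceed": begin < start and end > stop,  # record extends past both ends
        "begin_between": begin_in,
        "end_between": end_in,
        "begin_and_end_between": begin_in and end_in,
        "only_begin_between": begin_in and not end_in,
        "only_end_between": end_in and not begin_in,
        "begin_at_start": begin == start,
        "end_at_stop": end == stop,
    }
    return {name for name, holds in checks.items() if holds}


# An 8-base string placed at 6545-6552, queried against region 6543-6553.
modes = match_modes(6545, 6552, 6543, 6553)
```

With these definitions several match types can hold at once (a single-base exact match also satisfies begin_between and end_between), which may be worth settling before the list is fixed.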
Michael Baudis wrote:
Conclusion:
* At least for 0.4, implement qualitative DUP/DEL calls.
* Keep in mind future extensibility towards integer CN thresholding.
@antbro Maybe you could move this to a separate doc which can be edited/commented on? I think it would be best to have the specific use cases listed and discussed, which is tricky with this format here on GitHub.
(overall your examples are in my line of thinking)
For info linked here, a write-up of options for range queries, using a VCF:INFO approach (but pointing to alternative use of other attributes):
https://docs.google.com/document/d/1uePLlLMl0FzxZxDrsF9IxsC2nYvZ84029fzUD1ULNWI/edit#
I was under the impression that version 0.4 would implement complex variants like the ones in the document https://docs.google.com/document/d/1uePLlLMl0FzxZxDrsF9IxsC2nYvZ84029fzUD1ULNWI Am I wrong?
I believe the decision made on today's call was to implement the first step as described in https://github.com/ga4gh/beacon-team/issues/20#issuecomment-230438376 and put off the changes proposed in the document above to a later time.
There would be just one additional field to add: alternateBasesInfo. It is an optional parameter, so the change is completely backwards compatible.
Details are in the following pull request: https://github.com/ga4gh/beacon-team/pull/65
To summarize related decisions made during the workshop yesterday:
We're going with https://github.com/ga4gh/beacon-team/pull/94 over https://github.com/ga4gh/beacon-team/pull/95 as the base for implementation.
@mcupak I've added a comment to the DUP,DEL... PR https://github.com/ga4gh/beacon-team/blob/develop-proto-structural_and_ranges/src/main/proto/ga4gh/beacon.proto#L50
Actually, on re-reading VCF, the reference value can stay "required", since values of A,C,G,T,N are permitted (this is conceptually slightly different from a `.` as recommended for a missing value, but practically the same).
Is this sufficiently verbose?
// Reference bases for this variant (starting from `start`).
//
// Accepted values: see the REF field in VCF 4.2 specification
// (https://samtools.github.io/hts-specs/VCFv4.2.pdf).
// When querying for variants without specific base alterations (e.g.
// imprecise structural variants with separate variant_type as well as
// start_min & end_min ... parameters), the use of a single "N" value is
// recommended.
string reference_bases = 8;
Closing since implemented in develop-proto branch.
Proposal by Michael Baudis, please elaborate if insufficient.
For example: INS[ATGC]+ DEL[0-9]* DUP
Discussion on interpretation and use cases was already started in this document: https://docs.google.com/document/d/1PfSt0o0m59BRs92PtyDcP31fUl8QgMYSTiclHAXCG0s/edit?usp=sharing