Closed skeenan closed 9 years ago
Great discussion today on the DWG call. One important thing that I want to capture is the suggestion that the specification include the ability to re-call the variants once a large number of raw read sets have been compiled into the database. It is certain that somebody in the project will want to do this. This could be a major advantage of our approach over others who freeze the variant calls and are not able to go back to the raw reads to re-call.
@skeenan, This is a fantastic project!
@haussler, these are fairly standard implementations in search engine design, via inverted indices, which comes from the field of Information Retrieval (see my post #142). Another benefit is that these can easily be parallelized using MapReduce in Hadoop.
Were there other things discussed on the call? How do we join these calls? It would be easier to provide ideas if we knew what was discussed, the next action items, and their priorities.
In any case, below are the two sets of inverted indices:
Reads to Variants:
Variants to Reads:
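As a hedged sketch (the original index definitions are not reproduced here, and the read/variant identifiers below are hypothetical), the two inverted indices might look like:

```python
# Minimal sketch of the two inverted indices; identifiers are made-up examples.
from collections import defaultdict

# Toy evidence: which reads support which variant calls.
read_supports = [
    ("read_001", "chr17:g.43071077T>C"),
    ("read_001", "chr17:g.43057063G>A"),
    ("read_002", "chr17:g.43071077T>C"),
]

# Reads -> Variants: for each read, the variants it supports.
reads_to_variants = defaultdict(set)
# Variants -> Reads: for each variant, the reads supporting it.
variants_to_reads = defaultdict(set)

for read_id, variant_id in read_supports:
    reads_to_variants[read_id].add(variant_id)
    variants_to_reads[variant_id].add(read_id)

# Re-calling a variant can then start from variants_to_reads[variant_id]
# to recover the raw read evidence.
```

Both indices are built in one pass over the read/variant pairs, and lookups in either direction are then constant-time, which is what makes the search-engine analogy work.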
Thanks, Paul
There would be lots of interesting technicalities to get excited / worried about in re-calling over heterogeneous datasets. Would we need to factor available NGS data by WGS/WES/panel, hi/lo coverage, sequencing instrument/chemistry, etc. to avoid massive batch effects in the joint calls? If/when data are generated with different mappers, you might actually want to realign before re-calling. Maybe the gVCF representation would help abstract that stuff away, but it has its own limitations (discussed in #145). Lastly, suppose (hypothetically) we resign ourselves to merely merging frozen variant sets from different population sequencing projects: how do we even do that in a minimally-fallacious way if the projects have different targets/coverage/chemistry/instruments/bioinformatics?
On the one hand it's tempting to stay within the comfortable confines of 1000G to avoid all this, and that's probably the only choice if ASHG is still a timeline factor. On the other hand these seem to me like issues the GA4GH needs to address head-on towards the goal of pooling everyone's data...
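One way to make the batch-effect concern concrete: group read sets by the covariates that drive batch effects before joint re-calling. A minimal sketch, assuming hypothetical metadata field names:

```python
# Hedged sketch: partition read sets by batch-effect covariates
# (assay, platform, coverage tier) before joint re-calling.
# All field names and values here are hypothetical examples.
from collections import defaultdict

read_sets = [
    {"id": "rs1", "assay": "WGS", "platform": "HiSeq2000", "coverage": 30},
    {"id": "rs2", "assay": "WES", "platform": "HiSeq2500", "coverage": 80},
    {"id": "rs3", "assay": "WGS", "platform": "HiSeq2000", "coverage": 32},
]

def coverage_tier(x: int) -> str:
    # Arbitrary illustrative threshold between low and high coverage.
    return "high" if x >= 20 else "low"

batches = defaultdict(list)
for rs in read_sets:
    key = (rs["assay"], rs["platform"], coverage_tier(rs["coverage"]))
    batches[key].append(rs["id"])

# Each batch could then be re-called separately, or the batch key carried
# as a covariate, rather than pooling heterogeneous data naively.
```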
Finally, my notes of Heidi describing salient questions of allele frequency (AF) reference data which arise during a variant work-up:
- If AF is high (in any subpopulation), then high-penetrance pathogenicity is unlikely.
- If AF is 0, we need to ask whether the allele is really absent or merely a no-call. Useful to consult NGS read coverage.
- If AF is really low, then we become intensely interested in pheno/clinical information on the focal subjects.
Totally agree, Mike. Re-calling is not something to attempt before ASHG. Supporting it should be a long-term goal for the read and variant task teams, however messy it is. -D
@mlin and @haussler, some of these issues can be addressed by running standardized samples on multiple platforms with the different protocols, which would give us thresholds for standardizing the quality-control pipeline and any other pipelines we require. These runs would of course need to be analyzed to understand the differences, so that any pipeline modules can be adjusted accordingly.
Regarding reads and variants, anything not presently captured as schema fields can be captured via key-value pairs. These can serve as input parameters for our pipelines. Once we get to the pipeline step, we can hash out this project for analysis, along with any other large-scale analysis approaches we want to test. The parameters we can fix or adjust as necessary, and then compare with the analyses already performed on any published datasets.
In any case, if you have the variants, you must have performed the QC and all the other steps using the reads beforehand. Thus getting back to the reads that generated these variants - including any necessary post-processing pipeline steps - should be straightforward using the data structure I suggested above.
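The key-value idea above can be sketched as follows; the key names and values here are hypothetical, not part of any agreed schema:

```python
# Hedged sketch: pipeline provenance not covered by dedicated schema fields
# carried as key-value pairs on a read group. All keys/values are
# hypothetical examples for illustration.
read_group_info = {
    "aligner": ["bwa-mem"],
    "aligner_version": ["0.7.10"],
    "capture_kit": ["hypothetical_exome_v1"],
    "mean_coverage": ["42.0"],
}

def pipeline_inputs(info: dict) -> dict:
    """Flatten single-valued key-value pairs into pipeline parameters."""
    return {k: v[0] for k, v in info.items() if len(v) == 1}

params = pipeline_inputs(read_group_info)
# params can now be fixed or adjusted per pipeline run, and recorded
# alongside the resulting variant set for later comparison.
```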
Regarding point #3, the LRG team spent a lot of time annotating a "canonical" BRCA1 gene with experts in the field:
@dzerbino, I understand what you mean, but isn't this precisely within the scope of GA4GH? Below is a link that I feel summarizes the purpose and the scope of GA4GH perfectly:
GA4GH has quite a lot of expertise already, and I am sure we can attract more as necessary. I'm sure other experts in the field would be glad to provide the necessary recommendations as we progress. We have to start somewhere and show something, and with what we designed already such a project will provide us the framework of how to improve our design and implementation.
We will definitely use and defer to LRG. It is wonderful that they have already helped the community come together and agree, and the major groups annotating BRCA1 already refer to the same transcript (NM_007294.3), which is in LRG, CCDS, RefSeq, etc. and on all the browsers. Similar for BRCA2 (NM_000059.3, http://www.ncbi.nlm.nih.gov/nuccore/NM_000059.3). It is a huge leg up on the BRCA challenge project. Now all we have to do is build API-accessible ways to gather all the raw and interpreted data on these genes together. -D
Absolutely agree @haussler! And re-reading @dzerbino's post from the perspective of consensus, this can be a lot of fun, and a great "quick" win for everyone around, if we build on all the great resources already out there and inside GA4GH.
As part of our Genome in a Bottle and Benchmarking work, we've been working on ways to identify sites that have systematic errors. Between the work we've done and Heng Li's analysis of haploid genomes, it probably wouldn't be too hard to create a good list of sites that are prone to systematic errors in BRCA1 and BRCA2, and it seems like this is potentially a really useful resource for the community. Do you think this might be a useful goal to add to this project?
This sounds like a very useful idea indeed! I'm just wondering whether anyone (e.g. in the 1000 Genomes Consortium) has already compiled such a list?
I don't know of any analyses 1000 Genomes has done like this, but I do expect we could learn some interesting things from the 1000 Genomes data related to systematic errors, like maybe flagging sites that are out of Hardy-Weinberg equilibrium, or that are filtered for some other reasons.
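The Hardy-Weinberg flagging idea can be sketched as a simple chi-square test on genotype counts (illustrative only; exact tests are preferred for small counts, and real pipelines would stratify by population):

```python
# Hedged sketch: flag a biallelic site as a systematic-error candidate when
# its genotype counts deviate strongly from Hardy-Weinberg equilibrium.

def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic for departure from HWE, given genotype counts
    for homozygous-ref, heterozygous, and homozygous-alt samples."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the reference allele
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_aa, n_ab, n_bb]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# A site where every sample is heterozygous is a classic signature of a
# systematic artifact (e.g. a paralog collapsed onto the reference):
stat = hwe_chi_square(0, 100, 0)  # HWE would expect roughly 25/50/25
flagged = stat > 3.84  # ~p < 0.05 with 1 degree of freedom
```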
yes!
Here are a couple, but it requires teasing out the information :)
http://www.biomedcentral.com/1471-2164/15/516
http://www.nature.com/ejhg/journal/v21/n8/full/ejhg2012270a.html
~p
The BRCA challenge is a driving project of the GA4GH. Its mission is to construct a global database of BRCA1 and BRCA2 variants, to facilitate rapid and reliable clinical diagnosis and to support collaborative research to reduce their impact on human health.
This is a cross-cutting, interdisciplinary project of the GA4GH. The DWG will provide the informatics for the project, and our working group is now forming a new project team to design APIs and build the database for the BRCA challenge.
The team is open to all DWG members able to contribute to these immediate practical steps: