ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

BRCA Project Team - Initial Steps #154

Closed skeenan closed 9 years ago

skeenan commented 10 years ago

The BRCA Challenge is a driving project of the GA4GH. Its mission is to construct a global database of BRCA1 and BRCA2 variants, to facilitate rapid and reliable clinical diagnosis, and to support collaborative research to reduce their impact on human health.

This is a cross-cutting, interdisciplinary project of the GA4GH. The DWG will provide the informatics for the project, and our working group is now forming a new project team to design APIs and build the database for the BRCA Challenge.

The team is open to all DWG members able to contribute to these immediate practical steps:

  1. Provide CWG-BRCA with the capability of representing phased data.
  2. Work with CWG-BRCA to define the region of interest on GRCh38 (probably the BRCA1 and BRCA2 genes plus many kb on either side).
  3. Work with CWG-BRCA to define the "canonical transcript" on GRCh38 for each gene (coordinates of the coding and UTR exons for a single designated transcript), starting with the CCDS reference gene closest to what they want.
  4. Define a minimal API for publicly available "population" data such as 1000 Genomes.
  5. Build a simple database holding the 1000 Genomes BRCA genome data as a demo of the GA4GH API.

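A minimal sketch of what the step-4 population-data query could look like. Everything here (request/response shapes, field names, the toy in-memory store) is illustrative only, not the eventual GA4GH API:

```python
from dataclasses import dataclass

# Hypothetical request/record shapes for a minimal "population
# variants" API; field names are assumptions, not the real schema.

@dataclass
class SearchVariantsRequest:
    reference_name: str   # e.g. "chr17" on GRCh38
    start: int            # 0-based, inclusive
    end: int              # 0-based, exclusive

@dataclass
class Variant:
    reference_name: str
    position: int
    reference_bases: str
    alternate_bases: str
    allele_frequency: float

def search_variants(request, variant_store):
    """Return all variants falling inside the requested interval."""
    return [
        v for v in variant_store
        if v.reference_name == request.reference_name
        and request.start <= v.position < request.end
    ]

# Demo: a toy in-memory store standing in for 1000 Genomes data
# (positions and frequencies are made up for illustration).
store = [
    Variant("chr17", 43_044_300, "A", "G", 0.12),
    Variant("chr13", 32_315_500, "C", "T", 0.03),
]
hits = search_variants(SearchVariantsRequest("chr17", 43_000_000, 43_200_000), store)
```

The point of the sketch is only that step 4 reduces to interval queries over a per-reference variant store, which any backend (step 5's simple database included) can serve.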
haussler commented 10 years ago

Great discussion today on the DWG call. One important thing that I want to capture is the suggestion that the specification include the ability to re-call the variants once a large number of raw read sets have been compiled into the database. It is certain that somebody in the project will want to do this. This could be a major advantage of our approach over others who freeze the variant calls and are not able to go back to the raw reads to re-call.

pgrosu commented 10 years ago

@skeenan, This is a fantastic project!

@haussler, these are fairly standard implementations in search-engine design, via inverted indices, which come from the field of Information Retrieval (see my post in #142). Another benefit is that they can easily be parallelized using MapReduce on Hadoop.

Were there other things discussed on the call? How do we join these calls? It would be easier to provide ideas if we knew what was discussed, the next action items, and their priorities.

In any case, below are the two sets of inverted indices:

Reads to Variants: reads2vars

Variants to Reads: vars2reads
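A toy sketch of the two indices, built in a single pass over a hypothetical evidence table of (read, variant) support pairs; real implementations would shard this, e.g. with MapReduce as noted above:

```python
from collections import defaultdict

# Toy evidence table: which reads support which variant calls.
# IDs are made up for illustration.
call_evidence = [
    ("read_001", "var_A"),
    ("read_001", "var_B"),
    ("read_002", "var_A"),
]

reads2vars = defaultdict(set)  # read id    -> variant ids it supports
vars2reads = defaultdict(set)  # variant id -> read ids supporting it
for read_id, var_id in call_evidence:
    reads2vars[read_id].add(var_id)
    vars2reads[var_id].add(read_id)
```

With `vars2reads` in hand, re-calling a variant means looking up exactly the reads that produced it, rather than rescanning the whole read store.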

Thanks, Paul

mlin commented 10 years ago

There would be lots of interesting technicalities to get excited / worried about in re-calling over heterogeneous datasets. Would we need to factor available NGS data by WGS/WES/panel, high/low coverage, sequencing instrument/chemistry, etc., to avoid massive batch effects in the joint calls? If/when data are generated with different mappers, you might actually want to realign before re-calling. Maybe the gVCF representation would help abstract that stuff away, but it has its own limitations (discussed in #145). Lastly, suppose (hypothetically) we resign ourselves to merely merging frozen variant sets from different population sequencing projects: how do we even do that in a minimally fallacious way if the projects have different targets/coverage/chemistry/instruments/bioinformatics?

On the one hand it's tempting to stay within the comfortable confines of 1000G to avoid all this, and that's probably the only choice if ASHG is still a timeline factor. On the other hand these seem to me like issues the GA4GH needs to address head-on towards the goal of pooling everyone's data...

Finally, my notes of Heidi describing salient questions about allele frequency (AF) reference data which arise during a variant work-up:

  • If AF is high (in any subpopulation), then high-penetrance pathogenicity is unlikely.
  • If AF is 0, ask whether the allele is really absent or merely a no-call; it is useful to consult NGS read coverage.
  • If AF is really low, then we become intensely interested in phenotypic/clinical information on the focal subjects.
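That AF-based triage could be sketched roughly as follows; the thresholds are arbitrary placeholders for illustration, not clinical guidance:

```python
def triage_by_allele_frequency(af, has_read_coverage):
    """Rough triage of a variant by population allele frequency (AF).

    `af` is the highest AF observed in any subpopulation, or None if
    the allele was never called; thresholds are placeholders.
    """
    if af is None or af == 0.0:
        # Absent or merely a no-call? NGS read coverage distinguishes them.
        return "absent" if has_read_coverage else "no-call; check coverage"
    if af > 0.01:
        return "common; high-penetrance pathogenicity unlikely"
    return "rare; seek phenotypic/clinical data on carriers"
```

The second branch is where AF reference data alone stops being enough and read-level evidence (hence the read APIs) becomes necessary.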

haussler commented 10 years ago

Totally agree Mike. Re-calling is not something to attempt before ASHG. Supporting it should be a long term goal for the read and variant task teams, however messy it is. -D


pgrosu commented 10 years ago

@mlin and @haussler, some of these issues can be addressed by running standardized samples on multiple platforms with the different protocols, to establish thresholds for standardizing the quality-control pipeline and any other pipelines we require. The results will, of course, need to be analyzed to understand the differences so that pipeline modules can be adjusted.

Regarding reads and variants, whatever is not presently captured as fields can be captured via the key-value pairs. These can be input parameters used in our pipelines. Once we get to the pipeline step, we can hash out this project for analysis, along with any other large-scale analysis approaches we want to test. The parameters we can fix or adjust as necessary, and then compare with analyses already performed for any published datasets.

In any case, if you have the variants, you must have performed QC and all the other steps using the reads beforehand. Thus, getting back to the reads that generated these variants should be straightforward, including any necessary post-processing pipeline steps, using the data structure I suggested above.
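A sketch of carrying such parameters on a variant record through a generic key-value map (the record shape and key names here are assumptions for illustration, not the actual schema fields):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VariantRecord:
    # Core identifier, plus a generic key-value map for anything not
    # captured as an explicit schema field (shape assumed here).
    id: str
    info: Dict[str, List[str]] = field(default_factory=dict)

v = VariantRecord(id="var_A")
# Record pipeline provenance so the call can be traced back to its
# reads and re-run; keys and values are hypothetical examples.
v.info["caller"] = ["gatk-3.2", "--stand_call_conf", "30"]
v.info["source_readgroups"] = ["rg_NA12878_lane1"]
```

Storing the calling parameters alongside the call is what makes the later "re-call from raw reads" step reproducible rather than guesswork.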

dzerbino commented 10 years ago

Regarding point 3, the LRG team spent a lot of time annotating a "canonical" BRCA1 gene with experts in the field:

http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_292.xml

pgrosu commented 10 years ago

@dzerbino, I understand what you mean, but isn't this precisely within the scope of GA4GH? Below is a link that, I feel, summarizes the purpose and scope of GA4GH perfectly:

https://www.coriell.org/media-center/press-releases/item/337-coriell-institute-aligns-with-international-genomics-consortium

GA4GH already has quite a lot of expertise, and I am sure we can attract more as necessary. Other experts in the field would be glad to provide the necessary recommendations as we progress. We have to start somewhere and show something, and with what we have designed already, such a project will give us a framework for improving our design and implementation.

haussler commented 10 years ago

We will definitely use and defer to LRG. It is wonderful that they have already helped the community come together and agree; the major groups annotating BRCA1 already refer to the same transcript (NM_007294.3), which is in LRG, CCDS, RefSeq, etc., and on all the browsers. Similarly for BRCA2 (NM_000059.3, http://www.ncbi.nlm.nih.gov/nuccore/NM_000059.3). It is a huge leg up for the BRCA Challenge project. Now all we have to do is build API-accessible ways to gather all the raw and interpreted data on these genes together. -D


pgrosu commented 10 years ago

Absolutely agree, @haussler! And re-reading @dzerbino's post from the perspective of consensus, this could be a lot of fun, and a great "quick win" for everyone, if we build on all the great resources already out there and inside GA4GH.

jzook commented 10 years ago

As part of our Genome in a Bottle and Benchmarking work, we've been working on ways to identify sites that have systematic errors. Between the work we've done and Heng Li's analysis of haploid genomes, it probably wouldn't be too hard to create a good list of sites that are prone to systematic errors in BRCA1 and BRCA2, and it seems like this is potentially a really useful resource for the community. Do you think this might be a useful goal to add to this project?

dzerbino commented 10 years ago

This sounds like a very useful idea indeed! I'm just wondering whether anyone (e.g. in the 1000 Genomes Consortium) has already compiled such a list?

jzook commented 10 years ago

I don't know of any analyses like this that 1000 Genomes has done, but I do expect we could learn some interesting things about systematic errors from the 1000 Genomes data, like maybe flagging sites that are out of Hardy-Weinberg equilibrium, or that are filtered for some other reason.
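As a rough illustration of the Hardy-Weinberg flag, a one-site chi-square check from genotype counts (plain chi-square with no continuity correction; a real screen would likely use an exact test):

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for departure from Hardy-Weinberg
    equilibrium at one biallelic site, given genotype counts
    (homozygous ref, heterozygous, homozygous alt)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# A site with a large excess of heterozygotes (a classic signature of
# mapping artifacts in duplicated regions) scores high:
stat = hwe_chi_square(10, 80, 10)  # -> 36.0
```

Sites where this statistic is extreme across many samples would be candidates for the systematic-error list, alongside sites failing other filters.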

haussler commented 10 years ago

yes!


pgrosu commented 10 years ago

Here are a couple, though teasing out the information takes some work :)

http://www.biomedcentral.com/1471-2164/15/516

http://www.nature.com/ejhg/journal/v21/n8/full/ejhg2012270a.html

~p