bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Germline variant calling from normal samples of cancer variant calling #1173

Closed pengxiao78 closed 8 years ago

pengxiao78 commented 8 years ago

Hi, I have finished the tumor-normal paired cancer variant calling pipeline and found the final VCF files for somatic mutations in the "final" folder. I would now also like to get the germline variants for all the normal samples. Can I avoid re-running the alignment steps for the normal samples by reusing the BAM files generated during the cancer variant calling, and run only the germline variant calling step for the normal samples in the existing "work" folder? How can I do that? Thank you!

chapmanb commented 8 years ago

The current development version of bcbio produces a germline output file along with the current somatic calls, so it can do this for you without any additional work. If you upgrade to the latest development version:

bcbio_nextgen.py upgrade -u development

then re-run your pipeline in place, it will calculate germline calls and add them to the final output directory. Hope this helps.

pengxiao78 commented 8 years ago

Hi Brad, I have finished the steps you mentioned. However, this method appears to generate variants only from tumor-only samples with comparing with the normal samples. My question is how to get germline calls from the normal samples, not the tumor samples. Can I generate these by reusing the available BAM files for the normal samples? How can I do that? Thank you! Peng

From: Brad Chapman [mailto:notifications@github.com] Sent: Thursday, January 07, 2016 7:19 PM To: chapmanb/bcbio-nextgen bcbio-nextgen@noreply.github.com Cc: Xiao, Peng peng.xiao@unmc.edu Subject: Re: [bcbio-nextgen] Germline variant calling from normal samples of cancer variant calling (#1173)

pengxiao78 commented 8 years ago

Sorry, a typo. I meant that this method appears to generate variants only from tumor-only samples WITHOUT comparing with the normal samples. My question is how to get germline calls from the normal samples, not the tumor samples. Thanks!

From: Xiao, Peng Sent: Friday, January 08, 2016 4:22 PM To: 'chapmanb/bcbio-nextgen' reply@reply.github.com Subject: RE: [bcbio-nextgen] Germline variant calling from normal samples of cancer variant calling (#1173)

chapmanb commented 8 years ago

Peng; The germline calling does use the normal sample, so it is doing what you're looking for. Given a tumor/normal pair you get:

Hope this provides what you're looking for.

pengxiao78 commented 8 years ago

Brad, the germline output only contains one germline VCF file per variant caller. Is there any way I can also get a single ensemble germline VCF file? Thanks!

chapmanb commented 8 years ago

Peng; We don't currently have a way to do this, but it is something we'd like to do in the future. It's a larger project since tools have different amounts of support for calling germline variants, and we might also want to be able to include germline calls from approaches like GATK HaplotypeCaller or FreeBayes. Sorry to not have anything you can use for this right now.
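
In the meantime, a rough workaround is to intersect the per-caller germline VCFs yourself and keep sites reported by at least N callers. The sketch below is not bcbio's ensemble method. It matches records only on CHROM, POS, REF and ALT, ignores genotypes and variant normalization, and the function names (`read_sites`, `consensus_sites`) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Site = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def read_sites(vcf_lines: Iterable[str]) -> Set[Site]:
    """Collect one (CHROM, POS, REF, ALT) key per ALT allele from VCF text lines."""
    sites: Set[Site] = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):
            sites.add((chrom, int(pos), ref, alt))
    return sites


def consensus_sites(per_caller: Dict[str, Set[Site]], min_callers: int = 2) -> List[Site]:
    """Return the sorted sites reported by at least `min_callers` callers."""
    counts: Counter = Counter()
    for sites in per_caller.values():
        counts.update(sites)
    return sorted(site for site, n in counts.items() if n >= min_callers)


# Tiny in-memory example; real input would be the lines of each caller's VCF.
freebayes = read_sites(["1\t100\t.\tA\tG\t.\tPASS\t.\n", "1\t200\t.\tC\tT,G\t.\tPASS\t.\n"])
vardict = read_sites(["1\t100\t.\tA\tG\t.\tPASS\t.\n", "1\t200\t.\tC\tG\t.\tPASS\t.\n"])
varscan = read_sites(["1\t300\t.\tT\tA\t.\tPASS\t.\n"])
shared = consensus_sites({"freebayes": freebayes, "vardict": vardict, "varscan": varscan})
```

Because callers can represent the same indel differently, normalizing the VCFs before comparing them would likely be needed for anything beyond SNPs.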

pengxiao78 commented 8 years ago

Brad, how can I find, in the germline VCF file, germline mutations that are unique to the normal and lost in the tumor due to coverage deletion issues? I tried to work this out from the GT codes in the genotype fields but still have no clear idea. Thanks!
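
One possible way to approach this from the GT codes, assuming the VCF carries both a normal and a tumor sample column (an assumption about the file, not something confirmed in this thread), is to keep records where the normal genotype has a non-reference allele and the tumor genotype is called but reference-only. The function names and sample names below are hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def alt_alleles(gt: str) -> set:
    """Non-reference allele indices in a GT string such as '0/1' or '1|2'."""
    return {a for a in re.split(r"[/|]", gt) if a not in ("0", ".", "")}


def is_called(gt: str) -> bool:
    """True if at least one allele in the GT string is not missing."""
    return any(a not in (".", "") for a in re.split(r"[/|]", gt))


def normal_only(vcf_lines: Iterable[str], normal: str, tumor: str) -> List[Tuple[str, int, str, str]]:
    """Sites where the normal has an ALT allele and the tumor is called reference-only."""
    names = None
    hits = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            names = cols[9:]  # sample names follow the FORMAT column
            continue
        if names is None:
            raise ValueError("no #CHROM header line seen before the first record")
        gt_idx = cols[8].split(":").index("GT")
        gts = {name: col.split(":")[gt_idx] for name, col in zip(names, cols[9:])}
        if alt_alleles(gts[normal]) and is_called(gts[tumor]) and not alt_alleles(gts[tumor]):
            hits.append((cols[0], int(cols[1]), cols[3], cols[4]))
    return hits


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tnormal\ttumor",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:30\t0/0:25",  # lost in tumor
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0/1:30\t0/1:28",  # present in both
    "1\t300\t.\tG\tA\t.\tPASS\t.\tGT:DP\t1/1:30\t./.:0",   # tumor not called
]
lost = normal_only(example, normal="normal", tumor="tumor")
```

A reference-only tumor genotype can also come from low tumor coverage at that site, so the depth fields are worth checking before reading a hit as a real loss.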

pengxiao78 commented 8 years ago

In addition, I found that all the VCF files generated by MuTect (both somatic and germline) have a problematic FORMAT column for the germline sample. For example, the sample column for the tumor is 0/1:73,1:36:74:0.014 but the sample column for the germline (the last column in the MuTect VCF file) is 0:427,26:.:284:0.057. Therefore, the FILTER column for all variants in the somatic MuTect VCF is REJECT, so I am afraid that none of the MuTect variants have been considered in the ensemble method. The germline MuTect VCF files have the same errors in the germline sample column, except that FILTER shows PASS. Could you please help me debug and fix it? Thanks!

chapmanb commented 8 years ago

I'm not sure if I fully understand your questions but my thoughts are:

Practically, if you really care about germline calls right now, I'd suggest running a separate variant calling pipeline on the normal sample using a germline caller like GATK HaplotypeCaller or FreeBayes. Ideally this is something we could have happen automatically as part of a single run, but we've not yet done the work to automate it in bcbio. Hope this helps.
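
A separate run of this kind could start from the existing normal BAM files rather than the original reads. The sketch below renders a minimal bcbio-style sample configuration as text. It is written from memory of the bcbio configuration format and should be checked against the documentation: the `aligner: false` setting for pre-aligned BAM input, the `variant2` analysis name, the genome build, and all file paths and sample names are assumptions, and `germline_config` is a hypothetical helper.

```python
from typing import Dict, List


def germline_config(normal_bams: Dict[str, str], genome_build: str, callers: List[str]) -> str:
    """Render a bcbio-style sample YAML that calls germline variants on pre-aligned BAMs."""
    lines = ["details:"]
    for name, bam in sorted(normal_bams.items()):
        lines += [
            f"  - files: [{bam}]",
            f"    description: {name}",
            "    analysis: variant2",
            f"    genome_build: {genome_build}",
            "    algorithm:",
            "      aligner: false  # input is already aligned, so skip alignment",
            f"      variantcaller: [{', '.join(callers)}]",
        ]
    lines += ["upload:", "  dir: ../final"]
    return "\n".join(lines) + "\n"


# Hypothetical sample names and BAM paths; substitute the normal BAMs from the earlier run.
config_text = germline_config(
    {"normal1": "/path/to/normal1-ready.bam", "normal2": "/path/to/normal2-ready.bam"},
    genome_build="GRCh37",
    callers=["freebayes"],
)
print(config_text)
```

Running this in a fresh project directory, rather than inside the tumor/normal "work" folder, keeps the two analyses from interfering with each other.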

pengxiao78 commented 8 years ago

Thanks, Brad. For MuTect, I might not have described it clearly. I meant the mutect.vcf files generated by all my previous somatic mutation pipeline runs, not the mutect-germline.vcf. In those files the GT format is truncated and all the variants have been rejected by the filter. So, although I used variantcaller: [mutect, freebayes, vardict, varscan] in the YAML file with an ensemble >= 2 criterion, I found that the MuTect variants were not used in the ensemble selection at all because of the malformed GT for one (normal) of the two (normal, tumor) samples, such as GT:AD:DP:FREQ 0:15,0:10:0.00 0/1:2,0:2:0.00. I have included this example line from the mutect.vcf below for details. I did not notice this problem earlier because I used the ensemble VCF file directly and did not check each caller's VCF. The other three callers' VCF files are all fine. Is there any way to fix the MuTect VCF error in the somatic calling? Otherwise, the MuTect caller is not useful at all. Thanks again.

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Germ Tumor

1 10070 . C G . REJECT EFF=UPSTREAM(MODIFIER||1799|||DDX11L1|processed_transcript|NON_CODING|ENST00000456328||G),UPSTREAM(MODIFIER||1802|||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST00000515242||G),UPSTREAM(MODIFIER||1804|||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST00000518655||G),UPSTREAM(MODIFIER||1940|||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST00000450305||G),DOWNSTREAM(MODIFIER||4293|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504||G),DOWNSTREAM(MODIFIER||4293|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000541675||G),DOWNSTREAM(MODIFIER||4293|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562||G),DOWNSTREAM(MODIFIER||4334|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000488147||G),DOWNSTREAM(MODIFIER||4341|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000538476||G),INTERGENIC(MODIFIER||||||||||G) GT:AD:DP:FREQ 0:15,0:10:0.00 0/1:2,0:2:0.00
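
A quick way to check whether a sample column is actually missing fields is to pair each FORMAT key with the corresponding value for every sample. The sketch below does this for a shortened copy of the record above, with the long INFO annotation replaced by "." for readability. `sample_fields` is a hypothetical helper, and it splits on whitespace only because the pasted line has lost its tabs.

```python
from typing import Dict, List


def sample_fields(record: str, sample_names: List[str]) -> Dict[str, Dict[str, str]]:
    """Map each sample name to a {FORMAT key: value} dict for one VCF record."""
    cols = record.split()       # assumes no whitespace inside any column
    keys = cols[8].split(":")   # FORMAT is the 9th column
    result = {}
    for name, col in zip(sample_names, cols[9:]):
        values = col.split(":")
        if len(values) > len(keys):
            raise ValueError(f"{name}: more values than FORMAT keys")
        result[name] = dict(zip(keys, values))
    return result


# Shortened version of the record from the thread (INFO replaced by ".").
record = "1 10070 . C G . REJECT . GT:AD:DP:FREQ 0:15,0:10:0.00 0/1:2,0:2:0.00"
fields = sample_fields(record, ["Germ", "Tumor"])
for name, values in fields.items():
    print(name, values)
```

For this record both samples carry one value for each of the four FORMAT keys. The only difference is the form of the GT value itself (0 for the normal, 0/1 for the tumor).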

chapmanb commented 8 years ago

Thanks for the additional detail and sorry I wasn't clear in my last response. That's how MuTect output looks; it's not truncated. Both PASS and REJECT records have the same structure. A typical MuTect VCF is mostly REJECT -- for instance, the DREAM synthetic 4 data with MuTect has 20,000 PASS variants and 4.6 million REJECT. If you've searched and there are zero PASS variants, then MuTect is not detecting anything somatic in your input data set, and it might be worth looking more closely at the input. Hope this helps.
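
To check how many PASS and REJECT records a MuTect VCF actually contains, the FILTER column can be tallied directly. This is a minimal sketch over in-memory lines; `filter_counts` is a hypothetical helper and the three records are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def filter_counts(vcf_lines: Iterable[str]) -> Counter:
    """Count the values in the FILTER column (7th column) of tab-separated VCF records."""
    counts: Counter = Counter()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        counts[line.split("\t")[6]] += 1
    return counts


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t10070\t.\tC\tG\t.\tREJECT\t.",
    "1\t20500\t.\tA\tT\t.\tPASS\t.",
    "2\t30110\t.\tG\tA\t.\tREJECT\t.",
]
counts = filter_counts(example)
print(counts)
```

For a real file, pass an open file handle instead of the list, for example `filter_counts(open(path))`, or `gzip.open(path, "rt")` for a compressed VCF.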