bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
986 stars 354 forks source link

muTect: Support single sample mode and / or providing normal_panel vcf (instead of BAM) #273

Closed mjafin closed 10 years ago

mjafin commented 10 years ago

Happy to look into this, but it would be good if muTect supported running in tumour only mode or alternatively taking a vcf file instead of BAM for the normal(s).

For the first case it's sufficient to just ignore -I:normal and for the second case it's possible to use the -normal_panel option. In some cases the normal panel would/could be the same for all tumour samples (e.g. calls for a parental cell line).

lbeltrame commented 10 years ago

I'm guessing that in this case, at least for tumor only mode MuTect could be adjusted to be used like VarScan or FreeBayes, where the presence or not of paired samples can guide one or the other analysis. Brad, what do you think?

mjafin commented 10 years ago

Hi Luca, It might be as simple as conditionally checking if the normal sample exists and if it doesn't, omitting the lines

params += ["-I:normal", normal_bam]
params += ["--normal_sample_name", normal_sample_name]

The other interesting modification would be to allow to use a single pool of of normals (or e.g. parental cell line) BAM file matched to multiple tumor samples. This might be possible by providing multiple batch values for a single sample?

mjafin commented 10 years ago

Further, as it turns, the MuTect tool does not call indels so might be worth incorporating the SomaticIndelDetector

lbeltrame commented 10 years ago

Do you have some documentation at hand? It may be worth adding it to the Mutect bit.

mjafin commented 10 years ago

Hi Luca, Given how much Appistry ask for muTect, I think the documentation they offer is incredibly poor :) I've only got anecdotes and forum posts to offer unfortunately! http://gatkforums.broadinstitute.org/discussion/comment/1992/#Comment_1992 "The Indel Genotyper was renamed the Somatic Indel Detector. It had/has nothing to do with the Unified Genotyper. For somatic SNP calls you should look at MuTect."

In muTect help there is an option for providing a vcf instead of BAM for the normals:

 -normal_panel,--normal_panel <normal_panel>                                              VCF file of sites observed in
                                                                                          normal

The SomaticIndelDetector does not support a 'normal' vcf file and must be provided one or two BAM files.

chapmanb commented 10 years ago

Miika and Luca; Supporting these approaches makes good sense. For the no matched normal case, Miika's suggestion is perfect: we need to have a more lenient check if we are doing cancer (since currently it looks for a batch with tumor/normal) and a conditional if there is a matched normal. For the panel case, we'd want to label the VCF input as the normal, specify it using the syntax for pre-called VCFs (example here: https://github.com/chapmanb/bcbio-nextgen/blob/master/tests/data/automated/run_info-bam.yaml#L8), then have the MuTect wrapper detect and apply the right arguments.

Having indel calls is also a nice improvement and we'd need to enable logic to skip it in the normal-panel-VCF case.

Thanks again for the discussion and looking into this.

lbeltrame commented 10 years ago

I'm wondering if instead of Yet Another Variant Caller Configuration I should put the somatic indel detector inside the MuTect run, given it's the only tool at our disposal for paired analyses which has no indel calling.

The API for is_paired* which I used in FreeBayes (and MuTect needs to be ported still) might need some extra adjustments to handle the case of having only a tumor sample (or perhaps "go paired, fall back to tumor only if normal not present?").

mjafin commented 10 years ago

Brad and Luca, Agree, bundle MuTect and SomaticIndelDetector in the same run. To sum up:

How could we pass a single normal BAM to be used for multiple tumor BAMs? Any ideas? Multiple values in the batch field separated by semicolons or something like that?

mjafin commented 10 years ago

Actually, the panel of normals vcf is good to be assigned to all mutect runs if available (regardless of whether the normal BAM is there or not) as per the mutect paper.

Luca, are you actively working on this? I could have a look today and tomorrow for the first few easy mods.

lbeltrame commented 10 years ago

Luca, are you actively working on this? I could have a look today and tomorrow for the first few easy mods.

Unfortunately I'm a bit swamped with the work my additions to bcbio-nextgen allowed to do. ;) Feel fre to start working on this: I can review your additions as you post them.

If you plan easy modifications, have a look at what I did to Freebayes support to check if samples are paired or not: the API is already there so you could make use of it.

chapmanb commented 10 years ago

Miika and Luca; Thanks for the discussion and getting started on this. The best way to handle normal panels and BAMs that match multiple tumors is not yet implemented, but I agree the approach will be to have a normal and/or panel with multiple batches defined as a list of names in the metadata, as Miika suggested. I'll need to add support for plugging this into the logic that assigns batches. Hopefully waiting for this won't hold up implementing the other improvements. Thanks again.

tanglingfung commented 10 years ago

(sorry to chime in) I wonder if the same idea can be extended to multiple samples variant calling where a panel of normal bams can be used with a single sample. (this may be off topic)

mjafin commented 10 years ago

Hi Paul, Brad, Luca, Paul, muTect does not support using multiple normal BAMs it would seem. The approach they describe in the paper is to call the normals separately (using muTect in the tumor-only mode with the normals given as "tumor" BAMs...), combining the variant calls (for variants that are called in at least two samples) and then passing the panel as a vcf. Did you have something else in mind maybe?

Brad, what would be the syntax to access the "vrn_file"correctly in mutect.py? Or do you want to refactor everything before I start any of this work? I guess I could still do the tumor-only modifications now?

mjafin commented 10 years ago

Bit more brainstorming here.. An alternative to modifying the underlying batch system would be to have multiple calls to mutect within a single call to mutect.py and put the normal and all tumour samples into the same batch. For example, if we had the following bam files all in one batch:

Then the mutect function would call mutect twice, once for normal-tumor1 and the second time for normal-tumor2. Would this be bad practice?

chapmanb commented 10 years ago

Miika; Nice idea. The downside is that this loses parallelization, since you'd have to process all of these serially within a single core assigned to do the variant calling. I think the batch updates would work, I just need to carve out some time to implement and test it.

To get a variation file, you should just need data["vrn_file"] (where data is one of the elements from the items list) since it will be stored in the top level dictionary. However, it looks like this will need some refactoring in variation/genotype:variantcall_sample since we are assuming BAM files feed into variant calling there. I can also look at this if it get too hairy on your side since it could use a better abstraction here.

Sorry this is so messy, since it's a bit different that what we've supported previously. Hope this makes some sense and happy to work on the code to clean it up if that would help support what you're doing.

lbeltrame commented 10 years ago

mutect.py? Or do you want to refactor everything before I start any of this work? I guess I could still do the tumor-only modifications now?

I would suggest going for this as it looks (to me) probably the least invasive solution (and a welcome change).

I'll open a separate issue to track the SomaticIndelDetector support.

mjafin commented 10 years ago

Brad, (Are you up very early or very very late?-) Roger that regarding the batching - losing parallelisation wouldn't be good. I'll hold off until you've finished with the changes to the batching system. With the variation file, as long as it's indexed, it's only an additional parameter to mutect (regardless of whether it's paired or tumour only BAM calling). The file could obviously be subsetted too to match the regions of the input BAM files.

I'll get started with the tumor-only mode and possibly the vcf addition over the weekend. I'll also have a look at Luca's suggestion to streamline mutect the way it's done for FreeBayes paired calling.

Thanks for the helpful discussion!

chapmanb commented 10 years ago

Miika; I'm in Ireland and Scotland for the next week, visiting my wife's family and going to the DebianMed hackathon (packages + docker = good stuff; will be working on having full versioning and manifests for what is on an image which is a key step to move us closer to IPython/docker integration), so will be on normal time for y'all for the next few days.

For the variation file, the main change you want to consider in refactoring is identifying whether it's a standard call (has an non-None value in align_bams) or is a variation file, in which case you'll need to fetch it from the right items[index]["vrn_file"] before feeding in. Otherwise sounds great. Thanks again for doing this and sorry to make y'all wait on the refactoring. Yell if I hold things up too much.

mjafin commented 10 years ago

Brad, Ah, great, you're actually physically very close to our site then (within 500km radius;). Tania H. did mention a few weeks ago something about you guys coming over to the UK/Ireland for a visit. The docker stuff sounds great - we prefer to keep the stable releases (i.e. production versions for us) separate and accessible and if that works in docker all the better.

I'll check for non-None in align_bams and items[index]["vrn_file"] independently and will do some testing over the next few days hopefully. Enjoy Ireland & Scotland and if you're anywhere near Manchester come over for a visit!

mjafin commented 10 years ago

It would seem SomaticIndelDetector is in the muTect .jar from version 1.1.6 onwards - at least I can't see it in 1.1.5. Maybe bump to 1.1.6 at some point?

lbeltrame commented 10 years ago

I have no issues with bumping to 1.1.6, especially since indel calling for MuTect will be very useful. Is there any documentation about the addition to MuTect (although I'm expecting none, given how "well" it was documented)?

chapmanb commented 10 years ago

Miika and Luca; Happy to bump this, although I can't find any 1.1.6 downloads available from GitHub (1.1.5) or Broad (1.1.4). Am I missing a location? I'm only trying to avoid needing to build from source if possible.

mjafin commented 10 years ago

Ah, apologies! I've been using the Appistry releases and they include the SomaticIndelDetector whereas I can now see the free/academic versions do not. I made the assumption that they mirror each others' functionality which they clearly don't!

Edit: I have two bundles from Appistry, one with MuTect 1.1.4 and the other MuTect 1.1.6. Both contain the SID too.

mjafin commented 10 years ago

Brad, I've been working on the single sample support and providing the panel of normals VCF. I implemented the panel as a field in metadata together with checking of tumor-only: https://github.com/mjafin/bcbio-nextgen/commit/5cb3f52d8bd0728accb9b7f8e3c59306648581d9

I haven't made a pull request yet as I wanted your feedback first (or is it easier via a pull request?). I think with the single sample mode some downstream steps should be skipped, such as the Gatk hard filters. Oddly enough, on the small test data (chrM) in single sample mode the ploidyfix step filtered all three variants in the raw MuTect output.

lbeltrame commented 10 years ago

I haven't made a pull request yet as I wanted your feedback first (or is it easier via a pull request?). I think with the single sample mode some

Personally I think that a pull request would be best, so we can also keep track of comments and revisions. Most of my work for paired tumor-normal calls has been done with WIP pull requests.

roryk commented 10 years ago

+1 for pull request

mjafin commented 10 years ago

As you wish: https://github.com/chapmanb/bcbio-nextgen/pull/285

mjafin commented 10 years ago

I got a confirmation that the SomaticIndelDetector in the Appistry bundle is the same as that in Gatk 2.3.9 (deprecated post 2.3.9). Apparently the Broad are working on a new version but don't know about any timelines.

mjafin commented 10 years ago

Closing as per https://github.com/chapmanb/bcbio-nextgen/pull/293 Let's have a think about sometic indel calling in another thread, another time