gavinha / TitanCNA

Analysis of subclonal copy number alterations (CNA) and loss of heterozygosity (LOH) in cancer
GNU General Public License v3.0

Using GATK CNV with TITAN #38

Open fpbarthel opened 6 years ago

fpbarthel commented 6 years ago

Hi!

I'm wondering how to use the new GATK4 copy number segmentation as input to TITAN. I guess it should be possible (see e.g. https://software.broadinstitute.org/gatk/documentation/article?id=11088), but I'm not sure how. I was able to use a recent pipeline to convert its output to a format that mirrors the Allelic CapSeg output, which can be used for, e.g., ABSOLUTE. But I'd like to use TITAN.

Any advice?

Thanks!

gavinha commented 6 years ago

Hi @fpbarthel

Yes, I was working with the team at the Broad to include files that would be compatible with the input format for TITAN. You should be able to use 2 of the output files (gatk4cnv_acnv_titan_het_file_capture and gatk4cnv_acnv_titan_cr_file_capture) as input into TitanCNA. These are the only 2 input files required by TitanCNA when using titanCNA.R: https://github.com/gavinha/TitanCNA/blob/eb288ea1c5a8116bdad3a6588b83fdb6a55548f0/scripts/R_scripts/titanCNA.R#L35-L38
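For reference, here is a minimal sketch of how those two files are passed to the script (the file names are placeholders, and the expected columns are noted in the comments):

    # Sketch only; SAMPLE.het.tsv and SAMPLE.cn.tsv are placeholder file names.
    # --hetFile: tab-separated tumour allele counts at heterozygous SNPs
    #            (chr, position, ref, refCount, Nref, NrefCount)
    # --cnFile:  tab-separated, GC-corrected log2 tumour/normal copy ratio per bin
    #            (chr, start, end, log2_TNratio_corrected)
    Rscript titanCNA.R --id SAMPLE --hetFile SAMPLE.het.tsv --cnFile SAMPLE.cn.tsv \
      --numClusters 1 --numCores 1 --outDir ./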

The current snakemake workflow that is in this GitHub repo does not support this but I may consider adding it in the future.

Hope this helps. Gavin

fpbarthel commented 6 years ago

Hey @gavinha, thanks for your answer! I'm actually not at the Broad, so the pipelines I made do not output those exact files, and I cannot find a Broad WDL file on their GitHub that outputs those files either. However, gatk ModelSegments (link) does output files similar to what you mention.

E.g., XX.hets.tsv contains read counts at het sites and looks like this:

CONTIG  POSITION  REF_COUNT  ALT_COUNT  REF_NUCLEOTIDE  ALT_NUCLEOTIDE
1       1147422   x          y          C               T

And XX.cr.seg looks like a fairly standard copy number segment file with the following columns:

CONTIG START END NUM_POINTS_COPY_RATIO MEAN_LOG2_COPY_RATIO

Would those be appropriate? For tumor samples, the het sites take the matched normal's genotypes into account, but the segments themselves are not necessarily filtered, and germline CNV events are included, from what I understand.
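If those files are usable, I imagine the conversion is little more than reordering columns. A sketch of what I have in mind, assuming TITAN wants chr, position, ref, refCount, Nref, NrefCount for the het file:

    # Sketch: reorder the ModelSegments hets.tsv columns into the TITAN het layout,
    # dropping the SAM-style @ header and the CONTIG column-header line.
    awk 'BEGIN{OFS="\t"} !/^@/ && !/^CONTIG/ { print $1,$2,$5,$3,$6,$4 }' XX.hets.tsv > XX.titan.het.tsv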

gavinha commented 6 years ago

Hi @fpbarthel

Sorry, I hadn't realized that the original link was to Broad-specific workflows.

Here are the instructions for using GATK4-ACNV that should be more useful: https://gatkforums.broadinstitute.org/gatk/discussion/7387/description-and-examples-of-the-steps-in-the-acnv-case-workflow

You will want to run all 3 steps and the final output will include TITAN input files.

Let me know if this works. If so, I will consider including some of these steps in the TITAN snakemake workflow. However, I want to make sure that GC content is handled (as it is in the current snakemake workflow in this repo).
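In case it helps, here is roughly where I'd expect GC correction to enter on the GATK side (tool and flag names from memory, so please double-check them against the docs for your GATK version):

    # Annotate the binned intervals with GC content
    gatk AnnotateIntervals -R ref.fasta -L bins.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY -O bins.annotated.tsv
    # Explicit GC-bias correction during denoising (no panel of normals in this sketch)
    gatk DenoiseReadCounts -I SAMPLE.counts.hdf5 --annotated-intervals bins.annotated.tsv \
        --standardized-copy-ratios SAMPLE.standardizedCR.tsv \
        --denoised-copy-ratios SAMPLE.denoisedCR.tsv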

Best, Gavin

chapmanb commented 6 years ago

Gavin; thanks for this discussion. I'm also interested in using GATK4 inputs to TitanCNA. It looks like the aCNV workflow you link to is considered legacy and was deleted. What's currently in GATK4, according to their WDLs and latest documentation (https://software.broadinstitute.org/gatk/documentation/article?id=11682), is a somewhat different approach and doesn't have anything TITAN-specific.

I'm happy to custom-convert the right GATK outputs into the copy number and allele frequency inputs, but I have a couple of questions about what TITAN expects.

Thanks again for helping support this.

fpbarthel commented 6 years ago

Hi @chapmanb and @gavinha, you may be interested in the (currently experimental/unsupported) GATK CNV post-processing WDL available here (I realized I never linked to this in my original post), which I guess would take care of step 3 in the guide @gavinha shared (link). I'm not exactly sure yet how to convert its output to something that can be used with TITAN, although I got it working for ABSOLUTE as the authors intended.

chapmanb commented 6 years ago

Thanks for this. The Broad CNV workflow WDLs are an incredible resource. They've really been helping me learn the correct workflow and how all the tools fit together.

That post-processing WDL has a pretty extensive Python section modeling AllelicCapSeg output to fit what ABSOLUTE needs. From my understanding, on the TitanCNA side most of these steps happen within TITAN itself, so I hope we can have a simpler conversion process. I'll definitely defer to Gavin, though, on what exactly we need.

That's great that you've got ABSOLUTE working, and I'd be very interested to see how it compares to TitanCNA in your hands once we have the process working. Thanks again.

fpbarthel commented 6 years ago

Looks like you're right, @chapmanb. I wrongly assumed that some of the steps in that GATK post-processing WDL (in particular germline filtering) were important pre-processing for TITAN.

In the GATK4 beta releases there was a tool called NormalizeSomaticReadCounts, which appears to function similarly to the current DenoiseReadCounts (see also the GATK4 beta CNV guide compared to the updated CNV guide).

I did some digging into the beta GATK4 source code, which includes code for TITAN conversion, and from that it looks like the output of NormalizeSomaticReadCounts is sufficient for TITAN conversion (see here). So I think your idea of simply using the DenoiseReadCounts output as TITAN input is correct. This also aligns with the smaller chunked segments that ichorCNA (which is used in the current TITAN snakemake) writes to its correctedDepth.txt output files, e.g.:

chr     start   end     log2_TNratio_corrected
1       20001   21000   0.291683610726054
1       30001   31000   0.19357435202454
1       31001   32000   0.465723996473286
1       32001   33000   0.0834545261015968

Either way, despite the updates in the current GATK4 release compared to the beta, the TITAN conversion tools there should be helpful in figuring out exactly what to do.
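As a quick sanity check (sketch only; file names are placeholders, and I'm going from memory that the denoisedCR.tsv body columns are CONTIG/START/END/LOG2_COPY_RATIO), the two files should carry the same information, just with different headers:

    # GATK denoised copy ratios: SAM-style @ header, then CONTIG START END LOG2_COPY_RATIO
    grep -v '^@' SAMPLE.denoisedCR.tsv | head -3
    # ichorCNA corrected depth: chr start end log2_TNratio_corrected
    head -3 SAMPLE.correctedDepth.txt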

fpbarthel commented 6 years ago

I figured I'd test it out using both files as input and just see what happens:

  1. Convert GATK hets from ModelSegments to TITAN input:
    cat SAMPLE.hets.tsv | awk '/^[^@]/ { print $1,$2,$5,$3,$6,$4 }' | tr ' ' '\t' > tmp.hets.tsv
  2. Convert the chunky seg file from DenoiseReadCounts to TITAN input:
    cat SAMPLE.denoisedCR.tsv | awk -F'\t' 'BEGIN{print"chr\tstart\tend\tlog2_TNratio_corrected"} /^[^@C]/ { print $0 }' > tmp.cr.tsv
  3. Convert the merged seg file from ModelSegments to TITAN input:
    cat SAMPLE.cr.seg | awk 'BEGIN{print"chr\tstart\tend\tlog2_TNratio_corrected"} /^[^@C]/ { print $1,$2,$3,$5 }' | tr ' ' '\t' > tmp2.cr.tsv
  4. Run TitanCNA on both:
    Rscript `which titanCNA.R` --id tmp --hetFile tmp.hets.tsv --cnFile tmp.cr.tsv --numClusters 1 --numCores 1 --normal_0 0.5 --ploidy_0 2 --chrs "c(1:22, \"X\")" --estimatePloidy TRUE --outDir ./ --libdir ./TitanCNA/
    Rscript `which titanCNA.R` --id tmp2 --hetFile tmp.hets.tsv --cnFile tmp2.cr.tsv --numClusters 1 --numCores 1 --normal_0 0.5 --ploidy_0 2 --chrs "c(1:22, \"X\")" --estimatePloidy TRUE --outDir ./ --libdir ./TitanCNA/

Interestingly, TITAN runs without issue on both inputs. However, the output from using the `DenoiseReadCounts` chunky seg file looks a lot better. Compare:

<img width="1328" alt="screen shot 2018-11-10 at 6 24 12 pm" src="https://user-images.githubusercontent.com/9220167/48307198-decf4680-e515-11e8-918e-4bb635556043.png">

<img width="1323" alt="screen shot 2018-11-10 at 6 23 27 pm" src="https://user-images.githubusercontent.com/9220167/48307193-c8c18600-e515-11e8-9a7c-7a9dc64e7b35.png">

@gavinha, your further thoughts on this would be much appreciated. Also, regarding the hets file: I suppose there are several files that could potentially be used here. I'm using the `hets.tsv` output of `ModelSegments`, described as follows (also [link](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.1.0/org_broadinstitute_hellbender_tools_copynumber_ModelSegments.php)):

> (Optional) Allelic-counts file containing the counts at sites genotyped as heterozygous in the case sample (.hets.tsv). This is a tab-separated values (TSV) file with a SAM-style header containing a read group sample name, a sequence dictionary, a row specifying the column headers contained in AllelicCountCollection.AllelicCountTableColumn, and the corresponding entry rows. This is only output if normal allelic counts are provided as input. 

It seems like these genotypes incorporate the normal counts, which I guess is similar to what you are doing with [countPysam.py](https://github.com/gavinha/TitanCNA/blob/7a9fa7c14f196f8f5eb91e6dcaa6a1de85de065d/scripts/snakemake/code/countPysam.py)? One thing that is clearly missing is the quality score, but even the GATK4 beta TITAN conversion script seems to simply leave this blank (see [here](https://github.com/broadinstitute/gatk/blob/4.beta.6/src/main/java/org/broadinstitute/hellbender/tools/exome/conversion/titanconversion/TitanAllelicCountWriter.java#L26)).
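If the blank quality field turns out to matter, appending an empty column is trivial (sketch; I'm assuming TITAN tolerates an empty quality value, as the beta conversion script seems to imply):

    # Append an empty quality column to the converted het file
    awk 'BEGIN{OFS="\t"} { print $0, "" }' tmp.hets.tsv > tmp.hets.qual.tsv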

chapmanb commented 6 years ago

Thanks much for sharing this and for all the detective work resurrecting the old conversion scripts. You're exactly right that these are essentially file conversions of existing outputs; I'd implemented similar approaches to what you did as part of testing this out:

https://github.com/bcbio/bcbio-nextgen/blob/520fad7da7fa46635e6514b8bb6b12b988cc20ac/bcbio/structural/titancna.py#L162

The comparison between the raw denoised log2 values (1000 bp bins) and the segmented bins makes good sense as well, since segmentation is designed to smooth out the noisiness. I've been feeding the denoised 1000 bp inputs to TitanCNA in my initial tests with GATK4 inputs, wanting to give it the extra information present in the smaller bins, but again I will defer to Gavin's expertise.

Glad we're approaching this the same way and making progress. Thanks again.

fpbarthel commented 5 years ago

Happy to close this issue if no further action is required; are you OK with that, @chapmanb?

I feel confident using TITAN with the denoised segments from GATK4 as input, and the hets from ModelSegments.

The two plots in my post above are both CNA.pdf outputs, from using the denoised copy ratios (top) and the larger segments from ModelSegments (bottom) as TITAN input.

In another post, Gavin writes:

> The copy number segments are usually nice to look at but it is more difficult to assess the solutions. The plots that I like to use are the CNA.pdf and LOH.pdf.

To me this suggests that we want the CNA.pdf file to show a lot of data points, as in the top plot. Therefore, we should use the DenoiseReadCounts output rather than the ModelSegments output as TITAN input.

If one uses the ModelSegments output as TITAN input, the CNA.pdf and CNASEG.pdf files look almost identical. This cannot be correct, IMHO.

gavinha commented 5 years ago

Hi @fpbarthel and @chapmanb

Thank you both for looking into this.

> I feel confident using TITAN with the denoised segments from GATK4 as input, and the hets from ModelSegments.

This is what I would recommend as well because it is the input that TitanCNA expects.

> To me this suggests that we want the CNA.pdf file to show a lot of data points, as in the top plot. Therefore, we should use the DenoiseReadCounts output rather than the ModelSegments output as TITAN input.
>
> If one uses the ModelSegments output as TITAN input, the CNA.pdf and CNASEG.pdf files look almost identical. This cannot be correct, IMHO.

Well, it's not necessarily incorrect. Using ModelSegments just means that the heavy lifting of producing segmented (smoothed) data is done by GATK instead of being handled by the TitanCNA HMM, as intended. If users feel more confident using GATK's segmentation prior to TitanCNA's own HMM segmentation, then I think it could still work. There may need to be some tweaking of parameters such as --alphaK, which adjusts the hyperparameters on the prior of the Gaussian variance, since there is much lower variability in the input.
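Just as an illustration (the value here is arbitrary, not a recommendation), that would simply mean adding the flag to the same command used above:

    Rscript `which titanCNA.R` --id tmp2 --hetFile tmp.hets.tsv --cnFile tmp2.cr.tsv \
      --numClusters 1 --numCores 1 --normal_0 0.5 --ploidy_0 2 --chrs "c(1:22, \"X\")" \
      --estimatePloidy TRUE --alphaK 10000 --outDir ./ --libdir ./TitanCNA/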

Best, Gavin

chapmanb commented 5 years ago

Gavin and Floris; thanks much for this, it's great to hear that our intuition about which GATK CNV input files to use worked out. I've also been doing comparisons on three different TCGA cancers with both TitanCNA and PureCN. The GATK CNV approach with denoised copy ratios and hets works well and performs similarly to the CNVkit input that I also have in place. The sample-by-sample comparisons for the different cancer types are here if you're interested:

https://github.com/bcbio/bcbio_validations/tree/master/TCGA-heterogeneity

Thanks again for all the help and discussion, it's great to have this all in place in bcbio with some confidence bounds around the outputs. Much appreciated.

fpbarthel commented 5 years ago

Hi to both,

I just wanted to briefly follow up with a concern about using the GATK workflow directly. To call hets, ModelSegments first identifies het sites in the control sample and then performs a pileup of those sites in the tumor sample. However, both the tumor and normal hets are filtered using a (default) minimum depth threshold of 30x. This results in a proportion of the het sites being discarded due to insufficient coverage.

Looking at some of my results, I am noticing that TITAN has called far fewer CDKN2A homozygous deletions than I am expecting (15% as opposed to 60% out of 600+ samples), and it seems that there are very few hets reported in this region in several samples.
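A quick way to confirm the sparsity (sketch only; REGION_START and REGION_END are placeholders for the CDKN2A coordinates in your genome build, and the contig may be "9" or "chr9" depending on the reference):

    # Count het sites reported by ModelSegments that fall inside the region of interest
    awk -v chr=9 -v start=REGION_START -v end=REGION_END \
        '!/^@/ && $1==chr && $2>=start && $2<=end' SAMPLE.hets.tsv | wc -l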

This has led me to believe that setting a minimum (or maximum?) depth threshold for the tumor sample is perhaps not appropriate. Especially if we assume the same experimental strategy was used to generate the data for the tumor and matched normal, we should expect comparable coverage of a region in both tumor and control UNLESS there is a deletion in the tumor.

@gavinha I would be very interested to hear your thoughts on this.

  1. How well is TITAN able to call homozygous deletions?
  2. Does the lack of "het" data in HOMD regions affect TITAN's ability to make a HOMD call?
  3. I've noticed that TITAN also has a --minDepth parameter, and while its 10x default is much more lenient than the GATK ModelSegments default of 30x, I am wondering what the reason for this parameter is. Isn't the absence of reads at het sites itself telling of a deletion?

Floris

P.S. I've also raised my concern with the GATK team here.

P.P.S. This is related to a similar issue that reported the same for amplifications: https://github.com/gavinha/TitanCNA/issues/50

fpbarthel commented 5 years ago

Hi @gavinha, I wonder if you have any thoughts on this? I've noticed that the CDKN2A region is often missing from my results.

Floris