Open fpbarthel opened 6 years ago
Hi @fpbarthel
Yes, I was working with the team at the Broad to include files that would be compatible with the input format for TITAN. You should be able to use two of the output files (`gatk4cnv_acnv_titan_het_file_capture` and `gatk4cnv_acnv_titan_cr_file_capture`) as input into TitanCNA. These are the only two input files required by TitanCNA when using `titanCNA.R`:
https://github.com/gavinha/TitanCNA/blob/eb288ea1c5a8116bdad3a6588b83fdb6a55548f0/scripts/R_scripts/titanCNA.R#L35-L38
The current snakemake workflow that is in this GitHub repo does not support this but I may consider adding it in the future.
Hope this helps. Gavin
Hey @gavinha, thanks for your answer! I'm actually not at the Broad, so the pipelines I built do not output those exact files, and I cannot find a Broad WDL file on their GitHub that outputs them either. However, the output of GATK `ModelSegments` (link) does include files similar to what you mention. E.g. `XX.hets.tsv` contains read counts at het sites and looks like this:
| CONTIG | POSITION | REF_COUNT | ALT_COUNT | REF_NUCLEOTIDE | ALT_NUCLEOTIDE |
|---|---|---|---|---|---|
| 1 | 1147422 | x | y | C | T |
And `XX.cr.seg` looks like a fairly standard copy number segment file with the following columns:

| CONTIG | START | END | NUM_POINTS_COPY_RATIO | MEAN_LOG2_COPY_RATIO |
|---|---|---|---|---|
Would those be appropriate? For tumor samples, the het sites take into account hets in the matched normal, but the segments themselves are not necessarily filtered, and germline CNV events are included, from what I understand.
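For reference, these GATK TSVs carry a SAM-style `@` header, so a plain TSV reader needs to skip those lines before the column headers. A minimal Python sketch (the counts here are toy values, standing in for the x and y in the table above):

```python
import csv, io

def read_gatk_tsv(lines):
    """Parse a GATK ModelSegments TSV (e.g. .hets.tsv or .cr.seg):
    skip the SAM-style '@' header lines, then treat the first remaining
    line as column headers. Returns a list of dicts, one per row."""
    body = (line for line in lines if not line.startswith("@"))
    return list(csv.DictReader(body, delimiter="\t"))

# Toy example mimicking a .hets.tsv (real files carry a fuller SAM-style header):
example = io.StringIO(
    "@HD\tVN:1.6\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "CONTIG\tPOSITION\tREF_COUNT\tALT_COUNT\tREF_NUCLEOTIDE\tALT_NUCLEOTIDE\n"
    "1\t1147422\t12\t9\tC\tT\n"
)
hets = read_gatk_tsv(example)
print(hets[0]["CONTIG"], hets[0]["POSITION"])  # 1 1147422
```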
Hi @fpbarthel
Sorry I hadn't realized that original link was to Broad-specific workflows.
Here are the instructions for using GATK4-ACNV that should be more useful: https://gatkforums.broadinstitute.org/gatk/discussion/7387/description-and-examples-of-the-steps-in-the-acnv-case-workflow
You will want to run all 3 steps and the final output will include TITAN input files.
Let me know if this works. If so, I will consider including some of these steps in the TITAN snakemake. However, I want to make sure that GC content is handled (like it is in the current snakemake workflow in this repo).
Best, Gavin
Gavin; Thanks for this discussion. I'm also interested in using GATK4 inputs to TitanCNA. It looks like the aCNV workflow you link to is considered legacy and was deleted. What's currently in GATK4, according to their WDL and latest documentation (https://software.broadinstitute.org/gatk/documentation/article?id=11682), is a somewhat different approach and doesn't have anything TITAN-specific.
I'm happy to custom convert the right outputs of GATK into the copy number and allele frequency inputs but have a couple of questions about what TITAN expects:
The DenoiseReadCounts (https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.10.0/org_broadinstitute_hellbender_tools_copynumber_DenoiseReadCounts.php) tool handles GC correction and normalization against a single sample or panel of normals. Is that sufficient for the GC content handling you mention?
ModelSegments (https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.10.0/org_broadinstitute_hellbender_tools_copynumber_ModelSegments.php) performs segmentation of the normalized input from DenoiseReadCounts, producing larger segments with nearby similar depth/freqs merged. Does TitanCNA prefer this file, or the more chunked log2 input from the DenoiseReadCounts step? I'm not sure which level of processed data is more appropriate.
Thanks again for helping support this.
Hi @chapmanb and @gavinha, you may be interested in the (currently experimental/unsupported) GATK CNV post-processing WDL available here (I realized I never linked to this in my original post), which I guess would take care of step 3 in the guide @gavinha shared (link). I'm not exactly sure yet how to convert its output to something that can be used with TITAN, although I got it working for ABSOLUTE as the authors intended.
Thanks for this. The Broad CNV workflow WDLs are an incredible resource. They've really been helping me learn the correct workflow and how all the tools fit together.
That post-processing WDL has a pretty extensive Python section remodeling AllelicCapSeg output to fit what ABSOLUTE needs. From my understanding, on the TitanCNA side most of these steps happen within TITAN itself, so I hope we can have a simpler conversion process. I'll definitely defer to Gavin, though, on exactly what we need.
That's great you've got ABSOLUTE working and I'd have a lot of interest in seeing how it compares to TitanCNA in your hands once we have the process working. Thanks again.
Looks like you're right, @chapmanb. I guess I wrongly assumed that some of that GATK post-processing WDL (in particular the germline filtering) was important pre-processing for TITAN.
In the GATK4 beta releases there was a tool called `NormalizeSomaticReadCounts`, which appears to function similarly to the current `DenoiseReadCounts` (see also the GATK 4 beta CNV guide compared to the updated CNV guide).
I did some digging into the beta GATK 4 source code, which includes code for TITAN conversion, and from that it looks like the output of `NormalizeSomaticReadCounts` is sufficient for TITAN conversion (see here). So I think your idea of simply using `DenoiseReadCounts` output as TITAN input is correct. This also aligns with the small binned segments that ichorCNA (which is used in the current TITAN snakemake) writes to its `correctedDepth.txt` output files, e.g.:
```
chr start end log2_TNratio_corrected
1 20001 21000 0.291683610726054
1 30001 31000 0.19357435202454
1 31001 32000 0.465723996473286
1 32001 33000 0.0834545261015968
```
Either way, despite the updates to GATK4 in the current release compared to the beta, the TITAN conversion tools here should be helpful in figuring out exactly what to do.
I figured I'd test it out using both files as input and just see what happens:
`ModelSegments` hets to TITAN input:

```shell
cat SAMPLE.hets.tsv | awk '/^[^@]/ { print $1,$2,$5,$3,$6,$4 }' | tr ' ' '\t' > tmp.hets.tsv
```
`DenoiseReadCounts` to TITAN input (note the quoting on `-F'\t'`; unquoted, the shell would pass a literal `t`):

```shell
cat SAMPLE.denoisedCR.tsv | awk -F'\t' 'BEGIN{print"chr\tstart\tend\tlog2_TNratio_corrected"} /^[^@C]/ { print $0 }' > tmp.cr.tsv
```
`ModelSegments` segments to TITAN input:

```shell
cat SAMPLE.cr.seg | awk 'BEGIN{print"chr\tstart\tend\tlog2_TNratio_corrected"} /^[^@C]/ { print $1,$2,$3,$5 }' | tr ' ' '\t' > tmp2.cr.tsv
```
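These one-liners amount to dropping the SAM-style header, renaming the header row, and selecting/reordering columns. A hedged Python equivalent of the `.cr.seg` conversion, under the same column assumptions as the awk above (note the awk's `[^@C]` pattern drops the `CONTIG` header line as a side effect; this version skips it explicitly):

```python
def cr_seg_to_titan(in_lines):
    """Convert GATK ModelSegments .cr.seg lines to TITAN's copy-number
    input columns: chr, start, end, log2_TNratio_corrected (tab-separated).
    Skips SAM-style '@' lines and the original column-header line."""
    out = ["chr\tstart\tend\tlog2_TNratio_corrected"]
    for line in in_lines:
        line = line.rstrip("\n")
        if line.startswith("@") or line.startswith("CONTIG"):
            continue
        contig, start, end, _npoints, mean_log2 = line.split("\t")
        out.append(f"{contig}\t{start}\t{end}\t{mean_log2}")
    return out

seg_lines = [
    "@HD\tVN:1.6\n",
    "CONTIG\tSTART\tEND\tNUM_POINTS_COPY_RATIO\tMEAN_LOG2_COPY_RATIO\n",
    "1\t20001\t21000\t1\t0.2917\n",
]
print("\n".join(cr_seg_to_titan(seg_lines)))
```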
```shell
Rscript `which titanCNA.R` --id tmp --hetFile tmp.hets.tsv --cnFile tmp.cr.tsv --numClusters 1 --numCores 1 --normal_0 0.5 --ploidy_0 2 --chrs "c(1:22, \"X\")" --estimatePloidy TRUE --outDir ./ --libdir ./TitanCNA/
Rscript `which titanCNA.R` --id tmp2 --hetFile tmp.hets.tsv --cnFile tmp2.cr.tsv --numClusters 1 --numCores 1 --normal_0 0.5 --ploidy_0 2 --chrs "c(1:22, \"X\")" --estimatePloidy TRUE --outDir ./ --libdir ./TitanCNA/
```
Interestingly, TITAN runs without issue for both inputs. However, the output from using the binned `DenoiseReadCounts` file looks a lot better. Compare:
<img width="1328" alt="screen shot 2018-11-10 at 6 24 12 pm" src="https://user-images.githubusercontent.com/9220167/48307198-decf4680-e515-11e8-918e-4bb635556043.png">
<img width="1323" alt="screen shot 2018-11-10 at 6 23 27 pm" src="https://user-images.githubusercontent.com/9220167/48307193-c8c18600-e515-11e8-9a7c-7a9dc64e7b35.png">
@gavinha, your further thoughts on this would be much appreciated. Also, regarding the hets file, I suppose there are several files that could potentially be used here. I'm using the `hets.tsv` output of `ModelSegments`, described as follows (also [link](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.1.0/org_broadinstitute_hellbender_tools_copynumber_ModelSegments.php)):
> (Optional) Allelic-counts file containing the counts at sites genotyped as heterozygous in the case sample (.hets.tsv). This is a tab-separated values (TSV) file with a SAM-style header containing a read group sample name, a sequence dictionary, a row specifying the column headers contained in AllelicCountCollection.AllelicCountTableColumn, and the corresponding entry rows. This is only output if normal allelic counts are provided as input.
It seems like these genotypes somehow incorporate the normal counts, which I guess is similar to what you are doing with [countPysam.py](https://github.com/gavinha/TitanCNA/blob/7a9fa7c14f196f8f5eb91e6dcaa6a1de85de065d/scripts/snakemake/code/countPysam.py)? One thing that is clearly missing is the quality score, but even the GATK4 beta TITAN conversion script seems to simply leave this blank (see [here](https://github.com/broadinstitute/gatk/blob/4.beta.6/src/main/java/org/broadinstitute/hellbender/tools/exome/conversion/titanconversion/TitanAllelicCountWriter.java#L26)).
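The hets conversion is likewise just a column reorder. A Python sketch mirroring the awk one-liner earlier in the thread (the `$1,$2,$5,$3,$6,$4` selection); like the GATK beta converter, it emits no quality column:

```python
def hets_to_titan(in_lines):
    """Mirror of the hets awk one-liner: drop SAM-style '@' lines and
    reorder .hets.tsv columns (CONTIG, POSITION, REF_COUNT, ALT_COUNT,
    REF_NUCLEOTIDE, ALT_NUCLEOTIDE) into: chr, position, ref base,
    ref count, non-ref base, non-ref count. The header row is reordered
    the same way; no quality column is emitted."""
    out = []
    for line in in_lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        out.append("\t".join([f[0], f[1], f[4], f[2], f[5], f[3]]))
    return out

row = hets_to_titan(["@HD\tVN:1.6\n", "1\t1147422\t12\t9\tC\tT\n"])[0]
print(row)  # 1  1147422  C  12  T  9 (tab-separated)
```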
Thanks much for sharing this and for all the detective work resurrecting the old conversion scripts. You're exactly right that these are essentially file conversions of existing files, and I'd implemented similar approaches as part of testing this out.
The raw denoised log2 values (1000bp bins) versus segmentation bins comparison makes good sense as well since segmentation is designed to smooth out the noisiness. I've been feeding the denoised 1000bp inputs to TitanCNA in my initial tests with GATK4 inputs, wanting to give it the extra information present in the smaller bins but again will defer to Gavin's expertise.
Glad we're approaching this the same way and making progress. Thanks again.
Happy to close this issue if no further action is required; are you OK with that, @chapmanb?
I feel confident using TITAN with the denoised segments from GATK4 as input, and the hets from `ModelSegments`.
The two plots in my post above are both `CNA.pdf`, from using the denoised segments (top) and the larger segments from `ModelSegments` (bottom) as TITAN input.
In another post, Gavin writes:

> The copy number segments are usually nice to look at but it is more difficult to assess the solutions. The plots that I like to use are the CNA.pdf and LOH.pdf.
To me this suggests that we want the CNA.pdf file to show a lot of data points, as in the top plot. Therefore, we should use the `DenoiseReadCounts` output rather than the `ModelSegments` output as TITAN input.

If one uses the `ModelSegments` output as TITAN input, the `CNA.pdf` and `CNASEG.pdf` files look almost identical. This cannot be correct, imho.
Hi @fpbarthel and @chapmanb
Thank you both for looking into this.
> I feel confident using TITAN with the denoised segments from GATK4 as input, and the hets from `ModelSegments`.
This is what I would recommend as well because it is the input that TitanCNA expects.
> To me it suggests that we want the CNA.pdf file to show a lot of data points, as in the top plot. Therefore, we should use the `DenoiseReadCounts` output rather than the `ModelSegments` output as TITAN input. If one uses the `ModelSegments` output as TITAN input, the `CNA.pdf` and `CNASEG.pdf` files look almost identical. This cannot be correct imho.
Well, it's not necessarily incorrect. Using `ModelSegments` just means that the heavy lifting of producing segmented (smoothed) data is done by GATK instead of being handled by the TitanCNA HMM, as intended. If users feel more confident using GATK's segmentation prior to TitanCNA's own HMM segmentation, then I think it could still work. There may be some tweaking of parameters such as `--alphaK`, which adjusts the hyperparameters on the prior of the Gaussian variance, since there is much lower variability in the input.
Best, Gavin
Gavin and Floris; Thanks much for this, it's great to hear that our intuition about which GATK CNV input files to use worked out. I've also been doing comparisons using three different TCGA cancers with both TitanCNA and PureCN. The GATK CNV approach with denoised segments and hets works well and performs similarly to CNVkit input that I've also got in place. The sample by sample comparisons for the different cancer types are here if you have interest:
https://github.com/bcbio/bcbio_validations/tree/master/TCGA-heterogeneity
Thanks again for all the help and discussion, it's great to have this all in place in bcbio with some confidence bounds around the outputs. Much appreciated.
Hi to both,
I just wanted to briefly follow up with a concern about using the GATK workflow directly. To calculate hets, `ModelSegments` first calls het sites in the control sample and then performs a pileup of those sites in the tumor sample. However, both tumor and normal hets are filtered using a (default) minimum depth threshold of 30x. This results in a proportion of the het sites being discarded due to insufficient coverage.
Looking at some of my results, I am noticing that TITAN has called far fewer CDKN2A homozygous deletions than I am expecting (15% as opposed to 60% of 600+ samples), and it seems that very few hets are reported in this region in several samples.
This has led me to believe that perhaps setting a minimum (or maximum?) depth threshold for the tumor sample is not appropriate. Especially if we assume the same experimental strategy was used to generate the data for the tumor and matched normal, we should expect equal coverage of a region in both tumor and control UNLESS there is a deletion in the tumor.
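The concern can be made concrete with a toy depth filter (30x mirrors the ModelSegments default; the site records and depths below are made up for illustration):

```python
def filter_hets(sites, min_depth=30):
    """Keep het sites whose total depth (ref + alt counts) meets min_depth,
    mimicking a minimum-depth filter like the ModelSegments default of 30x."""
    return [s for s in sites if s["ref"] + s["alt"] >= min_depth]

# Toy sites: a normal-coverage region, and a homozygously deleted region
# where tumor depth collapses to a handful of reads.
normal_region = [{"pos": 1000, "ref": 20, "alt": 18},
                 {"pos": 2000, "ref": 25, "alt": 22}]
deleted_region = [{"pos": 9000, "ref": 2, "alt": 1},
                  {"pos": 9500, "ref": 0, "alt": 3}]

print(len(filter_hets(normal_region)))   # 2 -- both survive (depths 38 and 47)
print(len(filter_hets(deleted_region)))  # 0 -- the deletion's evidence is discarded
```

This is exactly the failure mode described above: the sites whose low depth is itself the signal of a deletion are the ones removed before TITAN ever sees them.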
@gavinha I would be very interested to hear your thoughts on this.
TITAN's workflow also has a `--minDepth` parameter, and while its default of 10x is much more generous than the GATK `ModelSegments` default of 30x, I am wondering what the reason for this parameter is. Is the absence of reads at het sites not itself indicative of a deletion?

Floris
P.S. I've also raised my concern to the GATK team here.
P.P.S. This is related to a similar issue which reported the same for amplifications: https://github.com/gavinha/TitanCNA/issues/50
Hi @gavinha, I wonder if you have any thoughts on this? I've noticed that the `CDKN2A` region is often missing from my results.
Floris
Hi!
I'm wondering how to use the new GATK4 copy number segmentation as input to TITAN. I guess it should be possible (see e.g. https://software.broadinstitute.org/gatk/documentation/article?id=11088), but I'm not sure how. I was able to use a recent pipeline to convert its output to a format that mirrors the AllelicCapSeg output, which can be used for e.g. ABSOLUTE. But I'd like to use TITAN.
Any advice?
Thanks!