RFC: Implement DNA Damage and Response indicators as bcbio QC outputs

schelhorn commented 7 years ago

I was wondering if anyone else would be interested in adding DNA Damage and Response (DDR) indicators as (perhaps optional) bcbio QC outputs. These indicators are of interest to pharma R&D for identifying samples with deficits in specific DNA repair pathways (there are several). The literature currently describes about eight to ten ways to derive such indicators, all of which involve postprocessing VCFs of somatic SNVs, CNVs, or SVs (plus LOH, although that can be viewed as a mixture of SNV and CNV signals). So all the required inputs are already generated by bcbio.

Most methods for deriving these indicators require counting classes of variation events over certain genomic intervals and reporting them in predefined sub-categories (often called "signatures") as a set of aggregated counts per sample. For SNVs, that usually results in up to 96 different signatures, for SVs in about 32, and for CNVs in a handful more.

So from the standpoint of implementation, deriving these indicators would mean walking over all [SNV, SV, CNV] VCFs once in a single Python process and filling a matrix of dimension ~[NRBATCHES]*140 in memory with integer counts. This matrix could then be added to MultiQC and/or written to disk as a separate tab-delimited output file. Users would then load this matrix into their statistical environment of choice and apply NNMF or clustering to derive the affected DDR pathways (there are existing implementations in R for that). However, even comparing the counts directly can be insightful, similar to the DKFZ DNA damage code that is already in bcbio for more general QC purposes.
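To make the counting step concrete, here is a minimal sketch of filling the SNV part of such a matrix, assuming the commonly used 96-channel layout (six pyrimidine-centred substitutions times the four possible 5' and 3' neighbouring bases). The function names `snv_channel` and `count_matrix` and the input format (a trinucleotide reference context plus the alternate base per SNV) are hypothetical and not part of bcbio; in practice the context would have to be looked up in the reference genome while iterating over the VCF.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# 6 pyrimidine-centred substitutions x 4 5' bases x 4 3' bases = 96 channels.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
CHANNELS = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product("ACGT", repeat=2)
]


def revcomp(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def snv_channel(context, alt):
    """Map one SNV to its channel label, or None if it cannot be classified.

    context: 3-base reference sequence centred on the mutated position
    alt:     the alternate base
    (hypothetical input format, chosen for illustration)
    """
    context, alt = context.upper(), alt.upper()
    if (len(context) != 3 or len(alt) != 1
            or set(context + alt) - set("ACGT") or context[1] == alt):
        return None
    # Report purine references on the opposite strand so that the
    # mutated reference base is always C or T.
    if context[1] in "AG":
        context, alt = revcomp(context), revcomp(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def count_matrix(snvs_by_sample):
    """Build {sample: [96 integer counts]} in CHANNELS order."""
    matrix = {}
    for sample, snvs in snvs_by_sample.items():
        labels = (snv_channel(ctx, alt) for ctx, alt in snvs)
        counts = Counter(label for label in labels if label)
        matrix[sample] = [counts[ch] for ch in CHANNELS]
    return matrix


if __name__ == "__main__":
    demo = {"sample1": [("ACG", "T"), ("CGT", "A"), ("ANG", "T")]}
    row = count_matrix(demo)["sample1"]
    # Print only the non-zero channels as tab-delimited lines.
    for channel, n in zip(CHANNELS, row):
        if n:
            print(f"{channel}\t{n}")
```

The SV and CNV categories would just be additional columns appended to each row, so the per-sample rows can be written out as one tab-delimited table.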

We already have implementations for some of the DDR indicators in R and would probably port them to Python so they can be integrated directly into bcbio. Is anyone else from the user side interested in doing that as well, @mjafin perhaps? Brad, how do you feel about such forays into deriving genomic markers, is that in scope for bcbio at all?

mjafin commented 7 years ago

Hi Sven-Eric, With our heavy focus on DDR this is obviously something we've been doing in our post-processing, parts of which could be considered the "secret sauce". If there are some IP-free approaches to predicting DDR indications, that'd be interesting to add to bcbio.

schelhorn commented 7 years ago

Alright, I see :) May I ask if you are doing that directly in bcbio or in an external step? Depending on your interests we could collaborate on a general approach for accessing and iterating over VCFs in bcbio and generating counts for defined classes of events, but leave it to each organization to define the exact rules for counting and aggregating in private. If you prefer to do your processing externally to bcbio, though (as we are doing at the moment), then that wouldn't be viable, of course. There are a couple of IP-free approaches in the literature.

mjafin commented 7 years ago

External steps. Looking at actionable variants and losses in DDR genes and also having goes at DDR scores. There are some interesting approaches also coming out of the Sanger like the recent HRDetect algorithm but they're under IP I believe

schelhorn commented 7 years ago

I see; we're doing that as well and are mostly interested in re-implementing the newer DDR indicators in a manner that does not conflict with existing IP. Generating counts from VCF files should be safe in any event, though - I don't think that's a patentable invention in itself, regardless of the purpose. HRDetect has an application underway, but even if it is granted, that will probably only happen in combination with their (overfit) classifier to actually select patients for therapies. I'm only interested in counting.

mjafin commented 7 years ago

Sounds like something that would be useful for everyone.

We're looking to implement very basic versions of chromosomal instability number (CIN) and tumour mutational burden in bcbio reports shortly and also MSI status in the slightly longer run. Not directly DDR related but might be helpful nevertheless. All these metrics quickly get confounded by sample quality (differences in depth, input material (FF, FFPE, plasma)) so might end up being fancy QC indicators :) I can see the same happening for the fancier DDR scores if the input material isn't always of high and consistent quality.

schelhorn commented 7 years ago

Alright, that seems to be a great start :) If you'd be willing to push some of these QC measures to main that would be really helpful. We'll start to do the same for some of ours that are generally useful, including some DDR signatures, but that will take some weeks I'd imagine.

etal commented 7 years ago

We've seen good results from MSIsensor for detecting microsatellite instability. It runs on a pair of BAM files, rather than VCFs, and in our validation performed at least as well as the existing lab test, a Promega kit. It's quick once a genome-specific index file is built, which itself takes some time initially.

To build the index for our pipeline I used these parameters:

msisensor scan -o msi.hg19.list -d hg19.fa -l 10 -s 2 -r 7

After the program is run, one of the output files lists the percentage of microsatellite sites that tested as unstable. A reasonable cutoff for calling MSI-high in this table is 30%, and below 10% is MSS. Literature varies a bit in the definitions of MSI-high and the perhaps illusory MSI-low status, and is mostly in reference to colorectal cancer based on kits like Promega with a very small number of microsatellite targets.
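As a small illustration of applying those cutoffs, here is a sketch that turns the reported percentage of unstable sites into a status label. The function name `classify_msi`, the "indeterminate" label for the range in between, and treating exactly 30% as MSI-high are assumptions made for this example; they do not come from MSIsensor.

```python
def classify_msi(percent_unstable, high_cutoff=30.0, stable_cutoff=10.0):
    """Label a sample from its percentage of unstable microsatellite sites.

    Cutoffs follow the rule of thumb above (30% and 10%); how the
    boundary values and the middle range are handled is an assumption.
    """
    if not 0.0 <= percent_unstable <= 100.0:
        raise ValueError("percent_unstable must be between 0 and 100")
    if percent_unstable >= high_cutoff:
        return "MSI-high"
    if percent_unstable < stable_cutoff:
        return "MSS"
    # Between the two cutoffs: no confident call either way.
    return "indeterminate"


if __name__ == "__main__":
    for pct in (2.5, 18.0, 45.0):
        print(pct, classify_msi(pct))
```

Keeping the cutoffs as parameters makes it easy to adjust them per tumor type or panel, since the literature definitions vary.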

We haven't had much luck detecting mutational signatures from target panel sequencing of individual cases, so I'm interested to see solutions/ideas here.

chapmanb commented 7 years ago

Sven-Eric, Miika and Eric; Thanks for this discussion. It would be great to have more of these metrics and QC if we can identify practical open source implementations. This is a big space with a lot of potential options. From my perspective the challenge is prioritizing the useful metrics and then starting to tackle them. I'd love to see the implementations be included in bcbio but depending on the amount of algorithmic work it might be a separate program that bcbio calls so folks can use and develop it outside of this framework. Thanks again for these great discussions.

ohofmann commented 7 years ago

Very interested in this as well from an academic point of view, particularly wrt patient sequencing and allocation to clinical trials. Right now we rely on somatic signatures and annotated variants / pathways only.

mjafin commented 7 years ago

@etal thanks for the tip on MSIsensor. Have you tried it on tumour-only samples and/or targeted panels? We rarely have matched normals.

I've been looking at something very basic, namely variant calling in regions of known MSI.

etal commented 7 years ago

@mjafin We've been using it on a 500-gene target panel, where it does a meaningful test on ~1200 microsatellite sites with sufficient coverage in a typical sample. It would probably be fine on smaller panels, too. (For comparison, lab kits like Promega use PCR to test 5-20 sites.)

But it does require a matched T/N pair, and I don't see an easy way around that. At each microsatellite site, i.e. sufficiently long mono- or dinucleotide repeat in the reference genome, MSIsensor counts the repeat lengths in each of the mapped reads at the site in the tumor and normal samples, then compares the two distributions with a simple chi-squared test. The problem is that the normal repeat lengths at each microsatellite are unique to each patient. It might work well enough to pool a bunch of normal samples prepped with the same library protocol and downsample to a reasonable average depth, but I haven't tried that myself, and the authors don't describe anything like that in the paper or manual.
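For illustration, the per-site comparison described above can be sketched as a Pearson chi-squared test on a 2 x k table of repeat-length counts (tumor vs. normal). This is not MSIsensor's actual code; the function name `repeat_length_chi2` and the dict-based input are hypothetical, and only the statistic and degrees of freedom are computed so the example needs no third-party packages.

```python
def repeat_length_chi2(tumor_counts, normal_counts):
    """Pearson chi-squared statistic for two repeat-length histograms.

    tumor_counts / normal_counts: {repeat_length: number_of_reads}
    Returns (statistic, degrees_of_freedom).
    """
    # Use only repeat lengths observed in at least one sample.
    lengths = sorted(
        length
        for length in set(tumor_counts) | set(normal_counts)
        if tumor_counts.get(length, 0) + normal_counts.get(length, 0) > 0
    )
    rows = [
        [tumor_counts.get(length, 0) for length in lengths],
        [normal_counts.get(length, 0) for length in lengths],
    ]
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)
    if 0 in row_totals:
        raise ValueError("both samples need at least one read at the site")

    statistic = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Expected count if tumor and normal share one distribution.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # (rows - 1) * (columns - 1) with two rows.
    return statistic, len(lengths) - 1


if __name__ == "__main__":
    tumor = {10: 10, 11: 30}
    normal = {10: 30, 11: 10}
    print(repeat_length_chi2(tumor, normal))
```

If SciPy is available, a p-value could be obtained from the result with `scipy.stats.chi2.sf(statistic, dof)`. Pooling normals, as speculated above, would amount to summing several normal histograms before the comparison, with the caveat already mentioned that normal repeat lengths are patient-specific.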

roryk commented 5 years ago

Thanks, we have implemented some damage filtering in bcbio since this issue was opened, so I'm closing this for now. Please reopen if there is stuff we could be doing better in reporting or accounting for possible damage.

bcbio / bcbio-nextgen

RFC: Implement DNA Damage and Response indicators as bcbio QC outputs #1879