Paired and unpaired statistics on variant calls in multiqc

schelhorn commented 7 years ago

When doing paired variant calling, how can I have both the paired (somatic) and unpaired variant statistics in multiqc? Right now, when specifying germline as well as somatic callers, I get multiqc 'general statistics' and 'bcftools' stats for:

somatic variations under the name of the tumor samples (i.e. not the batch name as one would expect)
no single-sample info about tumor samples (so no way to see if a tumor sample is broken)
but: single-sample info about the normal sample in the batch (as -germline). That's fine, I just need tumor samples here as well.

Any way I can get variant QC info for the tumor samples as well, and batch names instead of tumor-sample-names for the paired results?

chapmanb commented 7 years ago

Sven-Eric -- thanks for starting this discussion. I totally agree that we need to clean up how we represent somatic and germline samples in the MultiQC report. These are great suggestions and we'll work on getting them implemented. Happy to hear thoughts from other people dealing with MultiQC reports as well. @ohofmann and @vladsaveliev, do you have thoughts as well?

ohofmann commented 7 years ago

That's fairly high on the todo list here as well. Will discuss timelines with @vladsaveliev once he makes it to Melbourne ;-)

schelhorn commented 7 years ago

Great catch, @ohofmann :) Looking forward to it.

schelhorn commented 7 years ago

So, in summary I would propose the following fixes:

in multiqc, even if running paired batches, analyze all samples in single-sample mode based on the single-sample bam and vcf results that are already generated by default. Name results of these stats according to the sample name.
if batches are analyzed, add in the multiqc variant call stats the batches as extra records, using the batch name. In multiqc results that show bam-derived statistics in the same row as vcf-derived statistics (like the "General" tabular output), the bam-derived stats have to be empty since there is no single bam alignment for a batch.
if germline calling is desired, treat it like the first item but with the -germline suffix added to the sample name.
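
A rough sketch of the naming I have in mind, in Python. This is only an illustration and not bcbio code: `build_report_rows`, its arguments and the row dictionaries are hypothetical, and the boolean flags just mark which columns would be filled.

```python
def build_report_rows(samples, batches, germline_samples=()):
    """Return MultiQC-style rows named according to the proposal above.

    samples: list of sample names (hypothetical input)
    batches: list of batch names (hypothetical input)
    germline_samples: samples for which germline calling was requested
    """
    rows = []
    # 1. Every sample gets a single-sample row with BAM- and VCF-derived stats.
    for name in samples:
        rows.append({"row": name, "bam_stats": True, "vcf_stats": True})
    # 2. Every batch gets an extra row. There is no single BAM for a batch,
    #    so the BAM-derived columns stay empty.
    for batch in batches:
        rows.append({"row": batch, "bam_stats": False, "vcf_stats": True})
    # 3. Germline calls are treated like item 1, with a -germline suffix.
    for name in germline_samples:
        rows.append({"row": name + "-germline", "bam_stats": True, "vcf_stats": True})
    return rows


rows = build_report_rows(
    samples=["syn3-tumor", "syn3-normal"],
    batches=["syn3"],
    germline_samples=["syn3-normal"],
)
for row in rows:
    print(row["row"])
```

With the syn3 example this lists syn3-tumor, syn3-normal, syn3 and syn3-normal-germline as separate rows.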

vladsavelyev commented 7 years ago

Hi Sven-Eric, Oliver, Brad, sorry that this issue has slipped away from my attention. Cleaning up MultiQC for the paired and germline calling was also in my plans. Sven-Eric, thanks a lot for the comprehensive proposal, I'll stick to it and try to get it done this week.

vladsavelyev commented 7 years ago

One issue is that the paired callers that I know don't output single-sample VCFs - they take 2 bams and output just one VCF. So in order to QC tumor calls before subtracting normals, we need to somehow split the resulting VCF and feed the tumor part into bcftools (and also run it against snpeff one more time in order to get its stats). Splitting would be tricky since the numbers are reported for PASSed variants only, and it's not clear how to distinguish failed tumor calls from germline tumor calls in a batch VCF. And I'm not sure if we want this overhead at all.

It seems to me that the variantcaller: germline setting is made exactly for the cases when single-sample calls are of interest in addition to paired ones. So I would propose to just run germline calling on tumor single samples in addition to normals, so when germlines are requested, we'll have them for all samples, and we'll be able to nicely report them in MultiQC just using sample names without the germline suffix, as in the screenshot:

[screenshot 2017-10-04 at 18 54 02] (syn3 is the batch name, with syn3-tumor as a tumor sample and syn3-normal as a normal sample).

Do you think it makes sense?

lbeltrame commented 7 years ago

+1 from me, since I get asked all the time if the germline variants I found "are also in the tumor sample".

schelhorn commented 7 years ago

@lbeltrame, the tumor-only VCFs should have been generated already - this issue is about the multiqc part only.

chapmanb commented 7 years ago

Vlad, Luca and Sven-Eric; Thanks for all the discussion on this. The thought process is that the germline represents the pre-existing baseline in the pair and somatic calls are the differences from that baseline. So I don't think we need to do extra work germline calling on the tumor. The information is there from both but we do a poor job of representing it in the multiqc report.

A synthesis of what Sven-Eric suggested and Vlad mocked up makes the most sense to me. You really have three different things you want to see:

The alignment and coverage stats for tumor and normal.
The somatic variant calls
The germline variant calls, if specifying germline

My thought had been to have syn3-tumor and syn3-normal as Vlad shows and then attach the somatic calls to the tumor and germline calls to the normal. I'd also include the batch name as part of this so they group together:

syn3: syn3-tumor (somatic) -- tumor align stats + somatic call stats
syn3: syn3-normal (germline) -- normal align stats + germline call stats
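
As a minimal sketch of that labelling (the `row_label` helper is hypothetical, not something that exists in bcbio): prefixing the batch name means a plain sort on the label keeps the members of a pair next to each other.

```python
def row_label(batch, sample, phenotype):
    """Build a report row label of the form '<batch>: <sample> (<call type>)'."""
    # Somatic calls attach to the tumor, germline calls to the normal.
    call_type = {"tumor": "somatic", "normal": "germline"}[phenotype]
    return f"{batch}: {sample} ({call_type})"


labels = sorted([
    row_label("syn3", "syn3-tumor", "tumor"),
    row_label("syn3", "syn3-normal", "normal"),
])
print("\n".join(labels))
```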

What do you think about this approach?

lbeltrame commented 7 years ago

That would work for me, at least, so +1.

vladsavelyev commented 7 years ago

Brad, what you are suggesting is basically what we already have now, except for the naming part. This is how the report looks in the current version: [screenshot 2017-10-05 at 08 52 12]

So your suggestion is basically just to merge syn3-normal and syn3-normal-germline into one line: [screenshot 2017-10-05 at 09 23 37]

That sounds reasonable. However I think what Sven-Eric has requested is the single-sample variant statistics for the tumor samples, not only the batch derived. Correct me if I'm wrong.

vladsavelyev commented 7 years ago

We actually could go beyond the simple bcftools stats, and within bcbio iterate over the batch VCFs and report numbers of somatic calls, germline calls, LOH variants, etc. That will give additional information, however I'm thinking that the current somatic-only stats already give an idea of whether the tumor sample is alright or not.
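
A minimal sketch of such a tally, in plain Python over VCF text lines. It assumes the caller writes a per-record INFO key with the call status (here `STATUS`, which as far as I know VarDict's paired mode emits); other callers encode this differently, so the key name and the `tally_status` helper are assumptions for illustration only.

```python
from collections import Counter


def tally_status(vcf_lines, key="STATUS"):
    """Count VCF records per value of an INFO key (assumed to hold call status)."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
        status = "unknown"
        for entry in info.split(";"):
            if entry.startswith(key + "="):
                status = entry.split("=", 1)[1]
                break
        counts[status] += 1
    return counts


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t50\tPASS\tSTATUS=StrongSomatic;DP=30",
    "chr1\t200\t.\tG\tC\t50\tREJECT\tSTATUS=Germline;DP=40",
    "chr1\t300\t.\tG\tC\t50\tPASS\tDP=40",
]
status_counts = tally_status(example)
```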

schelhorn commented 7 years ago

I'd really like to see tumor-only vcf statistics as well if the tumor calls are available. Sample quality is important to us, and the somatic part can only tell you that something is broken, but not what. Having tumor, germline, and somatic call statistics gives you the full picture.

vladsavelyev commented 7 years ago

Hmm, what kind of statistics for a tumor could we report, given that all we have is a paired VCF file with all tumor germline calls rejected in the FILTER field? Would you propose to change how we count stats with bcftools, and instead of counting only PASS make it count all records regardless of status (for single samples)?
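
To make the difference concrete, here is a small Python sketch that counts records both ways from VCF text lines. It is not how bcbio or bcftools compute their stats, just an illustration; `count_records` is a hypothetical helper and it treats only a literal PASS in the FILTER column as passing.

```python
def count_records(vcf_lines):
    """Count all VCF records and those whose FILTER column is exactly PASS."""
    total = 0
    passed = 0
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        total += 1
        if line.split("\t")[6] == "PASS":  # FILTER is the 7th VCF column
            passed += 1
    return {"all": total, "pass": passed}


paired_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t50\tPASS\t.",
    "chr1\t200\t.\tG\tC\t50\tREJECT\t.",
    "chr1\t300\t.\tC\tA\t12\tREJECT\t.",
]
record_counts = count_records(paired_vcf)
```

In this toy input the rejected records outnumber the passing ones, which is the situation where counting everything would change the picture the most.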

ohofmann commented 7 years ago

That would also introduce a lot of poor quality calls into the report -- unlike the germline calls which are filtered best practice calls for the matched normal. Also not sure what else to report there short of also running dedicated germline callers on the tumour sample?

schelhorn commented 7 years ago

I see your points. We often have questions that can only be answered in single-sample analyses, such as identification of mislabeled samples.

Therefore, ideally I'd like the option to call and QC the tumor samples separately in paired mode as well (similar to the normal samples, which are called separately against the reference in the optional germline mode).

But if the community doesn't think that adds much to the table, or that it is too much implementation work, then I won't make too strong a point of it. In that case we would have to analyze the cohort twice, once in tumor-only and once in paired mode.

vladsavelyev commented 7 years ago

One way to call for tumor samples separately would be to specify a unique batch name for them, e.g. in the example above, it would be:

  metadata:
    batch: [syn3, syn3-tumor-batch]
    phenotype: tumor

Though unfortunately it's not going to end up in MultiQC.

We've been talking with Brad about how we can expose germline calls for tumors in t/n pairs, here is a summary:

What bcbio does at the moment:

For each unpaired sample labelled as tumor, 2 VCFs are generated: somatic and germline with -germline suffix. In MultiQC, only somatics are exposed for each sample (either ensemble or the first caller), germlines are ignored.
For each pair, 1 VCF is generated containing somatic tumor calls, and they are exposed in MultiQC. If "germline" calling is additionally required, germline callers are applied to normals, and the results are also exposed in MultiQC as extra samples named <normal_sample>-germline. Extra samples are also uploaded into final in dedicated folders.

What we want to change:

For each tumor-only sample, also QC germline calls and expose them in MultiQC.
For each pair, extract germline calls from paired VCFs (will be tricky and caller-specific, but hopefully doable), and expose them in MultiQC as well.

It brings up several issues:

Should we account for special cases like a tumor sample with several batches (e.g. with different normals or without normals, like the one above batch: [syn3, syn3-tumor-batch])? I would shelve it for now and just report for the first batch.
Do we want to merge <normal>-germline and <normal> into one row, like in my very first screenshot? In that case, for consistency with the file system folders, it makes sense to also avoid uploading extra -germline folders for normals and merge them there as well. However, germline counts are not really comparable against the somatic counts, so having them interspersed in the same table kind of breaks the idea of a sortable table. Also, as the number of samples grows, the table expands very quickly. Maybe it makes sense to add a separate table for germline call stats.
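
For the first point, "just report for the first batch" could be as simple as the sketch below. The `first_batch` helper is hypothetical and only assumes that the sample metadata is available as a dictionary whose batch entry is either a single name or a list, as in the config snippet above.

```python
def first_batch(metadata):
    """Return the batch to report for a sample: the first one if several are given."""
    batch = metadata.get("batch")
    if isinstance(batch, (list, tuple)):
        # Several batches: report only the first, ignore the rest for now.
        return batch[0] if batch else None
    return batch  # a single batch name, or None if no batch is set


tumor_metadata = {"batch": ["syn3", "syn3-tumor-batch"], "phenotype": "tumor"}
reported = first_batch(tumor_metadata)
```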

roryk commented 5 years ago

Thanks, Vlad has done a bunch of work cleaning up the stats on these variants, and this issue is stale so closing it. Please re-open if there is some specific stuff we could be doing better.

bcbio / bcbio-nextgen

Paired and unpaired statistics on variant calls in multiqc #2081