Closed schelhorn closed 5 years ago
Sven-Eric -- thanks for starting this discussion. I totally agree that we need to clean up how we represent somatic and germline samples in the MultiQC report. These are great suggestions and we'll work on getting them implemented. Happy to hear thoughts from other people dealing with MultiQC reports as well. @ohofmann and @vladsaveliev, do you have thoughts on this?
That's fairly high on the todo list here as well. Will discuss timelines with @vladsaveliev once he makes it to Melbourne ;-)
Great catch, @ohofmann :) Looking forward to it.
So, in summary I would propose the following fixes:
- In `multiqc`, even if running paired batches, analyze all samples in single-sample mode based on the single-sample bam and vcf results that are already generated by default. Name results of these stats according to the sample name.
- In the `multiqc` variant call stats, add the batches as extra records, using the batch name. In `multiqc` results that show bam-derived statistics in the same row as vcf-derived statistics (like the "General" tabular output), the bam-derived stats have to be empty since there is no single bam alignment for a batch.
- [...] `-germline` suffix added to the sample name.

Hi Sven-Eric, Oliver, Brad, sorry that this issue has slipped away from my attention. Cleaning up MultiQC for the paired and germline calling was also in my plans. Sven-Eric, thanks a lot for the comprehensive proposal, I'll stick to it and try to get it done this week.
One issue is that the paired callers that I know don't output single-sample VCFs: they take 2 bams and output just one VCF. So in order to QC tumor calls before subtracting normals, we need to somehow split the resulting VCF and feed the tumor part into bcftools (and also run it against snpeff one more time in order to get its stats). Splitting would be tricky since the numbers are reported for PASS variants only, and it's not clear how to distinguish failed tumor calls from germline tumor calls in a batch VCF. And I'm not sure if we want this overhead at all.
It seems to me that the `variantcaller: germline` setting is exactly made for the cases when single-sample calls are of interest in addition to paired. So I would propose to just run germline calling on tumor single samples in addition to normals, so when germlines are requested, we'll have them for all samples, and we'll be able to nicely report them in MultiQC just using sample names without the `germline` suffix, like on the screenshot:
(`syn3` is the batch name, with `syn3-tumor` as a tumor sample and `syn3-normal` as a normal sample).
Do you think it makes sense?
+1 from me, since I get asked all the time if the germline variants I found "are also in the tumor sample".
@lbeltrame, the tumor-only VCFs should have been generated already - this issue is about the `multiqc` part only.
Vlad, Luca and Sven-Eric; Thanks for all the discussion on this. The thought process is that the germline represents the pre-existing baseline in the pair and somatic calls are the differences from that baseline. So I don't think we need to do extra work germline calling on the tumor. The information is there from both but we do a poor job of representing it in the multiqc report.
A synthesis of what Sven-Eric suggested and Vlad mocked up makes the most sense to me. You really have three different things you want to see:
germline
My thought had been to have syn3-tumor and syn3-normal as Vlad shows and then attach the somatic calls to the tumor and germline calls to the normal. I'd also include the batch name as part of this so they group together:
syn3: syn3-tumor (somatic) -- tumor align stats + somatic call stats
syn3: syn3-normal (germline) -- normal align stats + germline call stats
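A minimal sketch of how such row labels could be built, just to make the naming concrete. The function name `row_label` and the phenotype-to-call-type mapping are assumptions for illustration, not bcbio or MultiQC code:

```python
def row_label(batch, sample, phenotype):
    """Build a report row label such as 'syn3: syn3-tumor (somatic)'.

    Hypothetical helper: tumors carry somatic calls, normals carry germline calls.
    """
    call_type = {"tumor": "somatic", "normal": "germline"}[phenotype]
    return f"{batch}: {sample} ({call_type})"


# Example batch from this thread: (batch, sample, phenotype)
samples = [
    ("syn3", "syn3-tumor", "tumor"),
    ("syn3", "syn3-normal", "normal"),
]
labels = [row_label(*s) for s in samples]
for label in labels:
    print(label)
```

Prefixing the batch name keeps both rows of a pair adjacent when the table is sorted by name.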
What do you think about this approach?
That would work for me, at least, so +1.
Brad, what you are suggesting is basically what we already have now, except for the naming part. This is how the report looks in the current version:
So your suggestion is basically just to merge `syn3-normal` and `syn3-normal-germline` into one line:
That sounds reasonable. However, I think what Sven-Eric has requested is the single-sample variant statistics for the tumor samples, not only the batch-derived ones. Correct me if I'm wrong.
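The merge of `syn3-normal` and `syn3-normal-germline` into one row could be sketched roughly like this. This is only an illustration: the row dictionary layout, the stat names and the numbers are made-up placeholders, and `merge_germline_rows` is a hypothetical helper, not something that exists in bcbio or MultiQC:

```python
def merge_germline_rows(rows, suffix="-germline"):
    """Fold '<sample>-germline' rows into the '<sample>' row.

    `rows` maps a row name to a dict of stats; rows without the suffix are kept as-is.
    """
    merged = {}
    for name, stats in rows.items():
        base = name[: -len(suffix)] if name.endswith(suffix) else name
        merged.setdefault(base, {}).update(stats)
    return merged


# Placeholder values, for illustration only
rows = {
    "syn3-tumor": {"mapped_reads": 1000, "somatic_variants": 12},
    "syn3-normal": {"mapped_reads": 900},
    "syn3-normal-germline": {"germline_variants": 3400},
}
merged = merge_germline_rows(rows)
print(sorted(merged))
```

Keeping the somatic and germline counts under different stat names avoids mixing them in one sortable column.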
We actually could go beyond the simple `bcftools` stats, and within bcbio iterate over the batch VCFs and report numbers of somatic calls, germline calls, LOH variants, etc. That will give additional information, however I'm thinking that the current somatic-only stats already give an idea of whether the tumor sample is alright or not.
I'd really like to see tumor-only vcf statistics as well if the tumor calls are available. Sample quality is important to us, and the somatic part can only tell you that something is broken, but not what. Having tumor, germline, and somatic call statistics gives you the full picture.
Hmm, what kind of statistics for a tumor could we report, given that all we have is a paired VCF file with all tumor germline calls rejected in the FILTER field? Would you propose to change how we count stats with bcftools, and instead of counting only PASS make it count all records regardless of status (for single samples)?
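To make the difference between the two counting modes concrete, here is a small sketch that tallies records per FILTER value from plain VCF text. It is not how bcbio or bcftools computes stats; the helper `count_by_filter` and the `REJECT` filter label in the example are assumptions for illustration:

```python
from collections import Counter


def count_by_filter(vcf_lines):
    """Count VCF records per FILTER value (7th column), skipping header lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts


# Tiny made-up VCF body; REJECT stands in for whatever label the caller uses
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tT\t50\tPASS\t.",
    "1\t200\t.\tG\tC\t50\tREJECT\t.",
    "1\t300\t.\tC\tA\t50\tPASS\t.",
]
counts = count_by_filter(example)
pass_only = counts["PASS"]           # what is counted when only PASS records are used
all_records = sum(counts.values())   # what would be counted regardless of status
print(pass_only, all_records)
```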
That would also introduce a lot of poor quality calls into the report -- unlike the `germline` calls which are filtered best practice calls for the matched normal. Also not sure what else to report there short of also running dedicated germline callers on the tumour sample?
I see your points. We often have questions that can only be answered in single-sample analyses, such as identification of mislabeled samples.
Therefore, ideally I'd like the option to call and QC the tumor samples separately also in paired mode (similar to the normal samples that are separately called against the reference in the optional germline mode).
But if the community doesn't think that brings much to the table, or is too much implementation work, then I won't make too strong a point of it. In that case we would have to analyze the cohort twice, once in tumor-only and once in paired mode.
One way to call for tumor samples separately would be to specify a unique batch name for them, e.g. in the example above, it would be:
metadata:
  batch: [syn3, syn3-tumor-batch]
  phenotype: tumor
Though unfortunately it's not going to end up in MultiQC.
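As a rough illustration of what a list-valued `batch` implies, the sketch below groups samples by every batch they belong to, so the tumor ends up both in the paired batch and in a batch of its own. This is not bcbio's implementation; the helper `samples_by_batch` and the sample dictionaries are hypothetical and only mirror the metadata snippet above:

```python
from collections import defaultdict


def samples_by_batch(samples):
    """Map each batch name to the samples that list it (batch may be a str or a list)."""
    batches = defaultdict(list)
    for sample in samples:
        batch = sample["metadata"]["batch"]
        names = batch if isinstance(batch, list) else [batch]
        for name in names:
            batches[name].append(sample["description"])
    return dict(batches)


samples = [
    {"description": "syn3-tumor",
     "metadata": {"batch": ["syn3", "syn3-tumor-batch"], "phenotype": "tumor"}},
    {"description": "syn3-normal",
     "metadata": {"batch": "syn3", "phenotype": "normal"}},
]
grouped = samples_by_batch(samples)
print(grouped)
```

Here `syn3` holds the tumor/normal pair, while `syn3-tumor-batch` holds the tumor alone, which is what leads to a separate single-sample call.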
We've been talking with Brad about how we can expose germline calls for tumors in t/n pairs; here is a summary:
What bcbio does at the moment:
- [...] `tumor`, 2 VCFs are generated: somatic and germline with the `-germline` suffix. In MultiQC, only somatics are exposed for each sample (either ensemble or the first caller); germlines are ignored.
- [...] `<normal_sample>-germline`. Extra samples are also uploaded into final in dedicated folders.

What we want to change:
It brings up several issues:
- [...] `batch: [syn3, syn3-tumor-batch]`)? I would shelve it for now and just report for the first batch.
- [...] `<normal>-germline` and `<normal>` into one row, like in my very first screenshot? In that case, for consistency with the file system folders, it makes sense to also avoid uploading extra `-germline` folders for normals and merge them there as well. However, germline counts are not really comparable against the somatic counts, so having them interspersed in the same table kind of breaks the idea of a sortable table. Also, as the number of samples grows, the table starts to expand very quickly too. Maybe it makes sense to put another table for germline call stats.

Thanks, Vlad has done a bunch of work cleaning up the stats on these variants, and this issue is stale so closing it. Please re-open if there is some specific stuff we could be doing better.
When doing paired variant calling, how can I have both the paired (somatic) and unpaired variant statistics in multiqc? Right now, when specifying germline as well as somatic callers, I get multiqc 'general statistics' and 'bcftools' stats for:
Any way I can get variant QC info for the tumor samples as well, and batch names instead of tumor-sample-names for the paired results?