comparing two samples in terms of mapping qualities

mortunco commented 5 years ago

This is not an issue but rather asking for advice.

I have two cohorts A and B. Variant calls from A and B are very different in terms of quality. I couldnt find anything on variant calling so i want to compare mapped files (bams).

In the simplest way.

both cohorts have their data on bam forma(mapped on hg19 with bwa).

So I am proccesing them as follows.

bam --> (samtools sort -n | samtools fastq) --> Isaac4 --> StrelkaSomaticVariant Calling.

isaac-align -r sorted-reference.xml -b fastq-dir -o output -t temp-space-path  --base-calls-format fastq-gz -m 700 -j62 --verbosity 0

2019-05-29 11:07:51     [7f16d4d4b7c0]  Forcing LC_ALL to C
2019-05-29 11:07:51     [7f16d4d4b7c0]  Version: Isaac-04.17.06.15
2019-05-29 11:07:51     [7f16d4d4b7c0]  argc: 2 argv: ./isaac-align -v
Isaac-04.17.06.15

So my questions,

1) What should be my measure of comparison? I will definetely read the mapping scoring in detailed but I see lots of MAPQ0 in the sam. Is there an another measure from Isaac that I can use it?

Thank you for your help and time,

Best regards,

Tunc

come-raczy commented 5 years ago

Hi Tunc,

This question is a bit far away from the topics expected to be covered here and unfortunately I don't have a good answer to your question.

As an initial sanity check, I would first do a few things:

verify that the cohorts are expected to be comparable (what instrument, sample type, sample prep, sequencing recipe, mapper version and options, variant caller version and options, etc.)
check if there are noticeable differences in the structure variants and indels (increased numbers in germline might be an indication of initial sample quality)

As a first approximation, I would guess that if the variant calls have very different quality scores between the cohorts (assuming exact same versions of mapper and variant caller and exact same command line options) it is probably because there is a difference in noise (e.g. DNA damage, sequencing errors, mapping errors, etc.) and in signal (e.g. local coverage at each variant locus, lower MAPQ for the reads used for the calls, shorter inserts leading to overlapping reads, etc.). Some of this could potentially be detected from analyzing the BAM file but there isn't any magic bullet, far from it.

The first thing to do is probably to discard all the reads that don't contribute to the variant calls (I believe that MAPQ < 20 is the default cutoff but that needs to be verified). Comparing the coverage of the remaining reads would be a good starting point. It might be better to estimate the local coverage around each variant and see if variations in coverage correlate with variations in variant score calls. It might also be worth checking the fraction of MAPQ 60 reads vs 20-59 MAPQ reads (possibly stratified more finely).

Beyond MAPQ and coverage, it might be worth getting an estimate of the mismatch rates, possibly stratified by MAPQ (MAPQ 60 reads should have a really low mismatch rate that would be almost exactly the total of mismatches from sequencing errors and of mismatches from variants while MAPQ 20 reads would also include mismatches from incorrect mapping positions).

Beyond that, I am afraid that high level analysis of the BAM files won't provide much more help. The next step would be a fine analysis of the CIGAR strings and of the semi-aligned reads but that's probably a research subject by itself. Before going there, I would probably go back to the variants and use a combination of the VCF data and IGV to try to eyeball the root cause of the differences.

Good luck and thanks a lot for using Isaac4.

Come

mortunco commented 5 years ago

thank you very much for your time!

Illumina / Isaac4

comparing two samples in terms of mapping qualities #12