RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License
59 stars 8 forks source link

Mapping accuracy ? #19

Closed karlkashofer closed 2 years ago

karlkashofer commented 2 years ago

Dear Richard !

I try to use Arioc to re-map a BWA mapped WGS BAM File, but i get very weird results. In attached screenshot Arioc BAM is on top, bottom lane is the BWA BAM. I get many regions that show presumably wrongly mapped reads in Arioc. Arioc mapping was done with these settings:

<nongapped seed="ssi84_2_30" />
<gapped seed="hsi20_0_30" maxJ="64" seedDepth="4" Wmxgs="2,6,5,3" Vt="100" />

Any idea what could be wrong ? igv_snapshot

RWilton commented 2 years ago

Hello Karl --

I agree that the coverage looks "weird" but I'm not sure how you infer that anything is "wrongly mapped." More specifically, I see a big pileup of reads in the Arioc lane that isn't present in the BWA lane, so I suppose the question is whether those reads actually have better mappings elsewhere and shouldn't be mapped where they are.

I don't know of a quick way to get at that apart from examining how individual reads are mapped. If you ask the IGV to show you mapping metadata like AS and MAPQ by hovering over those Arioc reads, you should be able to get a sense of whether their mappings are reasonable (i.e., high AS, high MAPQ). We'd still have to find the same reads in BWA's lane, however, in order to be sure what was going on.

If you want to dig into this, I'd like to ask you for enough detailed information to reproduce what you're seeing -- and at that point it might be better to take this discussion offline (richard dot wilton at jhu dot edu) just to avoid cluttering this github conversation. When we figure it out, I will post the resolution here on github.

Just as a side note: I have never seen the point of "remapping" reads from SAM/BAM output. Read aligners use different heuristics, and there are nondeterministic behaviors (e.g., which mapping to report when a read has duplicate high-scoring mappings) that inevitably lead to read-mapping differences that don't really matter in practice. When you use one aligner's mappings as input to another aligner, you're essentially using the first aligner as an input filter for the second one, sort of like wearing two pairs of polarized sunglasses. Whether you're trying to do a head-to-head comparison of two short-read aligners or just re-aligning some reads with different alignment constraints, I'd recommend going back to the original FASTQ input instead of remapping.

Thanks.

karlkashofer commented 2 years ago

You are right. I checked and the bam i have as input contains 100% bwa mapped reads, and i dont have the original fastq. So this is kind of a biased comparison, alas its the only data i have atm. I was thinking its maybe a problem with the reference. In the previous run i used one with only the canonical chromosomes, not the 170 alt/decoy/whatever contigs.

However, even with the full, latest hg38 reference i still have these heaps of reads which do not seem to belong here: Arioc2_igv_snapshot These heaps are very common and scrolling through the genome i see them quite frequently. Need to investigate further, regards, KK

RWilton commented 2 years ago

Hello Karl --

I can think of at least three plausible explanations for what's going on here:

1) BWA's alignment scoring weights differ from the ones used in Bowtie and in Arioc. There might be certain read sequences whose maximum AS is found at different reference-genome locations by BWA and Arioc.

2) Arioc is reporting secondary mappings that are piling up in repetitive regions.

3) For reads that have multiple best mappings, BWA and Arioc "choose" different mappings to report. For example, perhaps Arioc tends to pile them up whereas BWA tends to spread them around or pile them up elsewhere. There is obviously no right way to report such mappings, but it shouldn't matter because such mappings have MAPQ <= 3 and are consequently filtered out downstream anyway.

Again, you might use the IGV to get an idea about what's going on by looking at read metadata. AS will tell us if the read mappings have acceptable similarity (which they probably do since the SNPs and short indels seem consistent). MAPQ will tell us if the reads have multiple high-scoring mappings. And, of course FLAG tells us if these are secondary mappings.

The bottom line is still to pick out a few reads that are mapped differently and see what happened. I wouldn't complicate things with the additional contigs until that's figured out. At that point you can just align the same reads twice -- once to GRCh38 and once to GRCh38+contigs -- and see what happens.

karlkashofer commented 2 years ago

I think you are right, i am comparing a possibly heavily postprocessed bam with raw arioc output. igv_snapshot

This is apples and pears. I close this issue until i have redone the mapping in bwa and arioc from the original fastqs. Sorry for the noise.