NCI-CGR / IlluminaSequencingAnalysis

All Illumina Sequencing Related project from Xin will be recorded in this repo
0 stars 0 forks source link

NovoAlign VS BWA #31

Open lxwgcool opened 3 years ago

lxwgcool commented 3 years ago

This ticket will discuss the differences between NovoAlign and BWA in terms of the reads MAPQ in alternative sequence set.

lxwgcool commented 3 years ago

Original Question

In slide 5

cgr data (top) and USU data (bottom) for same subject. The reads in this region seem to map equally well to the main chr5 as well as an alternate version. Likely due to the difference in how novoalign and bwa handles these situations. All reads in the USU version have MQ=0 on the main chromosome. I’m unclear if this would cause issues with variant calling in this region of chr 5.

NP0551-HE1-CGR-AND-USU.pptx

lxwgcool commented 3 years ago

For BWA-mem

Based on my survey, it will always set MQ=0 to ambiguous mapped reads (equally well). There is no option provided to us to change this behavior (e.g. set the real mapping score in current position (similar as NovoAlign)).

lxwgcool commented 3 years ago

Discuss with Colin to see how NovoAlign handle this case (1st)

Hi Xin,

First, the images are so small and low resolution I can't really see much.

1: May I know if my conclusion is correct “the MQ is always the real mapping quality no matter whether or not “the reads map equally well to other locations in the reference genome” (default setting)” ? MAPQ is the real mapping quality. 2: Would “multi-aligned” (equally-well / non-equally-well) affect the mapping quality score of reads? Yes, by definition. 3: Can we let NovoAlign use the same logic as BWA to treat the mapping quality score for multi-aligned reads (mapped equally well). No, though in theory they are the same. Why would you want Novoalign to mimic BWA when your BLAST results support Novoalign? 4: Do we have the case of “mapping quality=0” in NovoAlign BAM file? If it does, may I know what does “MQ=0” stand for in NovoAlign BAM file? Yes, it means there are more than 4 equally well mapped reads. See below for an explanation.

Regard MAPQ we put a lot of effort both in design of the algorithm and during mapping to make sure we get an accurate MAPQ.

MAPQ is defined as -10log10(Perr).

MAPQ calculation is based on Perr = 1- Pcorrect where Bayesian stats are used to calculate the posterior probability of the alignment using all the alignments found. BWA uses a similar algorithm but as it's mapping algorithm is very different it will find and use a different set of mappings in the calculation.

If we consider some very simple examples..

When only 2 equally good alignments are found either one has Perr=0.5 so MAPQ -10log10(0.5) = 3 3 equally good alignments gives a MAPQ of 2 4 is MAPQ 1 any more than that should get a MAPQ of zero.

MAPQs higher than 3 are never from equally well mapped alignment locations though the difference in how well they are mapped may be small.

Novoaligns alignment score is designed to calculate the -10log10(P) where P is probability that the reference sequence we are mapping to could have been the source of the read. It takes into account mismatches, gaps, base qualities, ambiguous IUPAC codes in the reference and, for pairs, the fragment length. The calculations are way more thorough and complete than in BWA.

The scores from all the alignments found for the read are combined to calculate the posterior probability that the best alignment is correct. The MAPQ is capped at 70.

Now, with regard your example. I think you can see from BLAST that there is only one mapping location. Is that correct? In which case the real question is why has BWA given it a MAPQ of zero?

What MAPQ was novoalign producing? Is it <= 3? Your images don't show this.

If you are using bwa mem then it should output an XA tag with a list of the extra alignment locations. You could check these. Where rae they? Why don't they show up in BLAST?

You can also get novoalign to report multiple alignment locations with the -r & -R options. Try -r All -R40 using a few of the reads from this region.

I think minimap2 can also print multiple mappings so you could check it's MAPQ in this region and whether it finds multiple mappings.

The other thing that I would check is that you are using the same fasta files for BWA reference and foe novoindex. May be just copare the @SQs in the two BAMs.

lxwgcool commented 3 years ago

Discuss with Colin to see how NovoAlign handle this case (2nd)

Just a quick response without yet looking at your ppt.

Novoalign V4 we treat alt_scaffolds in a fairly unique way compared to other read mappers.

This is covered in section 4.3.3 of the manual and also the attached document that details expected differences with BWA.

You can get bwa like behaviour with the option --alt off

We actually think our method is really useful..

The MAPQ on the main chromosomes will be very similar to what they would be if the alternate scaffolds did not exist. This allows variant calling on the main chromosomes to continue to work naturally even in the presence of the alternate scaffolds.

Also, the MAPQs on the alternate scaffolds will have the values they would have if the alternate scaffold was substituted into the main chromosomes. This allows you to call variants on the alternate scaffolds.

There is an additional tag, ZA, that has a MAPQ value calculated from all the mappings and it has values similar to BWAs MAPQ. We use this to do an analysis of which alternate scaffolds are "true" though this doesn't work because the alternate scaffolds in GRCH37 & 38 don't correspond with single haplotypes but are usually mixes.

Best Regards, Colin


Hi Xin,

Just one more point I forgot to mention.

If you compare mapping (BWA) and variant calling with & without the alternate scaffolds, if I remember right, the alternate scaffolds cause about 3% extra low MAPQ reads and hence about 3% of the reference where you won't be able to call variants. This is a pretty big price to pay.

If you are going to go the BWA MAPQ route then you need some way of dealing with sequence that's duplicated between the main chromosomes and the alternate scaffolds or you'll miss any variants in these regions.

Take care, Colin

lxwgcool commented 3 years ago

Discuss with Colin to see how NovoAlign handle this case (3rd)


From Xin

Picture1


From Colin Hi Xin,

Q1. The alignment with the lowest alignment score is considered the best mapping and will have the primary flag set. In your example the mapping to Alt2 is shown with a score of 59 so it's the best mapping. The other mappings which are in the same alt region (or group) are the surrogates and will have primary and supplementary flags set.

All the mappings in your example are considered as alternate mappings. They form a group of alternative mappings for the read.

These terms are ones we try and use so we can illustrate our algorithm.

Each of the mappings you illustrated will have it's MAPQ calculated as though the other mappings did not exist. The MAPQ for the mapping on the main chromosome will not consider the other alternate mappings for the read. It will take into account any mappings outside this alternate region.

If there were no other mappings other than the ones you illustrated all the mappings would have a MAPQ of 70.

It gets more complicated with reads that map in multiple locations (other than to other alts).

If you consider your example, and let the read map twice to alt-scaffold 2, once with a score of 59 and a second time with a score of 60. And the other alternate sequences still only have one mapping.

  1. The mapping with a score of 59 is still the best mapping and will be primary, not supplementary, but will have a low MAPQ (3 or 4) due to the second alignment on that scaffold. which will have a similar low MAPQ, maybe 2.
  2. The other scaffolds which have only 1 mapping will get high MAPQ. This is as if they were the only sequence and mappings in that region of alternates.

Q2. I think above may answer thisbut just a bit more.. Consider this illustration

Capture2222

There is one extra mapping with a similar score outside the alternate region. This mapping will be used in calculating the MAPQ for all the other mappings and each will end up with a MAPQ close to 3. with some slight difference due to the score differences. i.e For each mapping to the alternates we consider the alternate mapping and the extra mapping outside the alternate region when calculating the MAPQ.

The MAPQ for the mapping outside the alternate region would be calculated using its alignment score and only the best alignment score(59) from the group of alternates.

lxwgcool commented 3 years ago

Summarization


From Xin

Dear Colin,

Thanks so much for your comments. Now I understand more about the logic of MAPQ in NovoAlign.

I did a brief summarization below in the general perspective (main idea, assume reads can be mapped back to ref seq perfectly):

1: For the case of unique mapping in "alternative mapping set" (each ref seq only contain 1 mapping): (1) The "Best Mapping" (lowest mapping score) will be marked as "Primary" (2) All "surrogate mappings" will be marked as both "Primary" and "Supplementary" (3) MAPQ will be calculated separately for each ref seq (as though the other mappings did not exist). In other words, all mappings would have a high MAPQ score.

2: For the case of multi-mapping 2.1: If the additional mapping was only located in one of the "Alt Scaffold": (1) The MAPQ in such "Alt Scaffold" (contains 2 mappings) will be changed (e.g. the score may decrease to 3) (2) Other "alternative mapping" (only have 1 mapping in either Alt scaffold or Main Chromosome) will still keep the high MAPQ score.

2.2 If the additional mapping was located in "main chromosome sequence" (as the example you mentioned): (1) The MAPQ in "main chromosome sequence" (contains 2 mappings) will be changed (e.g. the score may decrease to 3) (2) Other ""alternative mappings" in different Alt scaffolds will be changed as well (e.g. decrease to 3). This is because for each "alternative mapping", we also need to consider such extra mapping outside the alternate region when calculating the MAPQ.

Please let me know if my summarization above is correct. Thanks again for your patient and the professional comments.


From Colin

That's a good way of explaining it. The only real thing missing is the case when a second mapping is in a different alternative sequence set. This works similar to 2 but when we are doing the MAPQ for alternative set 1 we use the best alignment from alternative set 2 in the MAPQ calculation and vice versa.

In some of my work I use the ZA score to identify the sub-regions of the alternative scaffolds that are present in the sample. A high ZA score will indicate that the read is mapped uniquely on that scaffold.


From Colin

I've attached another document which is a bit of a review of the various GRCH38 releases and how they've implemented alt_scaffolds. It also looks at which releases have vcf files that include SNPs on the alt_scaffolds. This was done quite a few years ago and hasn't been updated but it still might give you some ideas on what to look for.


Some valuable Doc: Novoalign-BWAMEMAlternateScaffoldMappings.pdf NovoalignReferenceManual.pdf GRCh38 & Alt Scaffolds.pdf NovoAlign_VS_BWA.pptx