Grice-Lab / AlignerBoost

AlignerBoost is a generalized software toolkit for boosting Next-Gen sequencing mapping precision using a Bayesian based mapping quality framework
11 stars 2 forks source link

MQ now most commonly 250 #1

Open colindaven opened 7 years ago

colindaven commented 7 years ago

Hi,

thanks for an interesting approach.

I tried this on a normal BAM which I had aligned with BWA.

I ran alignerboost with

java -Xmx50g -jar ~/NAS01/programs/AlignerBoost/AlignerBoost.jar run filterPE -in xx_R1_s.bam -out xx_R1_s_alignerboost.bam --sort-method coordinate -v

@RG ID:xx_R1 SM:xx_R1.stuff @PG ID:bwa PN:bwa VN:0.7.15-r1142-dirty SM:xx_R1.stuff /home/bioinformatics/NAS01/bioinformatics/seqres/ZR/bwa0X/SBKv7.fasta xx_R1.fastq.gz xx_R2.fastq.gz CL:bwa mem -t 16 -M -R @RG @PG ID:AlignerBoost PN:AlignerBoost VN:v1.6.2 CL:-in xx_R1_s.bam -out xx_R1_s_alignerboost.bam --sort-method coordinate -v

On assessing with qualimap, I found the original BAM had an average MQ of ~46, and the AlignerBoost bam had an average MQ of 246. Most Alignerboost MQs are now 250 as far as I have looked.

Can you please explain this, is this behaviour expected, and are users expected to filter BAMs by custom AlignerBoost BAM tags ? Why not just correct the MQ directly ?

Thanks, Colin

e00011027 commented 7 years ago

Hi Colin,

Thank you for being interested in AlignerBoost. The main difference before and after filtering is actually the "MAX_MAPQ" values pre-defined by both software.

The "MAX_MAPQ" values are required for (nearly) uniquely mapped reads, which should have a phred-scale score of infinite or very large (mapping error ~= 0). However, SAM/BAM specification doesn't allow scores > 255, and score 255 indicates not available. Thus different aligners use different MAX_MAPQ scores to replace the infinity, while conservative tools (such as BWA) uses a smaller MAX_MAPQ (BWA-MEM = 60) due to its empirical nature. AlignerBoost bench-marked (unpublished) some different MAX_MAPQ thresholds, and a much higher MAX_MAPQ still makes difference for results, so we picked 250.

So for a better comparison, you need to first compare the proportion of <60 vs. = 60 for BWA-MEM and < 250 vs. = 250 for AlignerBoost. Than compare the distribution of the remaining. Please let me know your results, than you!

Best Qi

On Thu, Oct 13, 2016 at 9:46 AM, colindaven notifications@github.com wrote:

Hi,

thanks for an interesting approach.

I tried this on a normal BAM which I had aligned with BWA.

I ran alignerboost with

java -Xmx50g -jar ~/NAS01/programs/AlignerBoost/AlignerBoost.jar run filterPE -in xx_R1_s.bam -out xx_R1_s_alignerboost.bam --sort-method coordinate -v

@RG ID:xx_R1 SM:xx_R1.stuff @PG ID:bwa PN:bwa VN:0.7.15-r1142-dirty SM:xx_R1.stuff /home/bioinformatics/NAS01/bioinformatics/seqres/ZR/bwa0X/SBKv7.fasta xx_R1.fastq.gz xx_R2.fastq.gz CL:bwa mem -t 16 -M -R @RG @PG ID:AlignerBoost PN:AlignerBoost VN:v1.6.2 CL:-in xx_R1_s.bam -out xx_R1_s_alignerboost.bam --sort-method coordinate -v

On assessing with qualimap, I found the original BAM had an average MQ of ~46, and the AlignerBoost bam had an average MQ of 246. Most Alignerboost MQs are now 250 as far as I have looked.

Can you please explain this, is this behaviour expected, and are users expected to filter BAMs by custom AlignerBoost BAM tags ? Why not just correct the MQ directly ?

Thanks, Colin

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Grice-Lab/AlignerBoost/issues/1, or mute the thread https://github.com/notifications/unsubscribe-auth/AG-ecUwYtFlNyqj9Howzf0e-_HLa4JLfks5qzjYngaJpZM4KV6BI .

Qi Zheng, Ph.D. Bioinformatician of Dermatology, Perelman School of Medicine,University of Pennsylvania 421 Curie Blvd BRB II/III 1046-7 Philadelphia, PA 19104 http://goo.gl/maps/F4FHR Office: +1 (215)-898-0173

e00011027 commented 7 years ago

Hi Collin,

Forgot to mention another problem. AlignerBoost doesn't help (at all) for paired-end (PE) read mapping aligned by BWA-MEM. To date, BWA-MEM doesn't allow reporting > 1 alignments for PE-mapping, and there is nothing to improve or "boost" by AlignerBoost. Please see the published article for details. You can still run it, but the result won't even change. It is recommended to use the "best practice" pipeline by using the README and config files in examples. For PE-mapping, we recommand bowtie2 & AlignerBoost. for SE-mapping, most aligners are OK.

Best Qi

On Thu, Oct 13, 2016 at 10:12 AM, Zheng Qi e00011027@gmail.com wrote:

Hi Colin,

Thank you for being interested in AlignerBoost. The main difference before and after filtering is actually the "MAX_MAPQ" values pre-defined by both software.

The "MAX_MAPQ" values are required for (nearly) uniquely mapped reads, which should have a phred-scale score of infinite or very large (mapping error ~= 0). However, SAM/BAM specification doesn't allow scores > 255, and score 255 indicates not available. Thus different aligners use different MAX_MAPQ scores to replace the infinity, while conservative tools (such as BWA) uses a smaller MAX_MAPQ (BWA-MEM = 60) due to its empirical nature. AlignerBoost bench-marked (unpublished) some different MAX_MAPQ thresholds, and a much higher MAX_MAPQ still makes difference for results, so we picked 250.

So for a better comparison, you need to first compare the proportion of <60 vs. = 60 for BWA-MEM and < 250 vs. = 250 for AlignerBoost. Than compare the distribution of the remaining. Please let me know your results, than you!

Best Qi

On Thu, Oct 13, 2016 at 9:46 AM, colindaven notifications@github.com wrote:

Hi,

thanks for an interesting approach.

I tried this on a normal BAM which I had aligned with BWA.

I ran alignerboost with

java -Xmx50g -jar ~/NAS01/programs/AlignerBoost/AlignerBoost.jar run filterPE -in xx_R1_s.bam -out xx_R1_s_alignerboost.bam --sort-method coordinate -v

@RG ID:xx_R1 SM:xx_R1.stuff @PG ID:bwa PN:bwa VN:0.7.15-r1142-dirty SM:xx_R1.stuff /home/bioinformatics/NAS01/bioinformatics/seqres/ZR/bwa0X/SBKv7.fasta xx_R1.fastq.gz xx_R2.fastq.gz CL:bwa mem -t 16 -M -R @RG @PG ID:AlignerBoost PN:AlignerBoost VN:v1.6.2 CL:-in xx_R1_s.bam -out xx_R1_s_alignerboost.bam --sort-method coordinate -v

On assessing with qualimap, I found the original BAM had an average MQ of ~46, and the AlignerBoost bam had an average MQ of 246. Most Alignerboost MQs are now 250 as far as I have looked.

Can you please explain this, is this behaviour expected, and are users expected to filter BAMs by custom AlignerBoost BAM tags ? Why not just correct the MQ directly ?

Thanks, Colin

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Grice-Lab/AlignerBoost/issues/1, or mute the thread https://github.com/notifications/unsubscribe-auth/AG-ecUwYtFlNyqj9Howzf0e-_HLa4JLfks5qzjYngaJpZM4KV6BI .

Qi Zheng, Ph.D. Bioinformatician of Dermatology, Perelman School of Medicine,University of Pennsylvania 421 Curie Blvd BRB II/III 1046-7 Philadelphia, PA 19104 http://goo.gl/maps/F4FHR Office: +1 (215)-898-0173

Qi Zheng, Ph.D. Bioinformatician of Dermatology, Perelman School of Medicine,University of Pennsylvania 421 Curie Blvd BRB II/III 1046-7 Philadelphia, PA 19104 http://goo.gl/maps/F4FHR Office: +1 (215)-898-0173

colindaven commented 7 years ago

Thanks for the very clear help and notes.

Ok, bwa mem would have been my main use case, but bowtie2 is important too, i.e. for single ended reads and maybe short reads like miRNA. Another important use case would be RNA-seq (mainly Paired end) mapping with STAR. In my experience RNA-seq mappers don't do a good job with the admittedly very complex job of assigning MQ to read (fragments).

Do you have any comment on RNA-seq correction ?

Thanks, Colin

e00011027 commented 7 years ago

Hi Colin,

You are absolutely correct that current RNA-seq aligners don't do as good (as optimizing their default options) as the DNA-seq aligners. In fact STAR

Basically you can get > 99% precision and > 98% sensitivity if you set your options correctly in the config file. Remember, RNA-seq inserts (cDNA fragments) are often short ( < 100 bp), and thus the trimming step is recommended. Fill the config file with the correct 3'-adapter and 5'-adapter sequences of your kit of choice. Do use the correct adapter sequences (reverse-complement of the primers).

Best Qi

On Fri, Oct 14, 2016 at 7:41 AM, colindaven notifications@github.com wrote:

Thanks for the very clear help and notes.

Ok, bwa mem would have been my main use case, but bowtie2 is important too, i.e. for single ended reads and maybe short reads like miRNA. Another important use case would be RNA-seq (mainly Paired end) mapping with STAR. In my experience RNA-seq mappers don't do a good job with the admittedly very complex job of assigning MQ to read (fragments).

Do you have any comment on RNA-seq correction ?

Thanks, Colin

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Grice-Lab/AlignerBoost/issues/1#issuecomment-253777383, or mute the thread https://github.com/notifications/unsubscribe-auth/AG-ecRofaFlQv3AluaJ56sUF7s_kpK4iks5qz2pVgaJpZM4KV6BI .

Qi Zheng, Ph.D. Bioinformatician of Dermatology, Perelman School of Medicine,University of Pennsylvania 421 Curie Blvd BRB II/III 1046-7 Philadelphia, PA 19104 http://goo.gl/maps/F4FHR Office: +1 (215)-898-0173