bhofmei / jbplugin-smallrna

JBrowse plugin that supports smallRNA alignments
Other
12 stars 1 forks source link

multimapping via MQ ? #2

Closed colindaven closed 8 years ago

colindaven commented 8 years ago

Hi again,

sorry to bother you but this feedback might be interesting - if not pls ignore.

We have used bowtie2 for alignment which does not (or maybe back then did not ) write and NH multimapper tag to SAM.

I wonder if it might make more sense to use the SAM column 5 - MQ - which we generally use to assess alignment mapping quality. I know various aligners can calculate this very differently.

Would it destroy performance to display only MQ>20 or MQ>30 ?

Thanks, Colin

bhofmei commented 8 years ago

Multimapping ended up being a huge pain for various reasons. The only reason I included it in the first place was because annoj did. My lab and PI was used to annoj so I was trying to replicate that functionality.

Bowtie2 normally won’t output any information about multimapping. We had to add the ‘-k’ parameter. At the time, that was the only good solution I could come up with. I didn’t realize that mulimapping information was part of the MAPQ, at least not directly. Based on the definition, MAPQ indicates uniqueness, which is definitely related to multimapping.

It wouldn’t affect performance to include an additional filter for MAPQ as an indicator for multimapping. I’m just wouldn’t know what value to set as the threshold. It would also be possible to add another filter for this. Thoughts?

Brigitte

On Sep 19, 2016, at 11:13 AM, colindaven notifications@github.com wrote:

Hi again,

sorry to bother you but this feedback might be interesting - if not pls ignore.

We have used bowtie2 for alignment which does not (or maybe back then did not ) write and NH multimapper tag to SAM.

I wonder if it might make more sense to use the SAM column 5 - MQ - which we generally use to assess alignment mapping quality. I know various aligners can calculate this very differently.

Would it destroy performance to display only MQ>20 or MQ>30 ?

Thanks, Colin

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bhofmei/jbplugin-smallrna/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/APWCOuhvpk2cYeUjcp-YD4E_OZ7cz3MNks5qrqaTgaJpZM4KAmyk.

colindaven commented 8 years ago

Multimapping is definitely worth filtering out with such short reads.

You're right, probably the most consistent way to deal with multimapping is by the MQ and not by a SAM Tag (at least for those not replicating annoj functionality).

People tend to use MQ 20, 25 or 30 for multimapping filters for SNP calling on well mapped reads or investigating mapping of good transcript reads.

Looking at the SAM spec, MQ is defined as MQ= -10 * log10 Pr{mapping position is wrong}.

Calculating a range of MQ for a few different probabilities that the mapping position is wron, I get the following MQs:

-10(log10(0.0001)) [1] 40 -10(log10(0.001)) [1] 30 -10(log10(0.01)) [1] 20 -10(log10(0.1)) [1] 10 -10(log10(0.5)) [1] 3.0103 -10(log10(1.0)) [1] 0

I would suggest for short reads, using a MQ>= 20 (that is, 0.01 probability that a read mapping position is incorrect) would be appropriate. Generally from what I have seen, reads with another perfect mapping position without mismatches in the genome get a MQ of 0, though this(or maybe just other MQ values) is highly aligner dependent.

Thanks for having a look at this.

Colin

bhofmei commented 8 years ago

Sorry for not responding sooner. I had a few things to get done and needed to talk to my PI. We work in plants, and it’s important to include the multimapping reads as it can be indicitive of function. So, he does not want to use the MAPQ to filter out multimapped reads. The reported MAPQ, as you mentioned, is highly dependent on the program run. I’ve seen both 0 and 255 reported for mutlimapped reads. I think the best option for everyone will be to add another filter in which the user can set the minimum score. Does that work for you?

Brigitte

On Sep 20, 2016, at 3:42 AM, colindaven notifications@github.com wrote:

Multimapping is definitely worth filtering out with such short reads.

You're right, probably the most consistent way to deal with multimapping is by the MQ and not by a SAM Tag (at least for those not replicating annoj functionality).

People tend to use MQ 20, 25 or 30 for multimapping filters for SNP calling on well mapped reads or investigating mapping of good transcript reads.

Looking at the SAM spec, MQ is defined as MQ= -10 * log10 Pr{mapping position is wrong}.

Calculating a range of MQ for a few different probabilities that the mapping position is wron, I get the following MQs:

-10(log10(0.0001)) [1] 40 -10(log10(0.001)) [1] 30 -10(log10(0.01)) [1] 20 -10(log10(0.1)) [1] 10 -10(log10(0.5)) [1] 3.0103 -10(log10(1.0)) [1] 0

I would suggest for short reads, using a MQ>= 20 (that is, 0.01 probability that a read mapping position is incorrect) would be appropriate. Generally from what I have seen, reads with another perfect mapping position without mismatches in the genome get a MQ of 0, though this(or maybe just other MQ values) is highly aligner dependent.

Thanks for having a look at this.

Colin

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bhofmei/jbplugin-smallrna/issues/2#issuecomment-248227837, or mute the thread https://github.com/notifications/unsubscribe-auth/APWCOpsKX_iQ80iEfxIpRAmmMcUVumq6ks5qr451gaJpZM4KAmyk.

colindaven commented 8 years ago

Hi,

thanks again, I think the proposed solution with a user defined filter would be great. We work mainly on plants too and everyone wants and uses MQ (funnily enough!). Anyway, a diversity of opinions is not a bad thing.

255 should not be used for multimapped reads, this would be a major oversight by the aligners programmer, as it is clear in the SAM specs "A value 255 indicates that the mapping quality is not available."

Cheers, Colin

bhofmei commented 8 years ago

I've added the option to filter by quality score. You can filter per track via track drop-down menu or for all visible tracks via button in toolbar. I've also updated the documentation accordingly. I wasn't entirely sure what to do with reads with score 255. Currently, it will show those reads when the minimum quality score is 0 but are hidden for any other minimum. It might make sense to never hide these reads. I'm going to rely on your feedback for that.

colindaven commented 8 years ago

Hi Brigitte,

thanks for this, it's great.

A couple of pointers:

I would definitely call it Mapping Quality not quality score. Quality score can be confused with read base qualities too easily.

Reads with 255 should probably never be filtered in my opinion, since the mapper did not assign a mapping quality to them, therefore the quality of the mapping is completely unknown. I don't think these crop up too much with the main aligners.

Lastly, I may have found one bug. Load 2 siRNA tracks. Set MQ30 on track1. Set MQ20 on track 2. Now go back to track 1 and try to set MQ to some other value, eg. 0 again. For me this did not work. Tested on Firefox on Windows 7. (This is a bit weird though, I could not replicate this consistently myself. However, I just managed to replicate on chrome.)

I tested the Toolbar button which works very well for filtering all tracks.

Thanks for the great implementation. Colin

colindaven commented 8 years ago

Ah, with the toolbar button there is also maybe a bug. I set MQ to 100 - all reads are filtered out as expected. Setting to 20 again did not show any more reads though. (Firefox, Windows)

bhofmei commented 8 years ago

thanks for the info. Small RNA is not my area of expertise so I'm relying on others for help and the person who does most of the small RNA analysis in the lab hasn't been using the plugin yet.

I'll look into those bugs ASAP. Any chance you could pass along the BAM files you used or any other small RNA bam files that you have mapped? Or could you tell me what program/parameters you used for mapping?

I couldn't properly test everything, like the bugs you've found, because the reads in the BAM files I have either have mapping quality 0 or 255. I tried to use regular RNA-seq data to test the filtering capabilities but the RNA-seq reads all have scores of 50 or 0.

colindaven commented 8 years ago

No need to look into the bug(s) too urgently, the plugin is still very usable.

I just used bowtie2 for the alignments, version 2.0.5 (some of them are a bit old). The modern version should perform very similarly. Just try it on your data without the -k parameter you used in the past.

I.e. from the perl script, pretty much default parameters. bowtie2 -p $noCores -x $reference -U $infile -S $infileprefix.sam;

That should compute the mapping quality score for you. If not pls, get back to me, and I'll create a test case relevant to a Jbrowse you run (I guess TAIR10 or a crop ?) with a small public dataset. Its probably far easier for you to run this on your own genome in house with one of your test small RNA datasets.

Its a bit weird that your RNA-seq data has MQ of eitehr 50 or 0. Generally though RNA-seq mappers don't do such a good job of attributing MQ because of their behaviour splitting reads across splice sites - I use STAR which seems to give very low (3) or very high (255). Not great. It works better for genomic sequencing data or for miRNAs.

bhofmei commented 8 years ago

Turns out we used bowtie for mapping our small RNA data. Apparently its general practice to use bowtie vs bowtie 2 for small RNA since we aren’t interested in gaps. Bowtie will only report MAPQ scores of 255 or 0 unless otherwise specified (and even then it only reports the specified number). It’s so frustrating how different alignment programs assign mapping quality scores so differently.

On Sep 28, 2016, at 3:58 AM, colindaven notifications@github.com wrote:

No need to look into the bug(s) too urgently, the plugin is still very usable.

I just used bowtie2 for the alignments, version 2.0.5 (some of them are a bit old). The modern version should perform very similarly. Just try it on your data without the -k parameter you used in the past.

I.e. from the perl script, pretty much default parameters. bowtie2 -p $noCores -x $reference -U $infile -S $infileprefix.sam;

That should compute the mapping quality score for you. If not pls, get back to me, and I'll create a test case relevant to a Jbrowse you run (I guess TAIR10 or a crop ?) with a small public dataset. Its probably far easier for you to run this on your own genome in house with one of your test small RNA datasets.

Its a bit weird that your RNA-seq data has MQ of eitehr 50 or 0. Generally though RNA-seq mappers don't do such a good job of attributing MQ because of their behaviour splitting reads across splice sites - I use STAR which seems to give very low (3) or very high (255). Not great. It works better for genomic sequencing data or for miRNAs.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bhofmei/jbplugin-smallrna/issues/2#issuecomment-250098197, or mute the thread https://github.com/notifications/unsubscribe-auth/APWCOoAWOk-YSnoHKOSH72BpGPy3_J26ks5quh5CgaJpZM4KAmyk.

colindaven commented 8 years ago

Ok, that explains the MQ. Bowtie is just designed for pure speed. I would worry a bit about it's accuracy in general expecially with non-perfect non model genome sequences. If you let me know which genome and version, and also the chromosome name style, you use for one of your genomes I can prepare a toy test example given I find a little public data.

bhofmei commented 8 years ago

I finally got around to looking into the bug you mentioned. I’ve been able to fix it. Just update the plugin to the lastest version and it should work! Also, per your suggestion, I renamed things from “quality score” to “mapping quality”

Brigitte

On Oct 4, 2016, at 3:39 AM, colindaven notifications@github.com wrote:

Ok, that explains the MQ. Bowtie is just designed for pure speed. I would worry a bit about it's accuracy in general expecially with non-perfect non model genome sequences. If you let me know which genome and version, and also the chromosome name style, you use for one of your genomes I can prepare a toy test example given I find a little public data.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bhofmei/jbplugin-smallrna/issues/2#issuecomment-251316299, or mute the thread https://github.com/notifications/unsubscribe-auth/APWCOqQ3zuiPw9hOCwAvJ00PGKjrni_Bks5qwgKugaJpZM4KAmyk.