ToniWestbrook / paladin

Protein Alignment and Detection Interface
MIT License

Filtering by max quality #38

Closed by gundizalv 5 years ago

gundizalv commented 5 years ago

Hi

I have a question about filtering by max quality, which is a little confusing to me. When I perform the alignment, I set the T parameter to 20, following the example you gave in the man page for "preferring higher quality mappings" in the output. In the output, I then have a list of hits with values from 60 down to 0 in the max quality column. Should I filter here? I mean, keep only hits with MaxQual=60? 50? 40? What's the threshold? What do you recommend? Or was the necessary filtering already done on the command line with "T = 20"?

ToniWestbrook commented 5 years ago

Hi @gundizalv - the way I worded that is definitely misleading. I will go back through and update the examples to remove words that have a double meaning. The T parameter refers to the alignment threshold value, i.e. the minimum alignment score necessary for a sequence to be considered aligned. Increasing this value from the default increases the similarity required between the query and the reference for them to be considered aligned.

The mapping quality value (from the max/mean columns) instead comes from the MAPQ field in the SAM file. This is the mapping quality, which can be interpreted as the confidence (on the phred scale) that a mapping is "correct". It takes into account the length of the match and how close the next suboptimal match is to the hit. Shorter matches, or ones that have a very close suboptimal hit, are going to have lower qualities since there's a higher chance that the match happened by chance. Longer matches, or ones without a close suboptimal hit, will have higher qualities.
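To make the phred scale concrete: under the SAM convention, MAPQ is -10·log10 of the probability that the mapping is wrong. The snippet below is only a rough sketch of that conversion. The function name is made up for illustration, and the values an aligner reports are estimates, so treat the results as approximate confidence levels rather than exact probabilities.

```python
def mapq_to_error_probability(mapq: float) -> float:
    """Convert a phred-scaled MAPQ into the implied probability
    that the mapping is wrong: P = 10 ** (-MAPQ / 10).

    Hypothetical helper for illustration; not part of PALADIN.
    """
    return 10 ** (-mapq / 10)


# Print the implied error probability for a few typical MAPQ values.
for q in (0, 10, 20, 40, 60):
    print(f"MAPQ {q:>2}: P(wrong) ~ {mapq_to_error_probability(q):.6g}")
```

On this scale, a MAPQ of 20 corresponds to roughly a 1 in 100 chance that the mapping is wrong, while a MAPQ of 0 carries no confidence at all.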

The alignment score (the one that's filtered by your -T parameter) doesn't necessarily mean high mapping quality (MAPQ). You could have a really high alignment score because of a perfect match, but have a very low mapping quality if there was also a suboptimal match that did really well too.

In general, you want to filter your results by the maximum mapping quality seen for that protein (from the max quality column). This rests on the assumption that if PALADIN aligns at least one read (or part of a protein) at really high confidence, the protein is truly there, and we can consider all the other lower quality hits to the same protein as true as well. We tend to use 20 as a value to filter by, depending on how sure we want to be, and not really touch the "-T" value (unless we're doing something special, like parameter sweeps). Let me know if that helps, I'll clear up that documentation too.

gundizalv commented 5 years ago

Ok. Suppose I put "-T 20" on the command line. Now I'll have a .tsv file as output. When I check the Quality(Max) column, do I leave everything untouched (no sorting the column, no deleting values <60, etc.) and accept all hits as real counts?

ToniWestbrook commented 5 years ago

Regardless of which T value you use (20, default, anything else), you should generally still filter your TSV results by the max quality column. At bare minimum, don't include anything with quality of 0, but you'll probably want to go a little higher, like 10-20. So delete any rows that have a max quality below 20. Just a side note, the "-T 20" has nothing to do with the 20 in the mapping quality. One is the alignment score threshold, one is the mapping quality. The fact that they're both 20 (from my example on the webpage and what we normally filter for in the mapping quality) is just a coincidence.
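If you'd rather not delete rows by hand, the filtering can be scripted. The following is a minimal sketch, not an official PALADIN utility. It assumes the max quality column header is literally `Quality (Max)`, so check the header line of your own TSV and adjust. The function name, the other column names, and the sample rows are all made up for illustration.

```python
import csv
import io


def filter_by_max_quality(rows, min_quality=20, column="Quality (Max)"):
    """Keep only rows whose max mapping quality is >= min_quality.

    `column` is an assumed header name; change it to match your TSV.
    """
    return [row for row in rows if float(row[column]) >= min_quality]


# Made-up sample data standing in for a PALADIN TSV report.
sample_tsv = (
    "Count\tQuality (Max)\tID\n"
    "12\t60\tPROT_A\n"
    "3\t0\tPROT_B\n"
    "5\t20\tPROT_C\n"
    "1\t7\tPROT_D\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
kept = filter_by_max_quality(rows, min_quality=20)
print([row["ID"] for row in kept])  # ['PROT_A', 'PROT_C']
```

For a real report, replace the `io.StringIO(...)` object with an opened file, e.g. `open("your_report.tsv", newline="")`. The threshold here is the mapping quality cutoff and is unrelated to whatever "-T" value was used for the alignment.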

gundizalv commented 5 years ago

Ok, I was going to submit a paper and I wanted to be sure. Thank you very much!