ToniWestbrook / paladin

Protein Alignment and Detection Interface
MIT License

few reads mapping #18

Closed macmanes closed 8 years ago

macmanes commented 9 years ago

With version 0.1.3, few reads are mapping: for the Merlot samples about 4% of reads map, and with AcidovoraxAvenaeATCC19860-se-250-1000-10.fq only 12% map. :sob:

I thought this was related to last night's fix, but after rolling back to 0.1.2 the mapping rate is about the same, maybe even a bit worse.

ToniWestbrook commented 9 years ago

I just double-checked (and triple-checked, I got confused for a sec) the current alignment percentages against those from the automated tests I ran for Jordan and Louisa a week ago, to be sure they matched. The drop you see probably means that the results you had before (the 12% for Merlot, or whatever the percentage was) were generated with a version of PALADIN from last week, from before the scoring matrix fix. The 4% is accurate, as far as I know and assuming there isn't a deeper bug, given the correct scoring matrix and the current values for the penalty scores and the threshold score. You can get back up to a bigger percentage by dropping the threshold value from 30 to 20 or 10, but I'm not sure how accurate any of it is (the scoring fix, dropping the threshold, etc.) until we get the new references and updated GO term analysis pipeline from Jordan to test all the combinations of penalty and threshold values. I think the scoring matrix fix lowered the mapped-read percentage because it dropped most of the nonsense mappings we were getting before. We'll know more after running the pipeline on all the combinations. Let me know if you see something different that doesn't fit with what I think the issue is.

ToniWestbrook commented 9 years ago

The more I think about this (I'd like to get your take and others' today too), the current seed and threshold values effectively give PALADIN a longer total match requirement than BWA currently has. Since BWA's requirement is a total score of 30 nucleotide-space "points", requiring the same number of protein-space "points" is almost like upping the requirement to 90 nucleotide-space "points" (i.e. 30 * 3). It would be interesting to see how well the GO analysis pipeline does with a threshold value of 10 (which also yields much higher matched percentages). More on that later today.
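To make the arithmetic above concrete, here is a minimal sketch of the conversion. It assumes roughly one "point" per matched residue and three nucleotides per amino acid; a real substitution matrix scores residues unevenly, so treat the numbers as a rough guide only. The function and constant names are illustrative and are not part of PALADIN or BWA.

```python
# Rough conversion from a protein-space score threshold to its
# nucleotide-space equivalent.
# Assumption: ~1 point per matched residue and 3 nucleotides per amino acid.
NUCLEOTIDES_PER_CODON = 3
BWA_THRESHOLD = 30  # nucleotide-space threshold quoted in this thread


def nucleotide_equivalent(protein_threshold: int) -> int:
    """Return the approximate nucleotide-space score for a protein-space threshold."""
    return protein_threshold * NUCLEOTIDES_PER_CODON


for threshold in (30, 20, 10):
    equivalent = nucleotide_equivalent(threshold)
    ratio = equivalent / BWA_THRESHOLD
    print(f"-T {threshold}: ~{equivalent} nucleotide-space points ({ratio:.1f}x BWA)")
```

Under these assumptions, a protein-space threshold of 10 works out to about 30 nucleotide-space points, which is the BWA requirement mentioned above.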

macmanes commented 9 years ago

Also, I should say that making paladin's mapping more permissive dramatically increases the mapping rate: paladin align -u 2 -B0 -O0 -E0 -L0

macmanes commented 9 years ago

I actually think permissive mapping makes sense in this case, given that we are trying to map very divergent things. Looking at the UniProt report, the hits make sense, and the mapping rate increases 3x.

About your point on the shorter ORF: this is exactly what I was trying to do. An ORF of 30 AA should be very unlikely to arise by random chance.

ToniWestbrook commented 9 years ago

The permissive mapping setting with no penalties also gives an interesting metric from an analysis point of view, as I think it should show the potential maximum number of alignments for that seed length (i.e. literally anything that matches the seed will align, with the primary being the best one out of all the potential secondary alignments). Something to talk more about today.

ToniWestbrook commented 9 years ago

Oh, we'd need to add -T 0 to get that upper limit on the number of alignments, but it's still interesting even with a threshold of 30.
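For anyone wanting to reproduce the upper-bound run, here is a small sketch that assembles the permissive command from this thread with the threshold added. The flags are the ones quoted above; the reference name `reference.fasta` is a placeholder, and passing the index and reads as trailing positional arguments is an assumption, so check `paladin align` usage before running it.

```python
import shlex


def permissive_align_args(index_base: str, reads: str, threshold: int = 0) -> list[str]:
    """Build the no-penalty paladin command line discussed in this thread."""
    return [
        "paladin", "align",
        "-u", "2",                     # flags copied as quoted in the thread
        "-B0", "-O0", "-E0", "-L0",    # all penalties set to zero
        "-T", str(threshold),          # score threshold; 0 for the upper bound
        index_base,                    # assumption: index passed positionally
        reads,                         # assumption: reads passed positionally
    ]


# "reference.fasta" is a hypothetical placeholder for the indexed reference.
args = permissive_align_args(
    "reference.fasta", "AcidovoraxAvenaeATCC19860-se-250-1000-10.fq"
)
print(shlex.join(args))
```

Calling it with `threshold=30` gives the same permissive run with the threshold left at 30, for comparison against the `-T 0` upper bound.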

macmanes commented 9 years ago

Isn't -T0 toggling the score to be reported? Anyway, when setting -T0 you do get about 30% more reads mapping.

ToniWestbrook commented 8 years ago

Parameters have been adjusted after a large number of tests and now deal with the poor alignment issues (when possible). Closing.