Closed hxin closed 5 years ago
3 dna:chromosome chromosome:GRCm38:3:131032060:131032210:1 TATTTTATAGATTGGATCCCTCATTTAAATGTTATATGATCCCCCTTTAGTCAATAGTGT GAATATTCAGTATAACTAAAAGACTTTCCGACAAAACTGATCATAAAATGAGGGCCTGTC ACAATTAGATCACTAATACAGTGGCTACTCT
3 dna:chromosome chromosome:GRCm38:3:131031971:131032121:1 GTTTATAACCTTGCTTCAGGGATAGGGGATCCATACCCTGTAAATGAATGCATTATGCCT CTTTGTGCATGCTTTTTATTTGTTGCATGTATTTTATAGATTGGATCCCTCATTTAAATG TTATATGATCCCCCTTTAGTCAATAGTGTGA
SRR6685151.1 83 3 131032060 255 150M = 131031971 -239 TATTTTATACATTAGATCCCTCATTTAAATGTTATATGATGCCCCTTTATTCCATAGTGTGAATATTCAGTATAACTAAAAGACTTTCCGACAAAACTGATCATAAAATGAGGGCCTGTCACAATTAGATCACTAATACAGTGGCTACTC F77A<--7--7A--)A7AF<F7FAFA7--AFF<FA7-A-J-JJFJF-<A-FF-FFJF7<<JJJJFFJ<JJJJJJFJJJJFJF7JJJF<JJF-<-AJJJJJJJJJFJFJJJAJJJJJJJFJJ-<JJJJJJJAJJJJJJFJA-JFJA-F<AA AS:i:-15 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:9G3G26C8G2A97 YS:i:-53 YT:Z:CP SRR6685151.1 163 3 131031971 255 150M = 131032060 239 GGTTATAACCTTGCTTCAGGGATAGGGGATCCATACCCTGTAAAAGAATGCATTATACCTCTTTGTGCATGATTTTTATTTATTGCATGTATTTTATATATTTCATCCCTCATGGAAATGTTAGATGAGCCCCCTTTAGTTAATACTACT --AAAF-F<FFFFJJF-AA<-FJJFJAFFFFJJ<<AFAJ-FFF--<JFF-7F<AF<7FJFJJJJFFJ-7FJ-<A<-7-F---7A-A7AA-AF-<-F-A-A7--7-AJF)-<----77-A<----7--7--7-7<-A-7--------7--- AS:i:-53 XN:i:0 XM:i:17 XO:i:0 XG:i:0 NM:i:17 MD:Z:1T42T11G14C9G16G3G0G9T0T8T4T11C4G1G0T0G0YS:i:-15 YT:Z:CP
If two reads overlap, the sum of XM will count the overlapping region twice, which is an over-estimate of the number of mismatches. We can work out the overlapping bit using and the mismatches fall into that region with the MD tag but this is likely to slow down the speed of the program.
A less good estimation is to use the max XM of the two reads.
We decided to use the max XM of the two pair as the mismatches for this pair.
Before/After making the change:
pe
2019-05-29 15:24:12 INFO: Species 1: wrote 4516 filtered hits for 269 reads; 3596 hits for 109 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:24:12 INFO: Species 2: wrote 914 filtered hits for 29 reads; 72 hits for 5 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:24:12 INFO: Species 3: wrote 0 filtered hits for 0 reads; 458 hits for 20 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:38:27 INFO: Species 1: wrote 2070 filtered hits for 163 reads; 6042 hits for 215 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:38:27 INFO: Species 2: wrote 720 filtered hits for 20 reads; 266 hits for 14 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:38:27 INFO: Species 3: wrote 0 filtered hits for 0 reads; 458 hits for 20 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
se
2019-05-29 15:26:29 INFO: Species 1: wrote 3961 filtered hits for 318 reads; 2008 hits for 93 reads were rejected outright, and 1909 hits for 28 reads were rejected as ambiguous.
2019-05-29 15:26:29 INFO: Species 2: wrote 0 filtered hits for 0 reads; 417 hits for 7 reads were rejected outright, and 1413 hits for 17 reads were rejected as ambiguous.
2019-05-29 15:26:29 INFO: Species 3: wrote 128 filtered hits for 3 reads; 1227 hits for 67 reads were rejected outright, and 1910 hits for 28 reads were rejected as ambiguous.
2019-05-29 15:38:03 INFO: Species 1: wrote 3961 filtered hits for 318 reads; 2008 hits for 93 reads were rejected outright, and 1909 hits for 28 reads were rejected as ambiguous.
2019-05-29 15:38:03 INFO: Species 2: wrote 0 filtered hits for 0 reads; 417 hits for 7 reads were rejected outright, and 1413 hits for 17 reads were rejected as ambiguous.
2019-05-29 15:38:03 INFO: Species 3: wrote 128 filtered hits for 3 reads; 1227 hits for 67 reads were rejected outright, and 1910 hits for 28 reads were rejected as ambiguous.
We change our mind!
We decided to use the average XM of the two pair as the mismatches for this pair!
ERR2721212.10000029 99 19 54802536 255 101M = 54802753 319 ATGCTGATATATCAATGTGCTGCAGAGTCAAAACCAGACACCAATAGTCAGCAATAAAGCATCATGCTCCAGGACTGAAGCTCATGAGTAGAGTGCTTGCT GGGAGAGAGAG.<GGGA<.GGGGGGGAAGGGGG.AGAGGGAAGGAGGGGGGGGGGIIIGGIIIGIGGIGGGGGIGIGAGGGGGGAG.<<AGGGGGGGIIII
AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:-17 YT:Z:CP
ERR2721212.10000029 147 19 54802753 255 29M1D72M = 54802536 -319 TCAGTGGGAGATGCTGTCTCAAAGAAAAGAAAAAAAAAAAAAAGAAGAAGAAGAAATATGATGAGAGAGAAGCAAAAGGTAACTGATGTCAACATCTGGCT A.GG.<.<.<....AAA<<.<.<.AG<..GIIIIIIGIIIGGGGGGIGGGGAGIGGGIGIGGGGGGGGGG.AIGGAGGGGAAGGGGGGGGAIGGGGAGGAG
AS:i:-17 XN:i:0 XM:i:3 XO:i:1 XG:i:1 NM:i:4 MD:Z:6A14C5T1^A72 YS:i:0 YT:Z:CP
Accepted by master, rejected by avg
ERR2721212.10000983 99 6 34489833 255 101M = 34489995 263 GTGCAGCTTCCTAAAGGGGTCCAGTGCACTCCTCTCAAGACCAGCTACCCCACCCCCACCCCCCACTTAGTGCCCTACTGTAGCTGTGAAGGGCCCAGAGC GGGAGGGGGIGIIIIIGIIGIIIIIIGGGIIIIGIIIGIIIGGGIIIIGIIAGGGGI.AGGI.A<<.<<<A<<GG....<GGAGGGGGAGAGG..G.<...
AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:62A38 YS:i:0 YT:Z:CP
ERR2721212.10000983 147 6 34489995 255 101M = 34489833 -263 AGGCAGGGCTATCCTTATGTCCAGCTTCACAGCTCTAGGTGCTGGCGATTGCTGGCAGTCTTTGGCATTTATTCCCTGGTATTTGGACATCATTCCTGTCT GIIIIIIIIIIIGIGIIIIIIGIGIIIIGGGGIIGIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIGIIIIIIGIIIIIIIIGGGGG
AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:-3 YT:Z:CP
Accepted by avg, rejected by master
# python integer works like this:
(1+0)/2=0
I ran tests on real chipseq data
I used four branches:
The target code is:
@classmethod
def _get_mismatches(cls, hits ):
# https://github.com/statbio/Sargasso/issues/96
if cls._is_paired_hit(hits[0]):
return float(hits[0].get_tag("XM") + hits[1].get_tag("XM"))/2
return hits[0].get_tag("XM")
The results:
master vs max Master assigned MORE corrected reads. This is possibly due to the cases where the first read pair has less mismatch than the second one.
master vs avg_float Master assigned MORE corrected reads. In the case of a pair with mismatches of (0,1), the master will accept it but avg_float will reject it.
master vs avg
Master assigned LESS corrected reads. This is probably due to the use of integer to calculate the average when (1,0) or (0,1) will have 0 mismatch as the average.
Master uses the mismatch from the first read of a pair. This is somewhat unstable. One example would be the master vs avg human data, where the master perform better in one sample(conservative 516.83K), but worse(conservative −585.67K)in the other.
Only when the mismatch has an uneven distribution, favouring in the first read, the master will perform better than AVG.
I think the AVG_float or even the SUM of mismatches(Not tested) of the two pair, make sense. However, this will result in losing a large amount of corrected assigned reads.
@lweasel
We finally decided to use the average float! fixed 74a7d94ff38c2c666c5b5436719b93a3a6a59fe9 Using average mismatch will give us less reads, but we might be able to get aound this by using --best structure instead of the --conservative.
The number of mismatches should be the sum of this number? Overlap?