chipseq bowtie2 output mismatch tag is for individual read in the paired-end reads case.

hxin commented 5 years ago

SRR6685151.1 83 3 131032060 255 150M = 131031971 -239 TATTTTATACATTAGATCCCTCATTTAAATGTTATATGATGCCCCTTTATTCCATAGTGTGAATATTCAGTATAACTAAAAGACTTTCCGACAAAACTGATCATAAAATGAGGGCCTGTCACAATTAGATCACTAATACAGTGGCTACTC F77A<--7--7A--)A7AF<F7FAFA7--AFF<FA7-A-J-JJFJF-<A-FF-FFJF7<<JJJJFFJ<JJJJJJFJJJJFJF7JJJF<JJF-<-AJJJJJJJJJFJFJJJAJJJJJJJFJJ-<JJJJJJJAJJJJJJFJA-JFJA-F<AA AS:i:-15 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:9G3G26C8G2A97 YS:i:-53 YT:Z:CP

SRR6685151.1 163 3 131031971 255 150M = 131032060 239 GGTTATAACCTTGCTTCAGGGATAGGGGATCCATACCCTGTAAAAGAATGCATTATACCTCTTTGTGCATGATTTTTATTTATTGCATGTATTTTATATATTTCATCCCTCATGGAAATGTTAGATGAGCCCCCTTTAGTTAATACTACT --AAAF-F<FFFFJJF-AA<-FJJFJAFFFFJJ<<AFAJ-FFF--<JFF-7F<AF<7FJFJJJJFFJ-7FJ-<A<-7-F---7A-A7AA-AF-<-F-A-A7--7-AJF)-<----77-A<----7--7--7-7<-A-7--------7--- AS:i:-53 XN:i:0 XM:i:17 XO:i:0 XG:i:0 NM:i:17 MD:Z:1T42T11G14C9G16G3G0G9T0T8T4T11C4G1G0T0G0 YS:i:-15 YT:Z:CP

Should the number of mismatches for the pair be the sum of the two reads' XM values? What about the region where the two reads overlap?

hxin commented 5 years ago

3 dna:chromosome chromosome:GRCm38:3:131032060:131032210:1 TATTTTATAGATTGGATCCCTCATTTAAATGTTATATGATCCCCCTTTAGTCAATAGTGT GAATATTCAGTATAACTAAAAGACTTTCCGACAAAACTGATCATAAAATGAGGGCCTGTC ACAATTAGATCACTAATACAGTGGCTACTCT

3 dna:chromosome chromosome:GRCm38:3:131031971:131032121:1 GTTTATAACCTTGCTTCAGGGATAGGGGATCCATACCCTGTAAATGAATGCATTATGCCT CTTTGTGCATGCTTTTTATTTGTTGCATGTATTTTATAGATTGGATCCCTCATTTAAATG TTATATGATCCCCCTTTAGTCAATAGTGTGA

SRR6685151.1 83 3 131032060 255 150M = 131031971 -239 TATTTTATACATTAGATCCCTCATTTAAATGTTATATGATGCCCCTTTATTCCATAGTGTGAATATTCAGTATAACTAAAAGACTTTCCGACAAAACTGATCATAAAATGAGGGCCTGTCACAATTAGATCACTAATACAGTGGCTACTC F77A<--7--7A--)A7AF<F7FAFA7--AFF<FA7-A-J-JJFJF-<A-FF-FFJF7<<JJJJFFJ<JJJJJJFJJJJFJF7JJJF<JJF-<-AJJJJJJJJJFJFJJJAJJJJJJJFJJ-<JJJJJJJAJJJJJJFJA-JFJA-F<AA AS:i:-15 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:9G3G26C8G2A97 YS:i:-53 YT:Z:CP

SRR6685151.1 163 3 131031971 255 150M = 131032060 239 GGTTATAACCTTGCTTCAGGGATAGGGGATCCATACCCTGTAAAAGAATGCATTATACCTCTTTGTGCATGATTTTTATTTATTGCATGTATTTTATATATTTCATCCCTCATGGAAATGTTAGATGAGCCCCCTTTAGTTAATACTACT --AAAF-F<FFFFJJF-AA<-FJJFJAFFFFJJ<<AFAJ-FFF--<JFF-7F<AF<7FJFJJJJFFJ-7FJ-<A<-7-F---7A-A7AA-AF-<-F-A-A7--7-AJF)-<----77-A<----7--7--7-7<-A-7--------7--- AS:i:-53 XN:i:0 XM:i:17 XO:i:0 XG:i:0 NM:i:17 MD:Z:1T42T11G14C9G16G3G0G9T0T8T4T11C4G1G0T0G0 YS:i:-15 YT:Z:CP

hxin commented 5 years ago

If the two reads overlap, the sum of their XM values counts the overlapping region twice, which over-estimates the number of mismatches. We could work out the overlapping region and use the MD tag to find the mismatches that fall into it, but this is likely to slow the program down.
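
A rough sketch of what such an overlap-aware count could look like (this is not what Sargasso does, and the function names are made up for illustration). Each read's MD tag is walked to get the reference positions of its mismatches, and the pair's count is the size of the union of the two sets, so a position mismatched in both reads is counted once. It assumes `pos` is the 1-based leftmost position from the SAM record, and it ignores base qualities and whether the two mates agree on the mismatching base.

```python
import re

# One MD token: a run of matches, a deletion (^ plus reference bases),
# or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Z]+)|([A-Z])")


def mismatch_positions(pos, md):
    """Return the set of 1-based reference positions where a read mismatches."""
    positions = set()
    ref = pos
    for matches, deletion, mismatch in MD_TOKEN.findall(md):
        if matches:
            ref += int(matches)
        elif deletion:
            # Deleted reference bases advance the reference but are not
            # counted by XM; subtract 1 for the leading '^'.
            ref += len(deletion) - 1
        else:
            positions.add(ref)
            ref += 1
    return positions


def overlap_aware_mismatches(pos1, md1, pos2, md2):
    """Count mismatched reference positions across both mates, once each."""
    return len(mismatch_positions(pos1, md1) | mismatch_positions(pos2, md2))


print(overlap_aware_mismatches(
    131032060, "9G3G26C8G2A97",
    131031971, "1T42T11G14C9G16G3G0G9T0T8T4T11C4G1G0T0G0",
))
```

For the SRR6685151.1 pair above this gives 20, against 22 for the plain sum of XM, because two mismatch positions (131032069 and 131032073) are covered by both reads.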

A cruder estimate is to use the larger XM of the two reads.

hxin commented 5 years ago

We decided to use the maximum XM of the two reads in a pair as the mismatch count for that pair.

hxin commented 5 years ago

Before/after making the change (for each of pe and se, the first block is before the change and the second is after):

pe
2019-05-29 15:24:12 INFO: Species 1: wrote 4516 filtered hits for 269 reads; 3596 hits for 109 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:24:12 INFO: Species 2: wrote 914 filtered hits for 29 reads; 72 hits for 5 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:24:12 INFO: Species 3: wrote 0 filtered hits for 0 reads; 458 hits for 20 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.

2019-05-29 15:38:27 INFO: Species 1: wrote 2070 filtered hits for 163 reads; 6042 hits for 215 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:38:27 INFO: Species 2: wrote 720 filtered hits for 20 reads; 266 hits for 14 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:38:27 INFO: Species 3: wrote 0 filtered hits for 0 reads; 458 hits for 20 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.

se
2019-05-29 15:26:29 INFO: Species 1: wrote 3961 filtered hits for 318 reads; 2008 hits for 93 reads were rejected outright, and 1909 hits for 28 reads were rejected as ambiguous.
2019-05-29 15:26:29 INFO: Species 2: wrote 0 filtered hits for 0 reads; 417 hits for 7 reads were rejected outright, and 1413 hits for 17 reads were rejected as ambiguous.
2019-05-29 15:26:29 INFO: Species 3: wrote 128 filtered hits for 3 reads; 1227 hits for 67 reads were rejected outright, and 1910 hits for 28 reads were rejected as ambiguous.

2019-05-29 15:38:03 INFO: Species 1: wrote 3961 filtered hits for 318 reads; 2008 hits for 93 reads were rejected outright, and 1909 hits for 28 reads were rejected as ambiguous.
2019-05-29 15:38:03 INFO: Species 2: wrote 0 filtered hits for 0 reads; 417 hits for 7 reads were rejected outright, and 1413 hits for 17 reads were rejected as ambiguous.
2019-05-29 15:38:03 INFO: Species 3: wrote 128 filtered hits for 3 reads; 1227 hits for 67 reads were rejected outright, and 1910 hits for 28 reads were rejected as ambiguous.

hxin commented 5 years ago

We changed our minds!

We decided to use the average XM of the two reads in a pair as the mismatch count for that pair!

hxin commented 5 years ago

ERR2721212.10000029     99      19      54802536        255     101M    =       54802753        319     ATGCTGATATATCAATGTGCTGCAGAGTCAAAACCAGACACCAATAGTCAGCAATAAAGCATCATGCTCCAGGACTGAAGCTCATGAGTAGAGTGCTTGCT   GGGAGAGAGAG.<GGGA<.GGGGGGGAAGGGGG.AGAGGGAAGGAGGGGGGGGGGIIIGGIIIGIGGIGGGGGIGIGAGGGGGGAG.<<AGGGGGGGIIII   
AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:-17        YT:Z:CP

ERR2721212.10000029     147     19      54802753        255     29M1D72M        =       54802536        -319    TCAGTGGGAGATGCTGTCTCAAAGAAAAGAAAAAAAAAAAAAAGAAGAAGAAGAAATATGATGAGAGAGAAGCAAAAGGTAACTGATGTCAACATCTGGCT   A.GG.<.<.<....AAA<<.<.<.AG<..GIIIIIIGIIIGGGGGGIGGGGAGIGGGIGIGGGGGGGGGG.AIGGAGGGGAAGGGGGGGGAIGGGGAGGAG   
AS:i:-17        XN:i:0  XM:i:3  XO:i:1  XG:i:1  NM:i:4  MD:Z:6A14C5T1^A72       YS:i:0  YT:Z:CP

Accepted by master, rejected by avg

hxin commented 5 years ago

ERR2721212.10000983 99  6   34489833    255 101M    =   34489995    263 GTGCAGCTTCCTAAAGGGGTCCAGTGCACTCCTCTCAAGACCAGCTACCCCACCCCCACCCCCCACTTAGTGCCCTACTGTAGCTGTGAAGGGCCCAGAGC   GGGAGGGGGIGIIIIIGIIGIIIIIIGGGIIIIGIIIGIIIGGGIIIIGIIAGGGGI.AGGI.A<<.<<<A<<GG....<GGAGGGGGAGAGG..G.<...   
AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:62A38  YS:i:0  YT:Z:CP

ERR2721212.10000983 147 6   34489995    255 101M    =   34489833    -263    AGGCAGGGCTATCCTTATGTCCAGCTTCACAGCTCTAGGTGCTGGCGATTGCTGGCAGTCTTTGGCATTTATTCCCTGGTATTTGGACATCATTCCTGTCT   GIIIIIIIIIIIGIGIIIIIIGIGIIIIGGGGIIGIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIGIIIIIIGIIIIIIIIGGGGG   
AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:-3 YT:Z:CP

Accepted by avg, rejected by master

# In Python 2, integer division truncates, so the integer average of (1, 0) is 0:
(1+0)/2 = 0
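
A small sketch of the difference, written in Python 3 syntax, where `//` gives the same result that `/` gives for two non-negative integers in Python 2. The function names are only for illustration.

```python
def int_avg(xm1, xm2):
    # Floor division: the fractional part is dropped.
    return (xm1 + xm2) // 2


def float_avg(xm1, xm2):
    # Dividing by a float keeps the fractional part in Python 2 and 3.
    return (xm1 + xm2) / 2.0


print(int_avg(1, 0), float_avg(1, 0))  # 0 0.5
print(int_avg(0, 3), float_avg(0, 3))  # 1 1.5
```

The two calls use the XM values of the two ERR2721212 pairs shown above.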

hxin commented 5 years ago

I ran tests on real chipseq data

dna_human_1
dna_mouse_1

I used four branches:

883c044982e5d67500aa4dbc20daf1df389b25b4 dev_avg
- integer average mismatch of the two reads
98809743ff5a4b2a0f6700041fab88abc45ac08d dev_max
- max mismatch of the two reads
e9d33c41269dc89ab90684a3a28f714df15701f4 master
- mismatch of the first read
74a7d94ff38c2c666c5b5436719b93a3a6a59fe9 dev_avg_float
- float average mismatch of the two reads
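
To see how the four branches differ on a single pair, here is a minimal sketch. The dictionary keys are the branch names; the function bodies are only my reading of the one-line descriptions above, not the code in those commits, and `per_branch_mismatches` is a made-up helper.

```python
# XM values of the two mates, in the order listed, for pairs quoted in this thread.
PAIRS = {
    "SRR6685151.1": (5, 17),
    "ERR2721212.10000029": (0, 3),
    "ERR2721212.10000983": (1, 0),
}

# One function per branch (assumed behaviour, see the descriptions above).
STRATEGIES = {
    "master": lambda xm1, xm2: xm1,
    "dev_max": lambda xm1, xm2: max(xm1, xm2),
    "dev_avg": lambda xm1, xm2: (xm1 + xm2) // 2,
    "dev_avg_float": lambda xm1, xm2: (xm1 + xm2) / 2.0,
}


def per_branch_mismatches(xm1, xm2):
    """Return {branch name: mismatch count} for one read pair."""
    return {name: f(xm1, xm2) for name, f in STRATEGIES.items()}


for read, (xm1, xm2) in PAIRS.items():
    print(read, per_branch_mismatches(xm1, xm2))
```

If the filter only accepts a pair whose mismatch count is 0 (an assumption: the threshold is not stated in this thread, but it is consistent with the two ERR2721212 examples), master accepts (0, 3) and rejects (1, 0), dev_avg does the opposite, and dev_max and dev_avg_float reject both.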

The target code is:

    @classmethod
    def _get_mismatches(cls, hits):
        # https://github.com/statbio/Sargasso/issues/96
        # bowtie2 reports XM per read, so for a paired hit use the float
        # average of the two mates' XM values.
        if cls._is_paired_hit(hits[0]):
            return float(hits[0].get_tag("XM") + hits[1].get_tag("XM")) / 2
        return hits[0].get_tag("XM")

The results:

mouse

master vs max

Master assigned MORE reads correctly. This is possibly due to cases where the first read of the pair has fewer mismatches than the second.

master vs avg_float

Master assigned MORE reads correctly. For a pair with mismatches of (0,1), master will accept it but avg_float will reject it.

master vs avg

Master assigned FEWER reads correctly. This is probably due to the integer average, under which (1,0) or (0,1) gives an average of 0 mismatches.

Master uses the mismatch count from the first read of a pair. This is somewhat unstable. One example is the master vs avg human data, where master performs better in one sample (conservative 516.83K) but worse (conservative −585.67K) in the other.

Master performs better than avg only when the mismatches are unevenly distributed in favour of the first read.

I think avg_float, or even the sum of the mismatches of the two reads in a pair (not tested), makes sense. However, this will lose a large number of correctly assigned reads.

@lweasel

hxin commented 5 years ago

We finally decided to use the float average! Fixed in 74a7d94ff38c2c666c5b5436719b93a3a6a59fe9. Using the average mismatch count will give us fewer reads, but we might be able to get around this by using --best instead of --conservative.

biomedicalinformaticsgroup / Sargasso
