biomedicalinformaticsgroup / Sargasso

Sargasso disambiguates mixed-species high-throughput sequencing data.
http://biomedicalinformaticsgroup.github.io/Sargasso/
Other
8 stars 4 forks source link

chipseq bowtie2 output mismatch tag is for indivudule read in the pair-end reads case. #96

Closed hxin closed 5 years ago

hxin commented 5 years ago

SRR6685151.1 83 3 131032060 255 150M = 131031971 -239 TATTTTATACATTAGATCCCTCATTTAAATGTTATATGATGCCCCTTTATTCCATAGTGTGAATATTCAGTATAACTAAAAGACTTTCCGACAAAACTGATCATAAAATGAGGGCCTGTCACAATTAGATCACTAATACAGTGGCTACTC F77A<--7--7A--)A7AF<F7FAFA7--AFF<FA7-A-J-JJFJF-<A-FF-FFJF7<<JJJJFFJ<JJJJJJFJJJJFJF7JJJF<JJF-<-AJJJJJJJJJFJFJJJAJJJJJJJFJJ-<JJJJJJJAJJJJJJFJA-JFJA-F<AA AS:i:-15 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:9G3G26C8G2A97 YS:i:-53 YT:Z:CP SRR6685151.1 163 3 131031971 255 150M = 131032060 239 GGTTATAACCTTGCTTCAGGGATAGGGGATCCATACCCTGTAAAAGAATGCATTATACCTCTTTGTGCATGATTTTTATTTATTGCATGTATTTTATATATTTCATCCCTCATGGAAATGTTAGATGAGCCCCCTTTAGTTAATACTACT --AAAF-F<FFFFJJF-AA<-FJJFJAFFFFJJ<<AFAJ-FFF--<JFF-7F<AF<7FJFJJJJFFJ-7FJ-<A<-7-F---7A-A7AA-AF-<-F-A-A7--7-AJF)-<----77-A<----7--7--7-7<-A-7--------7--- AS:i:-53 XN:i:0 XM:i:17 XO:i:0 XG:i:0 NM:i:17 MD:Z:1T42T11G14C9G16G3G0G9T0T8T4T11C4G1G0T0G0YS:i:-15 YT:Z:CP

The number of mismatches should be the sum of this number? Overlap?

hxin commented 5 years ago

3 dna:chromosome chromosome:GRCm38:3:131032060:131032210:1 TATTTTATAGATTGGATCCCTCATTTAAATGTTATATGATCCCCCTTTAGTCAATAGTGT GAATATTCAGTATAACTAAAAGACTTTCCGACAAAACTGATCATAAAATGAGGGCCTGTC ACAATTAGATCACTAATACAGTGGCTACTCT

3 dna:chromosome chromosome:GRCm38:3:131031971:131032121:1 GTTTATAACCTTGCTTCAGGGATAGGGGATCCATACCCTGTAAATGAATGCATTATGCCT CTTTGTGCATGCTTTTTATTTGTTGCATGTATTTTATAGATTGGATCCCTCATTTAAATG TTATATGATCCCCCTTTAGTCAATAGTGTGA

SRR6685151.1 83 3 131032060 255 150M = 131031971 -239 TATTTTATACATTAGATCCCTCATTTAAATGTTATATGATGCCCCTTTATTCCATAGTGTGAATATTCAGTATAACTAAAAGACTTTCCGACAAAACTGATCATAAAATGAGGGCCTGTCACAATTAGATCACTAATACAGTGGCTACTC F77A<--7--7A--)A7AF<F7FAFA7--AFF<FA7-A-J-JJFJF-<A-FF-FFJF7<<JJJJFFJ<JJJJJJFJJJJFJF7JJJF<JJF-<-AJJJJJJJJJFJFJJJAJJJJJJJFJJ-<JJJJJJJAJJJJJJFJA-JFJA-F<AA AS:i:-15 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:9G3G26C8G2A97 YS:i:-53 YT:Z:CP SRR6685151.1 163 3 131031971 255 150M = 131032060 239 GGTTATAACCTTGCTTCAGGGATAGGGGATCCATACCCTGTAAAAGAATGCATTATACCTCTTTGTGCATGATTTTTATTTATTGCATGTATTTTATATATTTCATCCCTCATGGAAATGTTAGATGAGCCCCCTTTAGTTAATACTACT --AAAF-F<FFFFJJF-AA<-FJJFJAFFFFJJ<<AFAJ-FFF--<JFF-7F<AF<7FJFJJJJFFJ-7FJ-<A<-7-F---7A-A7AA-AF-<-F-A-A7--7-AJF)-<----77-A<----7--7--7-7<-A-7--------7--- AS:i:-53 XN:i:0 XM:i:17 XO:i:0 XG:i:0 NM:i:17 MD:Z:1T42T11G14C9G16G3G0G9T0T8T4T11C4G1G0T0G0YS:i:-15 YT:Z:CP

hxin commented 5 years ago

If two reads overlap, the sum of XM will count the overlapping region twice, which is an over-estimate of the number of mismatches. We can work out the overlapping bit using and the mismatches fall into that region with the MD tag but this is likely to slow down the speed of the program.

A less good estimation is to use the max XM of the two reads.

hxin commented 5 years ago

We decided to use the max XM of the two pair as the mismatches for this pair.

hxin commented 5 years ago

Before/After making the change:

pe
2019-05-29 15:24:12 INFO: Species 1: wrote 4516 filtered hits for 269 reads; 3596 hits for 109 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:24:12 INFO: Species 2: wrote 914 filtered hits for 29 reads; 72 hits for 5 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:24:12 INFO: Species 3: wrote 0 filtered hits for 0 reads; 458 hits for 20 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.

2019-05-29 15:38:27 INFO: Species 1: wrote 2070 filtered hits for 163 reads; 6042 hits for 215 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:38:27 INFO: Species 2: wrote 720 filtered hits for 20 reads; 266 hits for 14 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.
2019-05-29 15:38:27 INFO: Species 3: wrote 0 filtered hits for 0 reads; 458 hits for 20 reads were rejected outright, and 0 hits for 0 reads were rejected as ambiguous.

se
2019-05-29 15:26:29 INFO: Species 1: wrote 3961 filtered hits for 318 reads; 2008 hits for 93 reads were rejected outright, and 1909 hits for 28 reads were rejected as ambiguous.
2019-05-29 15:26:29 INFO: Species 2: wrote 0 filtered hits for 0 reads; 417 hits for 7 reads were rejected outright, and 1413 hits for 17 reads were rejected as ambiguous.
2019-05-29 15:26:29 INFO: Species 3: wrote 128 filtered hits for 3 reads; 1227 hits for 67 reads were rejected outright, and 1910 hits for 28 reads were rejected as ambiguous.

2019-05-29 15:38:03 INFO: Species 1: wrote 3961 filtered hits for 318 reads; 2008 hits for 93 reads were rejected outright, and 1909 hits for 28 reads were rejected as ambiguous.
2019-05-29 15:38:03 INFO: Species 2: wrote 0 filtered hits for 0 reads; 417 hits for 7 reads were rejected outright, and 1413 hits for 17 reads were rejected as ambiguous.
2019-05-29 15:38:03 INFO: Species 3: wrote 128 filtered hits for 3 reads; 1227 hits for 67 reads were rejected outright, and 1910 hits for 28 reads were rejected as ambiguous.
hxin commented 5 years ago

We change our mind!

We decided to use the average XM of the two pair as the mismatches for this pair!

hxin commented 5 years ago
ERR2721212.10000029     99      19      54802536        255     101M    =       54802753        319     ATGCTGATATATCAATGTGCTGCAGAGTCAAAACCAGACACCAATAGTCAGCAATAAAGCATCATGCTCCAGGACTGAAGCTCATGAGTAGAGTGCTTGCT   GGGAGAGAGAG.<GGGA<.GGGGGGGAAGGGGG.AGAGGGAAGGAGGGGGGGGGGIIIGGIIIGIGGIGGGGGIGIGAGGGGGGAG.<<AGGGGGGGIIII   
AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:-17        YT:Z:CP

ERR2721212.10000029     147     19      54802753        255     29M1D72M        =       54802536        -319    TCAGTGGGAGATGCTGTCTCAAAGAAAAGAAAAAAAAAAAAAAGAAGAAGAAGAAATATGATGAGAGAGAAGCAAAAGGTAACTGATGTCAACATCTGGCT   A.GG.<.<.<....AAA<<.<.<.AG<..GIIIIIIGIIIGGGGGGIGGGGAGIGGGIGIGGGGGGGGGG.AIGGAGGGGAAGGGGGGGGAIGGGGAGGAG   
AS:i:-17        XN:i:0  XM:i:3  XO:i:1  XG:i:1  NM:i:4  MD:Z:6A14C5T1^A72       YS:i:0  YT:Z:CP

Accepted by master, rejected by avg

hxin commented 5 years ago
ERR2721212.10000983 99  6   34489833    255 101M    =   34489995    263 GTGCAGCTTCCTAAAGGGGTCCAGTGCACTCCTCTCAAGACCAGCTACCCCACCCCCACCCCCCACTTAGTGCCCTACTGTAGCTGTGAAGGGCCCAGAGC   GGGAGGGGGIGIIIIIGIIGIIIIIIGGGIIIIGIIIGIIIGGGIIIIGIIAGGGGI.AGGI.A<<.<<<A<<GG....<GGAGGGGGAGAGG..G.<...   
AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:62A38  YS:i:0  YT:Z:CP

ERR2721212.10000983 147 6   34489995    255 101M    =   34489833    -263    AGGCAGGGCTATCCTTATGTCCAGCTTCACAGCTCTAGGTGCTGGCGATTGCTGGCAGTCTTTGGCATTTATTCCCTGGTATTTGGACATCATTCCTGTCT   GIIIIIIIIIIIGIGIIIIIIGIGIIIIGGGGIIGIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIGIIIIIIGIIIIIIIIGGGGG   
AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:-3 YT:Z:CP

Accepted by avg, rejected by master

# python integer works like this:
(1+0)/2=0
hxin commented 5 years ago

I ran tests on real chipseq data

I used four branches:

The target code is:

    @classmethod
    def _get_mismatches(cls, hits ):
        # https://github.com/statbio/Sargasso/issues/96
        if cls._is_paired_hit(hits[0]):
            return float(hits[0].get_tag("XM") + hits[1].get_tag("XM"))/2
        return hits[0].get_tag("XM")

The results:

mouse

Master assigned LESS corrected reads. This is probably due to the use of integer to calculate the average when (1,0) or (0,1) will have 0 mismatch as the average.

Master uses the mismatch from the first read of a pair. This is somewhat unstable. One example would be the master vs avg human data, where the master perform better in one sample(conservative 516.83K), but worse(conservative −585.67K)in the other.

Only when the mismatch has an uneven distribution, favouring in the first read, the master will perform better than AVG.

I think the AVG_float or even the SUM of mismatches(Not tested) of the two pair, make sense. However, this will result in losing a large amount of corrected assigned reads.

@lweasel

hxin commented 5 years ago

We finally decided to use the average float! fixed 74a7d94ff38c2c666c5b5436719b93a3a6a59fe9 Using average mismatch will give us less reads, but we might be able to get aound this by using --best structure instead of the --conservative.