christophertbrown / bioscripts

some useful scripts for working with genomics and sequencing data
MIT License
17 stars 8 forks source link

Parsing of "mate unmapped" reads in a SAM file #1

Open nigiord opened 6 years ago

nigiord commented 6 years ago

I have been trying to use iRep which seems to use the functionctbBio.mapped.get_reads() and ctbBio.mapped.reads_from_mapping() from this package to parse SAM files. The documentation mentions that the parser expect reads from the same pair to be one after the other in SAM file. Why not.

The problem I have is that I have reads with unmapped mate in my SAM files, and I suspect it could be the source of (silent) incoherent coverage results I get when I try to use iRep.

The parser in ctbBio tries to take this into account in its own way: https://github.com/christophertbrown/bioscripts/blob/master/ctbBio/mapped.py#L122-L150 In particular:

 if int(line[1]) <= 20: # is this from a single read?

I just checked, and the FLAGs (line[1] in the code above) of my single reads are way above 20. In fact, I do not understand where this 20 comes from, given the specifications of the SAM format https://broadinstitute.github.io/picard/explain-flags.html. For instance, I have single reads with FLAG 137, which means "read paired", "mate unmapped" and "second in pair".

Could you give a little more details about how ctbBio.mapped handles unmapped mates in the SAM file?

christophertbrown commented 6 years ago

Hi Nils,

iRep (and mapped.py) require that paired reads come one after the other in the SAM file so that mapping quality can easily be determined for both of the reads. (I’m not aware of a better way of handling this.)

With the default -mm 1 (max. # of read mismatches allowed), both reads in the pair have to map with no more than one mismatch. If one of the reads is not mapped, the pair is kept as long as it mapped with no more than one mismatch.

If your SAM file is not properly sorted, then iRep will not be able to correctly determine mapping quality for both reads in the pair. Admittedly, if you didn’t care about mapping quality (number of mismatches), iRep could just use the bit flag so you would not have to worry about sorting the SAM file. But, that is not the way that it is currently setup!

The part of the iRep code you referred to has to do with handling unpaired reads (e.g. orphaned reads from trimming, not cases where one read in the pair maps). iRep handles mappings that include both paired and single reads by simply skipping the single reads. Based on my reading of the bit flags, it is not possible to have a single read with a value greater than 20.

Hope this helps but let me know if you still have questions.

Best,

Chris

On Apr 26, 2018, at 5:24 AM, Nils Giordano notifications@github.com wrote:

I have been trying to use iRep which seems to use the functionctbBio.mapped.get_reads() and ctbBio.mapped.reads_from_mapping() from this package to parse SAM files. The documentation mentions that the parser expect reads from the same pair to be one after the other in SAM file. Why not.

The problem I have is that I have reads with unmatched pairs in my SAM files, and I suspect it could be the source of (silent) incoherent coverage results I get when I try to use iRep.

The parser in ctbBio tries to take this into account in its own way: https://github.com/christophertbrown/bioscripts/blob/master/ctbBio/mapped.py#L122-L150 In particular:

if int(line[1]) <= 20: # is this from a single read?

I just checked, and the FLAGs (line[1] in the code above) of my single reads are way above 20. In fact, I do not understand where this 20 comes from, given the specifications of the SAM format https://broadinstitute.github.io/picard/explain-flags.html. For instance, I have single reads with FLAG 137, which means "read paired", "mate unmapped" and "second in pair".

Could you give a little more details about how ctbBio.mapped handle unmapped mates in the SAM file?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

nigiord commented 6 years ago

Hi Chris and thank you for your answer,

Apparently the sorting of the SAM file is not the problem here (one can either use bowtie2 --reorder or samtools sort -n to get paired reads one after the other, as you properly documented). The problem is that I have single reads in my SAM file (reads without a mate) that should be filtered out by mapped.py but are not. Hence, each time the parser encounter such a single read, it silently shifts the "pair window" and messes the filtering of all the following pairs of reads.

The part of the iRep code you referred to has to do with handling unpaired reads (e.g. orphaned reads from trimming, not cases where one read in the pair maps). iRep handles mappings that include both paired and single reads by simply skipping the single reads. Based on my reading of the bit flags, it is not possible to have a single read with a value greater than 20.

From what I saw in my files, I think this assumption is false. You can have a pair from which only a single read was mapped, giving the bits "read paired" (0x1), "mate unmapped" (0x8) and "first in pair" (0x40) or "second in pair" (0x80) (depending if bowtie2 found the read in the -1 or -2 provided file). For instance, I definitively have single reads with FLAG 137 (meaning "read paired, mate unmapped, second in pair"), which is way above 20.

I found a workaround which consists into keeping only "properly paired reads" (both 0x1 and 0x2 bits set and 0x4 bit not set, see biostars and samtools flags documentation). This can be done using samtools, with the following command:

samtools view -f 0x1 -f 0x2 -F 0x4 mymapping.sam -o mymapping.filtered.sam

This decreases coverage but ensures that iRep (and mapped.py) correctly parse the SAM file since single reads are removed (so the part of code I mentioned is no longer used). I think this should at least be documented in iRep as the incorrect filtering is totally silent and can generate incoherent results.

If your SAM file is not properly sorted, then iRep will not be able to correctly determine mapping quality for both reads in the pair. Admittedly, if you didn’t care about mapping quality (number of mismatches), iRep could just use the bit flag so you would not have to worry about sorting the SAM file. But, that is not the way that it is currently setup!

I understand the logic behind this parsing of the SAM file, and why you want to improve mapping quality by filtering unspecific reads. However I do not see why it is absolutely needed for this filtering to be done pairwise. Is it really important to solely keep matching pairs of reads instead of matching reads? I see the point, but this looks overly stringent for most analysis.

I am also curious: Is this the reason why you chose to code your own SAM parser and not use, for instance, pysam? I think using a more standard parsing library would definitively improve the current iRep implementation by making it more flexible, and reduce the risk of side effects when the input files are not exactly as you expect them.

Best, Nils

christophertbrown commented 6 years ago

Hi Nils,

Sorry for not responding to this sooner. Have you made progress on this issue?

The problem is that I have single reads in my SAM file (reads without a mate) that should be filtered out by mapped.py but are not. Hence, each time the parser encounter such a single read, it silently shifts the "pair window" and messes the filtering of all the following pairs of reads.

Indeed this would be a problem, but I’m still not sure of the cause. Based the flags, these single reads should be skipped without messing up the pairing.

You can have a pair from which only a single read was mapped, giving the bits "read paired" (0x1), "mate unmapped" (0x8) and "first in pair" (0x40) or "second in pair" (0x80) (depending if bowtie2 found the read in the -1 or -2 provided file). For instance, I definitively have single reads with FLAG 137 (meaning "read paired, mate unmapped, second in pair"), which is way above 20.

In this example, the read is part of a pair so it is handled differently (i.e. not immediately thrown out, even though unmapped).

I found a workaround which consists into keeping only "properly paired reads" (both 0x1 and 0x2 bits set and 0x4 bit not set, see biostars and samtools flags documentation). This can be done using samtools, with the following command:

samtools view -f 0x1 -f 0x2 -F 0x4 mymapping.sam -o mymapping.filtered.sam This decreases coverage but ensures that iRep (and mapped.py) correctly parse the SAM file since single reads are removed (so the part of code I mentioned is no longer used). I think this should at least be documented in iRep as the incorrect filtering is totally silent and can generate incoherent results.

You should not have to do this, and it does indeed typically throw out a lot of reads.

I agree, iRep should be able to detect when there is likely an issue with the SAM file, and at least report it to a log.

I am also curious: Is this the reason why you chose to code your own SAM parser and not use, for instance, pysam? I think using a more standard parsing library would definitively improve the current iRep implementation by making it more flexible, and reduce the risk of side effects when the input files are not exactly as you expect them.

I did it this way for flexibility, but will look into whether or not the same things can be accomplished with pysam. Thanks for the suggestion!

Chris

On May 2, 2018, at 6:37 AM, Nils Giordano notifications@github.com wrote:

Hi Chris and thank you for your answer,

Apparently the sorting of the SAM file is not the problem here (one can either use bowtie2 --reorder or samtools sort -n to get paired reads one after the other, as you properly documented). The problem is that I have single reads in my SAM file (reads without a mate) that should be filtered out by mapped.py but are not. Hence, each time the parser encounter such a single read, it silently shifts the "pair window" and messes the filtering of all the following pairs of reads.

The part of the iRep code you referred to has to do with handling unpaired reads (e.g. orphaned reads from trimming, not cases where one read in the pair maps). iRep handles mappings that include both paired and single reads by simply skipping the single reads. Based on my reading of the bit flags, it is not possible to have a single read with a value greater than 20.

From what I saw in my files, I think this assumption is false. You can have a pair from which only a single read was mapped, giving the bits "read paired" (0x1), "mate unmapped" (0x8) and "first in pair" (0x40) or "second in pair" (0x80) (depending if bowtie2 found the read in the -1 or -2 provided file). For instance, I definitively have single reads with FLAG 137 (meaning "read paired, mate unmapped, second in pair"), which is way above 20.

I found a workaround which consists into keeping only "properly paired reads" (both 0x1 and 0x2 bits set and 0x4 bit not set, see biostars and samtools flags documentation). This can be done using samtools, with the following command:

samtools view -f 0x1 -f 0x2 -F 0x4 mymapping.sam -o mymapping.filtered.sam This decreases coverage but ensures that iRep (and mapped.py) correctly parse the SAM file since single reads are removed (so the part of code I mentioned is no longer used). I think this should at least be documented in iRep as the incorrect filtering is totally silent and can generate incoherent results.

If your SAM file is not properly sorted, then iRep will not be able to correctly determine mapping quality for both reads in the pair. Admittedly, if you didn’t care about mapping quality (number of mismatches), iRep could just use the bit flag so you would not have to worry about sorting the SAM file. But, that is not the way that it is currently setup!

I understand the logic behind this parsing of the SAM file, and why you want to improve mapping quality by filtering unspecific reads. However I do not see why it is absolutely needed for this filtering to be done pairwise. Is it really important to solely keep matching pairs of reads instead of matching reads? I see the point, but this looks overly stringent for most analysis.

I am also curious: Is this the reason why you chose to code your own SAM parser and not use, for instance, pysam? I think using a more standard parsing library would definitively improve the current iRep implementation by making it more flexible, and reduce the risk of side effects when the input files are not exactly as you expect them.

Best, Nils

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

nigiord commented 6 years ago

Hi Chris,

Thank you for your interest. As I told you I have worked around the issue by keeping only "properly paired reads", which has worked fine so far (I do not seem to lose that many reads in my data).

Based the flags, these single reads should be skipped without messing up the pairing. […] In this example, the read is part of a pair so it is handled differently (i.e. not immediately thrown out, even though unmapped). [...] I agree, iRep should be able to detect when there is likely an issue with the SAM file, and at least report it to a log.

It might be an issue with the SAM files (they are generated by bowtie2), I do not know if this is normal to have reads with unmapped paired by default. I just know that my SAM files have "single reads" with FLAGS > 20 (which seems to be common when one of the reads is unmapped, see http://www.samformat.info/sam-format-flag), and according to this piece of code https://github.com/christophertbrown/bioscripts/blob/master/ctbBio/mapped.py#L122-L150 this case is not properly handled by ctbBio.mapped (the counter n is iterated, which shifts the "pair" window for the subsequent reads).

A solution would be to replace:

if int(line[1]) <= 20: # is this from a single read?

by something like this

if int(line[1]) & 0b1000: # is the mate read unmapped? (0x8 flag set to 1)

Note that I still do not understand why the current <= 20 would be a test for single reads.

I did it this way for flexibility, but will look into whether or not the same things can be accomplished with pysam.

I understand the intent, although in my opinion it is often preferable to use standard parsing tools when they are available.

I still do not know how far I will go in my work with iRep since I am still unsure it is appropriate for my problem (large scale analysis on environmental metagenomic data, so the coverage/window thresholds are rarely fulfilled due to the high strain variations). Would I find a solution to adapt the method (or my dataset), I'll probably try to also contribute for the pysam implementation.

Best, Nils

christophertbrown commented 6 years ago

Hi Nils,

It might be an issue with the SAM files (they are generated by bowtie2), I do not know if this is normal to have reads with unmapped paired by default.

I always use bowtie2 and there are pretty much always cases where there are reads with unmapped pairs, so that should not be the problem.

I just know that my SAM files have "single reads" with FLAGS > 20 (which seems to be common when one of the reads is unmapped, see http://www.samformat.info/sam-format-flag), and according to this piece of code https://github.com/christophertbrown/bioscripts/blob/master/ctbBio/mapped.py#L122-L150 this case is not properly handled by ctbBio.mapped (the counter n is iterated, which shifts the "pair" window for the subsequent reads).

I think we are thinking of something different when it comes to ‘single reads.’ That flag is to handle reads that don’t have a pair, not reads that have a pair that is not mapped.

I’d be happy to look at some example ‘good’ and ‘bad’ reads in your SAM file to see if I might be missing something else.

I’m glad you were able to come up with a work around for your dataset. I’d be interested to hear how you end up using iRep.

Best,

Chris

On Jun 11, 2018, at 6:39 AM, Nils Giordano notifications@github.com wrote:

Hi Chris,

Thank you for your interest. As I told you I have worked around the issue by keeping only "properly paired reads", which has worked fine so far (I do not seem to lose that many reads in my data).

Based the flags, these single reads should be skipped without messing up the pairing. […] In this example, the read is part of a pair so it is handled differently (i.e. not immediately thrown out, even though unmapped). [...] I agree, iRep should be able to detect when there is likely an issue with the SAM file, and at least report it to a log.

It might be an issue with the SAM files (they are generated by bowtie2), I do not know if this is normal to have reads with unmapped paired by default. I just know that my SAM files have "single reads" with FLAGS > 20 (which seems to be common when one of the reads is unmapped, see http://www.samformat.info/sam-format-flag), and according to this piece of code https://github.com/christophertbrown/bioscripts/blob/master/ctbBio/mapped.py#L122-L150 this case is not properly handled by ctbBio.mapped (the counter n is iterated, which shifts the "pair" window for the subsequent reads).

A solution would be to replace:

if int(line[1]) <= 20: # is this from a single read? by something like this

if int(line[1]) & 0b1000: # is the mate read unmapped? (0x8 flag set to 1) Note that I still do not understand why the current <= 20 would be a test for single reads.

I did it this way for flexibility, but will look into whether or not the same things can be accomplished with pysam.

I understand the intent, although in my opinion it is often preferable to use standard parsing tools when they are available.

I still do not know how far I will go in my work with iRep since I am still unsure it is appropriate for my problem (large scale analysis on environmental metagenomic data, so the coverage/window thresholds are rarely fulfilled due to the high strain variations). Would I find a solution to adapt the method (or my dataset), I'll probably try to also contribute for the pysam implementation.

Best, Nils

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

nigiord commented 6 years ago

Hi Chris,

I agree I was not very clear in the discussion above, I'll try to summarize the problem below with an actual example.

Let's for instance consider the following fragment of a sorted SAM file:

H3:D179JACXX:6:1101:1133:18005  99  166314.PRJNA37911.CP006882  533238  40  96M =   533335  187 TTCGGNCAGTACGGCGGAGGCGTTGAGAGAGAGATTCAGCAGCTCACCAGGGGCAGCCGTGGCCTGGGTGGCGAGAAGGAAAGCACCCATGTCGGG    @@CDD#2ADHHHHJIJIHGHIIIJE=EEFEC?D?BA@CEEECDDCDCBCDBBBBD@?B<BDD;ADDDD35>BD;B5<CBACC@CA9AB?>A>>A@B    AS:i:-15    XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4  MD:Z:0C4G6A45A37    YS:i:-12    YT:Z:CP
H3:D179JACXX:6:1101:1133:18005  147 166314.PRJNA37911.CP006882  533335  40  90M =   533238  -187    ATGGCAGCCTGGAACCGTNGCGCTGTGGCGACTGTAGCGGCGCTTGACGCGCCGACGGTCATGCAGAGCGCTGCTCGGTGCGATAAAGAA  DDDDDABCCDCB?8882+#;BA?DDDBB;CBDDDDBDBD@BDDDDBCEHHGGAIGHIHGFGGIGIGIHGGIJJJJJJHHHHHFFFFF@C@  AS:i:-12    XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:3A14T51T19 YS:i:-15    YT:Z:CP
H3:D179JACXX:6:1101:1133:18497  83  166314.PRJNA37911.CP006882  1274255 42  100M    =   1274185 -170    GCTTCAGTGGTCCTATGCCTCAGGCGAAACTCACCATCGGTGAATTAGAGGCGGGCTATCCGATGTATTGCAAGGCGCTGCGGCGTCTCCTCCANCAGGG    C@@CDDA<BCCA>CCB??CC@BDDB@@:4BA>4B;7@?DDDDEEDDDDDDDDFC@EHHIGIJIHJJIIIJGGGIJJIJIJJJJJIIIGHHFCA2#FFCCC    AS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:94G5   YS:i:-9 YT:Z:CP
H3:D179JACXX:6:1101:1133:18497  163 166314.PRJNA37911.CP006882  1274185 42  61M =   1274255 170 CGGGTGGAAATCGACTGCGTGGCCTGGCTCGGCAGCTGAGCCTGGCCCGCGGGGGCATAGG   @@CFDDFDHHHHFGHIJJEHHIHFHFHIIJ?FHABEGEACEG7?ECEB4<9?BBD88<@:@   AS:i:-9 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:2C39A18    YS:i:-1 YT:Z:CP
H3:D179JACXX:6:1101:1135:40985  99  ENA|DBLP01000022|DBLP01000022.1 247009  42  100M    =   247083  175 GATTGGCGTAATACGGTAAGATACAGGCCGGTGAAACAATAGCTTTAGTGTGAGTTGCGACAGGATGTCAATTCTTGAAGTTAACGGATTAGTCAAAACT    @@CFFFFDHHHHHJJJGHIJIJJIGJGIJJJ6@DHGGHGIJIJJJIGCFHIIII@;ADEHEFDDBCCDDADFDDDC>CCDCCDEDD78&5>:@CCCC@AC    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100    YS:i:0  YT:Z:CP
H3:D179JACXX:6:1101:1135:40985  147 ENA|DBLP01000022|DBLP01000022.1 247083  42  101M    =   247009  -175    TTGAAGTTAACGGATTAGTCAAAACTTATGGTCGTAAGCGGGTCGTCAACGGCGTTAGCTTTTCGGTCAATGCCGGTGAAATTGTAGGTCTGCTGGGCCCT   CDDEDDDDDDBCC@@@DDDDDEECDCDCBDDDDDBDDBDDDDBBCBDDDEHHJJHFJJIIJIGGIIHGIIIHGHIJJIJJJJJJJJIJGHGHFFFFDFB@@   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP
H3:D179JACXX:6:1101:1135:98416  83  166314.PRJNA37911.CP006882  1491193 40  99M =   1491108 -184    CAGACGGTTCACGCGAGCTGGGTTCACCTGCACTGGGCAAGCTCTACGACGATCTGCTCCCGATGGCGCGCCGCGCTCGAAGCCGGGGAGANAAAATTG B?9<<?><25)<5)448??BBC>:C?@CACA@?CCC@C>:AA<B<9B?<BAC?CBBBBB?@CB@BBCBEGIIIIIIIIIIIGHGDHFHD<2#DDDD@@@ AS:i:-11    XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:91A2G2C1   YS:i:-21    YT:Z:CP
H3:D179JACXX:6:1101:1135:98416  163 166314.PRJNA37911.CP006882  1491108 40  99M =   1491193 184 CAGCAACCGGCCGTCCTCCGACAGGTCAGTGCTGTGGGACATTGAGGGATGGAAAGTCCATCGACATTCGGAAGGATGGGTCGATCAGACGGTTCACGC B@@DDBDEHHGHGHHHJJJJII<DDBFGHFHGGAFHHIEIEGGJGEFHBCBDCCCECE@>CDBDDDDCCEDB@BDACDDD9A@B7?<@C><B><8@CDD AS:i:-21    XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4  MD:Z:2C8T5A27T53    YS:i:-11    YT:Z:CP
H3:D179JACXX:6:1101:1139:100353 153 166314.PRJNA37911.CP006882  1438860 8   96M =   1438860 0   GCACCTGCAGCGGAGCCCCAGGGGTGACCGTGGGTACCACCACCGAACTGAAGAACGGAGTCGTCGCCGAAGATGGTGACCAGGGCGGGCATGTGC    @C<<>9<99DDDDBB<8808>A>(DB?BBB>CC>CB??8?DDDCCB>CBDC??;6=EBA;<GGHGFEGIGCHGDCC4CA<GGHHHFHHDFDDF@CB    AS:i:-32    XN:i:0  XM:i:7  XO:i:0  XG:i:0  NM:i:7  MD:Z:50C2C11A5A4C6T2A9  YT:Z:UP
H3:D179JACXX:6:1101:1140:51253  83  166314.PRJNA37911.CP006882  1452606 42  101M    =   1452532 -175    GGGAGAAGACCCAGTCAGCCACGCGGGTGAATTTGGAGAGACCGATCACCCGGGCACCAGGTTTGATCCCGATCCAGCAGTTGCCCATGATCGGCACCAGG   C@CDC><?<DDDCA8DD@<D>BBDDCC@:DDDDDDDDCBBADDEEEEBHJJJJJIIJJJJJIIJIJHJJGJIHFIJJIJJJJJIIJJJHHGHHFDDFFCC@   AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:13A87  YS:i:-10    YT:Z:CP
H3:D179JACXX:6:1101:1140:51253  163 166314.PRJNA37911.CP006882  1452532 42  101M    =   1452606 175 GAGACCCTGGGGCTCACAGAGCTTCTCGATTTCATCGGCAAGGATCATCACAGCCTCTTCCTGGATGTGAGGGCGGGAGAAGACCCAGTCAGCCACGCGGG   CCCFFFFFGGHGHIJJJJHIIJJJJGIIJIHIHIJJGHIJIBEGCHEGGHIIIIGGHHIIHHHH6@CDFFFEDDDDBBDBBDCDDD?<>CCCCCBC5>9BB   AS:i:-10    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:33G53A13   YS:i:-5 YT:Z:CP
H3:D179JACXX:6:1101:1140:57121  83  ENA|DBLP01000008|DBLP01000008.1 141650  42  101M    =   141571  -180    TACTCGCCGGTATAAAGTCCGGTGATGAGAGACAATGGGCCCCCATTCAATTTCACTTGCTGAGCAAAACGGAGAAAGGCAGCAGGGTCACTGTCGCTTTC   899@D@B9>CDDDCC>@>BBCCDCC>CC@DEEDDEDDDBBGGIIIGGHHDGIIHGHCIGIGGIHCIIHFIJJIGJIGIIIJIIJJJIIHHDHHFFFFF@@@   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP
H3:D179JACXX:6:1101:1140:57121  163 ENA|DBLP01000008|DBLP01000008.1 141571  42  78M =   141650  180 GCTAAGTGTTTTTCCAAAGGTCAGTGTGGGAGCGATGGAAAGTAACGCGATATAGAACGGATGTTGATTGGCTTGTAT  B@BFFFFFHHHHHIJGIEG@AFGG:FFCGGEGGIGEIJIIIJCFHHAHHIHAEEEFHGFFB>BCCCACCCA>@?8<>C  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:78 YS:i:0  YT:Z:CP

The "bad" read here is H3:D179JACXX:6:1101:1139:100353. If I follow exactly what bioscripts/mapped.py does, since the flag is above > 20, c is incremented so the parser treats this read as paired with the next one H3:D179JACXX:6:1101:1140:51253. This is wrong and will mess up the check_mismatches procedure. Worse: the incrementation error is conserved through the rest of the SAM file (until we encounter a new "solo" read like this).

The faulty read has a FLAG 153 which means:

read paired (0x1) mate unmapped (0x8) read reverse strand (0x10) second in pair (0x80)

I was calling it "single read" because the mate is not present in the SAM file. But I agree the term is technically incorrect: this is a paired read but the mate is not present in the file because it has been filtered out.

Command that generated the SAM file:

bowtie2 -x basename -1 1.fastq -2 2.fastq -S mapping.sam --no-unal

I suppose the problem does not occur if I do not filter out unmapped reads (with the --no-unal parameter of bowtie2), but the iRep README explicitly recommends to shrink the SAM file by filtering out unmapped reads.

Current workaround

Not filtering out unmapped reads is not an option for me because that would increase 20-50 times the size of the SAM files in my workflow. As stated above in the discussion, the solution I found was to add an additional filter that filters out the reads that are not "properly paired" using the command:

samtools view -f 0x1 -f 0x2 -F 0x4 <input> -o <output>

This effectively removes the "solo" reads (as well as pairs where reads do not map close to each other), hence the parsing bug does not occur anymore.

Best, Nils

christophertbrown commented 6 years ago

Hi Nils,

Thanks for this additional information, it is super helpful!

On Jun 26, 2018, at 6:32 AM, Nils Giordano notifications@github.com wrote:

The faulty read has a FLAG 153 which means:

read paired (0x1) mate unmapped (0x8) read reverse strand (0x10) second in pair (0x80)

I was calling it "single read" because the mate is not present in the SAM file. But I agree the term is technically incorrect: this is a paired read but the mate is not present in the file because it had been filtered out.

Command that generated the SAM file:

bowtie2 -x basename -1 1.fastq -2 2.fastq -S mapping.sam --no-unal I suppose the problem does not occur if I do not filter out unmapped reads (with the --no-unal parameter of bowtie2), but the iRep README explicitly recommends to shrink the SAM file by filtering out unmapped reads.

Indeed, this is the problem, and why I was having trouble figuring out what was happening.

I completely understand not wanting to keep unmapped reads in the SAM file, but in our experience the —no-unal option is not sufficient for exactly this reason - you lose the paired read. In this case, we just want to throw the read out anyway, but in some cases it can be really useful to save the unmapped paired read. Sorry, I did not realize that you were using this option.

I recommend using shrinksam (https://github.com/bcthomas/shrinksam) instead.

Your current colution seems more than adequate, but I can post additional information and will consider changing the parsing options to handle this better (ultimately we do just want to throw out this read). I think it would indeed be smarter to check the full bit flag.

Let me know if you have questions.

Best,

Chris