Closed hazmup closed 3 months ago
Note for future readers: I was misreading the flags of the improper reads. They were chimeric reads and it seems the intended behavior with the default --chimeric-pairs=use
is to remove one of the reads. I am now running with --chimeric-pairs=discard
as well.
I am reopening this issue because including --chimeric-pairs=discard
still resulted in a .bam
file with orphan reads.
The most likely explanation of this is that you have reads that claim to be paired (and not chimeric) but UMI-tools can't find the mate. When UMI-tools does its output, it makes a decision based on the information available in read1. It then outputs the read immediately, and makes a note to output the mate when it encounters it. It does a second sweep at the end to try and find anything it didn't find the first time through. So, if a read claims to be paired, but the read identified in the mate chr and pos fields isn't in the file, then you'll end up with unpaired reads in the output. The log file should say how many reads this was the case for.
It seems there are reads with secondary alignments to the same coordinates (so 4 lines in the input bam), and in the deduplicated output there is one of the primary alignments and one of the secondaries left for each pair. I.e. there are quadruplets of pairs with flags 99, 355, 147, 403, and after the deduplication step only the 355, 147 flagged reads remain. There must be a few more orphan reads, which I am trying to fish out that are causing problems.
I have managed to get valid .fastq
files by filtering out the secondary, supplementary and non-properly paired reads (with samtools view -f2 -F2304
) from the input .bam
files, although I do not know if there is a better/more elegant way of going about it. Thank you again.
Hi Stelios,
I'm not quite sure what the problem is. In your example, you said start
with 4 alignments, two from the read1 and two from the read2, and two of
these are output. What would you expect to happen instead? It might be that
running the results through prepare-for-rsem
would help, as this does
things like copying a read2 where it is the pair of multiple read1
alignments, and sets the paring information so that read1 and read2 in a
pair always point to each other, even if this wasn't the case in the
original bam.
Hi @IanSudbery,
I have not yet been able to identify all the reads that are causing trouble. Maybe the example I mentioned was resolved by using samtools sort
instead of samtools collate
which only takes into account primary alignments. Apologies, I was not sure what the cause of all of my orphan reads was earlier.
Now it seems I have at least one problem, in one case a bunch (10) of reads in the input bam that result is three reads remaining after deduplication.
Reads in input:
V350189085L1C001R00100098693_TTTGAGGATTAG 355 chr14 25884192 1 73M1746N14M = 25886015 157 CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A< AS:i:-3 ZS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:52T34 YS:i:0 YT:Z:CP XS:A:+ NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 355 chr14 25884192 1 73M1746N14M = 26165399 279541 CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A< AS:i:-3 ZS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:52T34 YS:i:0 YT:Z:CP XS:A:+ NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 403 chr14 25886015 1 80M = 25884192 -157 ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8 AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 355 chr14 26023962 1 73M1746N14M = 26165399 139771 CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A< AS:i:-3 ZS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:52T34 YS:i:0 YT:Z:CP XS:A:+ NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 355 chr14 26023962 1 73M1746N14M = 26025785 157 CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A< AS:i:-3 ZS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:52T34 YS:i:0 YT:Z:CP XS:A:+ NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 403 chr14 26025785 1 80M = 26023962 -157 ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8 AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 99 chr14 26163576 1 73M1746N14M = 26165399 157 CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A< AS:i:-3 ZS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:52T34 YS:i:0 YT:Z:CP XS:A:+ NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 147 chr14 26165399 1 80M = 26163576 -157 ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8 AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 403 chr14 26165399 1 80M = 26023962 -139771 ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8 AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 403 chr14 26165399 1 80M = 25884192 -279541 ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8 AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
Output reads:
V350189085L1C001R00100098693_TTTGAGGATTAG 355 chr14 25884192 1 73M1746N14M = 26165399 279541 CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A< AS:i:-3 ZS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:52T34 YS:i:0 YT:Z:CP XS:A:+ NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 355 chr14 26023962 1 73M1746N14M = 26165399 139771 CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A< AS:i:-3 ZS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:52T34 YS:i:0 YT:Z:CP XS:A:+ NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG 147 chr14 26165399 1 80M = 26163576 -157 ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8 AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
You will see that the first two reads both point to the third read as their mate. In SAM specification it says that a read should always point to the primary alignment of its mates, even if that read does not point to it. In theory both primary and supplementary alignments of read1 should point to the primary alignment of read2 and vica versa. However, only have 50% of aligners follow this rule.
The case you post above would be solved by running prepare-for-rsem
- the
third read would be duplicated so that one copy pointed to the first read,
and one to the second, and the files sorted so that it always went read1,
read2, read1, read2 etc.
I see, thank you again, I will check some more to see if this fixes all of my problematic reads and get back if needed. I am marking it closed for now if that's ok.
Hi, and thank you for the great tool!
I am trying to get
.bam
files afterdedup
without any orphan reads.Initially I used
--unmapped-reads use
to keep all reads, but after reading on this #516, #519 and #465 I understood I had to use--unmapped-reads discard --unpaired-reads discard
. However, I am still getting both orphan and unmapped reads with bitwise flags 65, 129. These cause issues when I try to get proper.fastq
files from the.bam
for a specific analysis. I have checked that the input toumi-tools dedup
does not contain duplicate or orphan reads.My workflow looks like this:
.fastq
file.Is this expected behavior? How can I get proper
.fastq
files fromumi-tools dedup
?