CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
481 stars 190 forks source link

Unmapped and unpaired reads retained even when selected not to. #647

Closed hazmup closed 3 months ago

hazmup commented 4 months ago

Hi, and thank you for the great tool!

I am trying to get .bam files after dedup without any orphan reads.

Initially I used --unmapped-reads use to keep all reads, but after reading on this #516, #519 and #465 I understood I had to use --unmapped-reads discard --unpaired-reads discard. However, I am still getting both orphan and unmapped reads with bitwise flags 65, 129. These cause issues when I try to get proper .fastq files from the .bam for a specific analysis. I have checked that the input to umi-tools dedup does not contain duplicate or orphan reads.

My workflow looks like this:

Is this expected behavior? How can I get proper .fastq files from umi-tools dedup?

hazmup commented 4 months ago

Note for future readers: I was misreading the flags of the improper reads. They were chimeric reads and it seems the intended behavior with the default --chimeric-pairs=use is to remove one of the reads. I am now running with --chimeric-pairs=discard as well.

hazmup commented 4 months ago

I am reopening this issue because including --chimeric-pairs=discard still resulted in a .bam file with orphan reads.

IanSudbery commented 4 months ago

The most likely explanation of this is that you have reads that claim to be paired (and not chimeric) but UMI-tools can't find the mate. When UMI-tools does its output, it makes a decision based on the information available in read1. It then outputs the read immediately, and makes a note to output the mate when it encounters it. It does a second sweep at the end to try and find anything it didn't find the first time through. So, if a read claims to be paired, but the read identified in the mate chr and pos fields isn't in the file, then you'll end up with unpaired reads in the output. The log file should say how many reads this was the case for.

hazmup commented 4 months ago

It seems there are reads with secondary alignments to the same coordinates (so 4 lines in the input bam), and in the deduplicated output there is one of the primary alignments and one of the secondaries left for each pair. I.e. there are quadruplets of pairs with flags 99, 355, 147, 403, and after the deduplication step only the 355, 147 flagged reads remain. There must be a few more orphan reads, which I am trying to fish out that are causing problems.

hazmup commented 3 months ago

I have managed to get valid .fastq files by filtering out the secondary, supplementary and non-properly paired reads (with samtools view -f2 -F2304) from the input .bam files, although I do not know if there is a better/more elegant way of going about it. Thank you again.

IanSudbery commented 3 months ago

Hi Stelios,

I'm not quite sure what the problem is. In your example, you said start with 4 alignments, two from the read1 and two from the read2, and two of these are output. What would you expect to happen instead? It might be that running the results through prepare-for-rsem would help, as this does things like copying a read2 where it is the pair of multiple read1 alignments, and sets the paring information so that read1 and read2 in a pair always point to each other, even if this wasn't the case in the original bam.

hazmup commented 3 months ago

Hi @IanSudbery, I have not yet been able to identify all the reads that are causing trouble. Maybe the example I mentioned was resolved by using samtools sort instead of samtools collate which only takes into account primary alignments. Apologies, I was not sure what the cause of all of my orphan reads was earlier. Now it seems I have at least one problem, in one case a bunch (10) of reads in the input bam that result is three reads remaining after deduplication. Reads in input:

V350189085L1C001R00100098693_TTTGAGGATTAG       355     chr14   25884192        1       73M1746N14M     =       25886015    157     CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A<     AS:i:-3 ZS:i:-3 XN:i:0  XM:i:1  XO:i:0      XG:i:0  NM:i:1  MD:Z:52T34      YS:i:0  YT:Z:CP XS:A:+  NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       355     chr14   25884192        1       73M1746N14M     =       26165399    279541  CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A<     AS:i:-3 ZS:i:-3 XN:i:0  XM:i:1  XO:i:0      XG:i:0  NM:i:1  MD:Z:52T34      YS:i:0  YT:Z:CP XS:A:+  NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       403     chr14   25886015        1       80M     =       25884192   -157     ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC        A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8    AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0      MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       355     chr14   26023962        1       73M1746N14M     =       26165399    139771  CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A<     AS:i:-3 ZS:i:-3 XN:i:0  XM:i:1  XO:i:0      XG:i:0  NM:i:1  MD:Z:52T34      YS:i:0  YT:Z:CP XS:A:+  NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       355     chr14   26023962        1       73M1746N14M     =       26025785    157     CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A<     AS:i:-3 ZS:i:-3 XN:i:0  XM:i:1  XO:i:0      XG:i:0  NM:i:1  MD:Z:52T34      YS:i:0  YT:Z:CP XS:A:+  NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       403     chr14   26025785        1       80M     =       26023962   -157     ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC        A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8    AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0      MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       99      chr14   26163576        1       73M1746N14M     =       26165399    157     CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A<     AS:i:-3 ZS:i:-3 XN:i:0  XM:i:1  XO:i:0      XG:i:0  NM:i:1  MD:Z:52T34      YS:i:0  YT:Z:CP XS:A:+  NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       147     chr14   26165399        1       80M     =       26163576   -157     ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC        A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8    AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0      MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       403     chr14   26165399        1       80M     =       26023962   -139771  ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC        A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8    AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0      MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       403     chr14   26165399        1       80M     =       25884192   -279541  ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC        A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8    AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0      MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5

Output reads:


V350189085L1C001R00100098693_TTTGAGGATTAG       355     chr14   25884192        1       73M1746N14M     =       26165399    279541  CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A<     AS:i:-3 ZS:i:-3 XN:i:0  XM:i:1  XO:i:0      XG:i:0  NM:i:1  MD:Z:52T34      YS:i:0  YT:Z:CP XS:A:+  NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       355     chr14   26023962        1       73M1746N14M     =       26165399    139771  CGAGCTTGACCTGCTGGATATCCGAGCAGAGTATAAGCGGATGTACGGCAAGACACTGTACCACGATATCACGGGAGATACTTCTGG F3B;FCD;=GGCEGCCD:CADCHC8/HDCDD>ECC<FBD;4BDCBA;-GB??-BBF9ACAFHEDFDA/@EBF?:=C7+C:DDC88A<     AS:i:-3 ZS:i:-3 XN:i:0  XM:i:1  XO:i:0      XG:i:0  NM:i:1  MD:Z:52T34      YS:i:0  YT:Z:CP XS:A:+  NH:i:5
V350189085L1C001R00100098693_TTTGAGGATTAG       147     chr14   26165399        1       80M     =       26163576   -157     ATACTTCTGGAGACTATCGAAAGATTCTACTGAAGATCTGTGGTGGCAACGACTGAGCTGTGGCAGTGGCTCACCTGTGC        A?4D,>E>FFDE@<1A@3GBA>EB?A90A6=EABD<@@DG;FF3FGEC@?G@E>GBFD?G7HF?DGEFFF=9>0C>FCC8    AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0      MD:Z:80 YS:i:-3 YT:Z:CP NH:i:5
IanSudbery commented 3 months ago

You will see that the first two reads both point to the third read as their mate. In SAM specification it says that a read should always point to the primary alignment of its mates, even if that read does not point to it. In theory both primary and supplementary alignments of read1 should point to the primary alignment of read2 and vica versa. However, only have 50% of aligners follow this rule.

The case you post above would be solved by running prepare-for-rsem - the third read would be duplicated so that one copy pointed to the first read, and one to the second, and the files sorted so that it always went read1, read2, read1, read2 etc.

hazmup commented 3 months ago

I see, thank you again, I will check some more to see if this fixes all of my problematic reads and get back if needed. I am marking it closed for now if that's ok.