ahwanpandey closed this issue 4 years ago
The paired method is intended for duplex sequencing applications where the two ends of the molecule (seen in R1 and R2 respectively) each contain a UMI, and the UMIs are applied during library construction in such a way that the top and bottom strands of the same molecule end up with the same UMIs (albeit inverted between R1 and R2). From the usage documentation for GroupReadsByUmi:
4. paired: similar to adjacency but for methods that produce template with a pair of UMIs such that a read with A-B
is related to but not identical to a read with B-A. Expects the pair of UMIs to be stored in a single tag, separated
by a hyphen (e.g. 'ACGT-CCGG'). The molecular IDs produced have more structure than for single UMI strategies, and
are of the form '{base}/{AB|BA}'. E.g. two UMI pairs would be mapped as follows AAAA-GGGG -> 1/AB, GGGG-AAAA -> 1/BA.
I.e. it expects reads to have UMIs made of two equal parts separated by a hyphen, which it would appear your UMIs do not have.
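To make the expected tag format concrete, here is a minimal sketch in Python of how one might check UMIs before running the paired strategy. The helper names (is_paired_umi, canonical_pair, strand_label) are hypothetical and are not part of fgbio. The lexical ordering used to pick AB versus BA is only an illustrative convention, not a description of how fgbio assigns strands internally.

```python
import re
from typing import Tuple

# Hypothetical helpers for illustration only; this is not fgbio code.
# A paired UMI is assumed to be two non-empty base strings joined by one hyphen.
_PAIRED_UMI = re.compile(r"[ACGTN]+-[ACGTN]+")


def is_paired_umi(umi: str) -> bool:
    """Return True if the UMI looks like 'ACGT-CCGG' (two parts, one hyphen)."""
    return _PAIRED_UMI.fullmatch(umi.upper()) is not None


def canonical_pair(umi: str) -> Tuple[str, str]:
    """Return the two halves in a fixed order so that A-B and B-A compare equal."""
    a, b = umi.upper().split("-")
    return (a, b) if a <= b else (b, a)


def strand_label(umi: str) -> str:
    """Label the orientation relative to the canonical order (assumed convention)."""
    a, b = umi.upper().split("-")
    return "AB" if a <= b else "BA"


if __name__ == "__main__":
    for umi in ["AAAA-GGGG", "GGGG-AAAA", "ACGTCCGG"]:
        if is_paired_umi(umi):
            print(umi, canonical_pair(umi), strand_label(umi))
        else:
            print(umi, "is not a hyphen-separated UMI pair")
```

Running a check like this over the UMI tag values of a few reads should quickly show whether the paired strategy is applicable to the data, or whether a single-UMI strategy such as adjacency is the better fit.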
Hello, I stumbled onto the same issue described by @ahwanpandey. In my application, I am only interested in a very particular region of the genome (less than 1 kb), so I start my pipeline by subsetting to that region. I do this with a simple call to samtools view, but that tool is not aware of read pairs. As a result, reads on the edges of my region of interest sometimes have mates that fall outside the region and are therefore excluded from my subsetted BAM file. These missing mates end up triggering the error in fgbio.
Other tools deal with this issue by including a command line option to ignore reads with missing mates (e.g. FixMateInformation --IGNORE_MISSING_MATES). Could a similar option be appropriate for the GroupReadsByUmi command in fgbio?
For now I need to go in and manually filter out the reads on the edge of my region where the mate is missing (or filter in the missing mates even though they are outside of my region). Not a huge deal, but also not as easy as it could be.
EDIT 00: alternatively, I can just use the -P option in samtools view to make it "mate-aware". Doing so solved my particular issue.
EDIT 01: I also had to use the -F 2048 flag (exclude supplementary alignments); otherwise GroupReadsByUmi would sometimes complain with java.lang.IllegalStateException: <readName> did not have a primary R1 record. Also of note, adding the -P option is much, much slower (especially when working on a remote BAM file).
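For anyone who prefers to do the filtering explicitly, the following is a rough Python sketch of the manual workaround described above: drop supplementary records and keep only templates for which both a primary R1 and a primary R2 are present. It works on plain (read_name, flag) tuples so that it stays self-contained. The function name keep_complete_pairs is hypothetical, and excluding secondary alignments as well is an extra assumption beyond the -F 2048 filter.

```python
from collections import defaultdict

# Standard SAM flag bits.
READ1 = 0x40
READ2 = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800  # 2048, the bit excluded by "-F 2048"


def keep_complete_pairs(records):
    """Keep primary records whose template has both a primary R1 and R2.

    `records` is an iterable of tuples whose first two fields are
    (read_name, flag). This is a hypothetical helper, not an fgbio or
    samtools function.
    """
    primaries = []
    ends_seen = defaultdict(set)
    for rec in records:
        name, flag = rec[0], rec[1]
        # Skip supplementary records (and, by assumption, secondary ones too).
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        primaries.append(rec)
        if flag & READ1:
            ends_seen[name].add("R1")
        if flag & READ2:
            ends_seen[name].add("R2")
    # Drop any template that lost one of its mates during region subsetting.
    return [rec for rec in primaries if ends_seen[rec[0]] == {"R1", "R2"}]


if __name__ == "__main__":
    example = [
        ("q1", 99),    # primary R1
        ("q1", 147),   # primary R2
        ("q1", 2147),  # supplementary R1, dropped
        ("q2", 99),    # R1 whose mate fell outside the region, dropped
    ]
    print(keep_complete_pairs(example))
```

The same logic could be applied to real alignments with a BAM library, at the cost of holding the reads for the (small) region in memory.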
CONCLUSION: It would be great if GroupReadsByUmi had something like --IGNORE_MISSING_MATES. It could be set to false by default, of course, to preserve backwards compatibility.
Hello,
I wonder why I am getting this error when using "paired"? If I try "adjacency" it works.
Thanks.