fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
315 stars 67 forks source link

GroupReadsByUmi [Paired strategy used but umi did not contain 2 segments] #570

Closed ahwanpandey closed 4 years ago

ahwanpandey commented 4 years ago

Hello,

I wonder why I am getting this error when using "paired"? If try "adjacency" it works.

Thanks.

java -jar -Xmx1G $FGBIO_JAR GroupReadsByUmi --input=$PROJECT_DIR/Sample_$SAMPLE_NAME/Alignments/$SAMPLE_NAME.mapped.all.bam --output=$PROJECT_DIR/Sample_$SAMPLE_NAME/Alignments/$SAMPLE_NAME.mapped.all.grouped.bam --strategy=paired --edits=0 --min-map-q=20
[2020/03/11 10:43:43 | FgBioMain | Info] Executing GroupReadsByUmi from fgbio version 1.1.0 as apandey@papr-res-compute301.unix.petermac.org.au on JRE 1.8.0_144-b01 with snappy, IntelInflater, and IntelDeflater
[2020/03/11 10:43:43 | GroupReadsByUmi | Info] Filtering and sorting input.
[2020/03/11 10:43:50 | GroupReadsByUmi | Info] Sorted     1,000,000 records.  Elapsed time: 00:00:06s.  Time for last 1,000,000:    6s.  Last read position: 2:217,279,643
[2020/03/11 10:43:55 | GroupReadsByUmi | Info] Sorted     2,000,000 records.  Elapsed time: 00:00:11s.  Time for last 1,000,000:    4s.  Last read position: 7:68,679,693
[2020/03/11 10:44:00 | GroupReadsByUmi | Info] Sorted     3,000,000 records.  Elapsed time: 00:00:16s.  Time for last 1,000,000:    4s.  Last read position: 7:148,514,515
[2020/03/11 10:44:04 | GroupReadsByUmi | Info] Sorted     4,000,000 records.  Elapsed time: 00:00:21s.  Time for last 1,000,000:    4s.  Last read position: 8:48,694,558
[2020/03/11 10:44:10 | GroupReadsByUmi | Info] Sorted     5,000,000 records.  Elapsed time: 00:00:26s.  Time for last 1,000,000:    5s.  Last read position: 11:94,208,683
[2020/03/11 10:44:15 | GroupReadsByUmi | Info] Sorted     6,000,000 records.  Elapsed time: 00:00:31s.  Time for last 1,000,000:    4s.  Last read position: 12:66,703,544
[2020/03/11 10:44:20 | GroupReadsByUmi | Info] Sorted     7,000,000 records.  Elapsed time: 00:00:36s.  Time for last 1,000,000:    4s.  Last read position: 15:91,328,205
[2020/03/11 10:44:24 | GroupReadsByUmi | Info] Sorted     8,000,000 records.  Elapsed time: 00:00:41s.  Time for last 1,000,000:    4s.  Last read position: 17:59,885,819
[2020/03/11 10:44:28 | GroupReadsByUmi | Info] Accepted 8,856,844 reads for grouping.
[2020/03/11 10:44:28 | GroupReadsByUmi | Info] Filtered out 685,656 reads due to mapping issues.
[2020/03/11 10:44:28 | GroupReadsByUmi | Info] Filtered out 9,750 reads that contained one or more Ns in their UMIs.
[2020/03/11 10:44:28 | GroupReadsByUmi | Info] Assigning reads to UMIs and outputting.
[2020/03/11 10:44:29 | FgBioMain | Info] GroupReadsByUmi failed. Elapsed time: 0.81 minutes.
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Paired strategy used but umi did not contain 2 segments: TTCGCGGGGG
        at scala.Predef$.require(Predef.scala:340)
        at com.fulcrumgenomics.umi.GroupReadsByUmi.umiForRead(GroupReadsByUmi.scala:591)
        at com.fulcrumgenomics.umi.GroupReadsByUmi.$anonfun$assignUmiGroups$1(GroupReadsByUmi.scala:543)
        at scala.collection.immutable.List.map(List.scala:222)
        at scala.collection.immutable.List.map(List.scala:82)
        at com.fulcrumgenomics.umi.GroupReadsByUmi.assignUmiGroups(GroupReadsByUmi.scala:543)
        at com.fulcrumgenomics.umi.GroupReadsByUmi.execute(GroupReadsByUmi.scala:500)
        at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:148)
        at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:124)
        at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:88)
        at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)
tfenne commented 4 years ago

The paired method is intended for duplex sequencing applications where the two ends of the molecule (seen in R1 and R2 respectively) each contain a UMI and the UMIs are applied in such a way during library construction that the top and bottom strand of the same molecule end up with the same UMIs (albeit inverted between R1 and R2). From the usage documentation for GroupReadsByUmi:

4. paired: similar to adjacency but for methods that produce template with a pair of UMIs such that a read with A-B
     is related to but not identical to a read with B-A. Expects the pair of UMIs to be stored in a single tag, separated
     by a hyphen (e.g. 'ACGT-CCGG'). The molecular IDs produced have more structure than for single UMI strategies, and
     are of the form '{base}/{AB|BA}'. E.g. two UMI pairs would be mapped as follows AAAA-GGGG -> 1/AB, GGGG-AAAA -> 1/BA.

I.e. it expects reads to have UMIs of two equals parts separated by a hypen., which it would appear your UMIs do not.

karlbeutner commented 1 year ago

Hello, I stumbled onto the same issue described by @ahwanpandey. In my application, I am only interested in a very particular region of the genome (less than 1 kb), so I start my pipeline by subsetting to that region. I do this with a simple call to samtools view but that tools is not aware of read pairs. What ends up happening is that reads on the edges of my region of interest sometimes have mates that are outside of my region and therefore get excluded from my subsetted bam file. These missing mates that are outside of my region of interest end up triggering the error in fgbio.

Other tools deal with this issue by including a command line option to ignore reads with missing mates (e.g. FixMateInformation --IGNORE_MISSING_MATES ). Could a similar option be appropriate for the GroupReadsByUmi command in fgbio?

For now I need to go in and manually filter out the reads on the edge of my region where the mate is missing (or filter in the missing mates even though there are outside of my region). Not a huge deal, but also not as easy as it could be.

EDIT 00: alternatively, I can just use the -P option in samtools view to make it "mate-aware". Doing so solved my particular issue.

EDIT 01: I also had to use the -F 2048 flag, otherwise GroupReadsByUmi would sometimes complain, java.lang.IllegalStateException: <readName> did not have a primary R1 record.. Also, of note, adding the -P option is much, much slower (especially when working on a remote bam file).

CONCLUSION: It would be great if GroupReadsByUmi had something like --IGNORE_MISSING_MATES. It could be set to false, by default (of course) to preserve backwards compatibility.