fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
309 stars 67 forks

CopyUmiFromReadName should support reverse-complemented UMIs prefixed by "r" #957

Open msto opened 7 months ago

msto commented 7 months ago

Problem

Some UMIs produced by BCL Convert are prefixed with "r", indicating "reverse complement."^1

When sequencing a run with Unique Molecular Identifier (UMI) situated in the index2 (i5) on a NextSeq1000/2000 instrument, BCL Convert will put a leading "r" in front of the reverse-complemented UMI in the FASTQ header.

CopyUmiFromReadName enforces that the UMI sequence contains only valid bases (A/C/G/T/N^2) or a delimiter between multiple UMIs (+ or -). UMIs prefixed with "r" fail this validation.
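To make the failure concrete, here is a minimal sketch in Python (not fgbio's actual Scala code). It assumes the validation rule is exactly as described above: bases A/C/G/T/N, with `+` or `-` between multiple UMIs. The names `VALID_UMI`, `reverse_complement`, and `normalize_umi`, and the `revcomp_prefixed` parameter, are hypothetical and used only for illustration.

```python
import re

# Assumption: mirrors the rule described above, i.e. one or more runs of
# A/C/G/T/N separated by a single "+" or "-" delimiter.
VALID_UMI = re.compile(r"^[ACGTN]+(?:[+-][ACGTN]+)*$")

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(bases: str) -> str:
    """Return the reverse complement of a string of A/C/G/T/N bases."""
    return bases.translate(_COMPLEMENT)[::-1]


def normalize_umi(umi: str, revcomp_prefixed: bool = False) -> str:
    """Hypothetical helper: strip a leading "r" from each UMI segment.

    If revcomp_prefixed is True, segments that carried the "r" prefix are
    also reverse-complemented. Raises ValueError if the result still
    contains anything other than valid bases and delimiters.
    """
    # Splitting on a capturing group keeps the delimiters in the output list.
    parts = re.split(r"([+-])", umi)
    normalized = []
    for part in parts:
        if part in ("+", "-"):
            normalized.append(part)
        elif part.startswith("r"):
            bases = part[1:]
            normalized.append(reverse_complement(bases) if revcomp_prefixed else bases)
        else:
            normalized.append(part)
    result = "".join(normalized)
    if VALID_UMI.match(result) is None:
        raise ValueError(f"Invalid UMI after normalization: {result!r}")
    return result
```

With this sketch, a UMI such as `ACGT+rTTGA` fails the plain pattern check because of the "r", but passes once the prefix is stripped. Whether a tool should only strip the prefix or also reverse-complement the bases is a design choice, which is why it is a flag here.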

Proposed solution

I think it would be sensible to add the following features to CopyUmiFromReadName: