CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Adding UMI to a tag on both reads in a bam file #441

Closed SPPearce closed 4 years ago

SPPearce commented 4 years ago

Hi UMI-tools team,

I'm trying to call molecular consensus reads with the fgbio toolkit from bam files where the UMI is on the read name. fgbio expects the UMI to be given in the RX tag of each read, which I am able to add using umi_tools group --umi-group-tag "RX". However, this only puts the RX tag on read 1 (R1) of each pair, not on read 2 (R2), and fgbio GroupReadsByUmi still fails.

Is there a way to add the tag to both reads, rather than just the R1?

Thanks, Simon

IanSudbery commented 4 years ago

One solution would be to sort the reads by name at the end of the process and transfer the RX tag from the read1 to the read2. The problem comes when you have two read1s pointing to a single read2, and you might have read2s that are not pointed to by any read1. Perhaps a mode or a tool that did this, but required non-multimapped, primary alignments only?
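A minimal sketch of that idea, not part of UMI-tools: it works on plain-text SAM lines rather than a bam file, and the names `get_tag` and `copy_tag_to_mates` are hypothetical. It assumes mates share a read name and skips secondary and supplementary alignments, so each read name maps to at most one tagged primary read. It makes two passes over the records held in memory instead of relying on a name sort.

```python
# SAM flag bits for alignments that are not the primary record of a read.
SECONDARY = 0x100
SUPPLEMENTARY = 0x800
NOT_PRIMARY = SECONDARY | SUPPLEMENTARY


def get_tag(fields, tag):
    """Return the value of a string tag (e.g. RX:Z:ACGT), or None if absent."""
    prefix = tag + ":Z:"
    for field in fields[11:]:  # optional fields start at column 12
        if field.startswith(prefix):
            return field[len(prefix):]
    return None


def copy_tag_to_mates(sam_lines, tag="RX"):
    """Copy `tag` from the tagged primary read to its untagged primary mate."""
    lines = [line.rstrip("\n") for line in sam_lines]

    # Pass 1: remember the tag value seen on primary alignments, per read name.
    tag_by_name = {}
    for line in lines:
        if line.startswith("@"):
            continue
        fields = line.split("\t")
        if int(fields[1]) & NOT_PRIMARY:
            continue
        value = get_tag(fields, tag)
        if value is not None:
            tag_by_name[fields[0]] = value

    # Pass 2: append the tag to primary alignments that lack it.
    out = []
    for line in lines:
        if line.startswith("@"):
            out.append(line)  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        is_primary = not (int(fields[1]) & NOT_PRIMARY)
        value = tag_by_name.get(fields[0])
        if is_primary and value is not None and get_tag(fields, tag) is None:
            fields.append(f"{tag}:Z:{value}")
        out.append("\t".join(fields))
    return out
```

Read 2s whose read 1 carries no tag are left untouched, which matches the caveat above about read 2s that no read 1 points to.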

SPPearce commented 4 years ago

Yes, although I couldn't find a tool that actually manipulated tags directly, short of manually doing it read by read in pysam etc.

I have however found a solution (for my purposes at least). bwa mem has an option -C to take any "comments" from the fastq header and append them to the reads in the aligned sam file. So I'm now using sed to move the UMI from the end of the read name to a "comment" after it (e.g. zcat ${R1FASTQ} | sed "s/_\([ACGTN]*\)/ RX:Z:\\1/g"), so that it is a valid sam tag once it is copied into the sam file. This appears to be working for me at the moment.
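For illustration, the same header rewrite can be sketched in Python. This is a hedged variant rather than an exact port of the sed command: it only touches the header line of each four-line FASTQ record and only matches a non-empty UMI at the very end of the header, so underscores elsewhere in the read name or in the quality string are left alone. The function name `umi_to_comment` is hypothetical, and the sketch assumes the header holds nothing but the read name with its `_UMI` suffix.

```python
import re

# A non-empty UMI suffix at the end of the header line, e.g. "_ACGTN".
UMI_SUFFIX = re.compile(r"_([ACGTN]+)$")


def umi_to_comment(fastq_lines):
    """Move a trailing _UMI on each FASTQ header into an 'RX:Z:UMI' comment."""
    out = []
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:  # header is the first line of every 4-line record
            line = UMI_SUFFIX.sub(r" RX:Z:\1", line)
        out.append(line)
    return out
```

With this, a header such as `@read1_ACGTN` becomes `@read1 RX:Z:ACGTN`, which is the form that bwa mem -C copies into the alignment record.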

IanSudbery commented 4 years ago

Yes, I think this would have to be done read by read with pysam (any tool you used would only be doing the same). Of course your solution works fine for adding the uncorrected UMI as a tag, but not a corrected one.

Under normal circumstances I'd be happy to knock something together for this, but I'm currently completely snowed under with teaching.

SPPearce commented 4 years ago

Sure, that is perfectly understandable. The fgbio toolkit has options to do the correction of the UMIs, so I'll use that for now.

TomSmithCGAT commented 4 years ago

@SPPearce - Can we close this issue?

SPPearce commented 4 years ago

Hi Tom,

Yes, you can close this. Thanks, Simon