fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
311 stars 67 forks source link

adding RX tag to bam files, if no fastq available #915

Closed LeilyR closed 1 year ago

LeilyR commented 1 year ago

Hi, I would like to add RX flag to my bam files and I have no access to fastq files, therefore cannot use AnnotateBamWithUmis. Do you have another tool which could take care of that. Otherwise I will add them manually or making fastq from bam to be able to use the annotate tool.

Here is an example of a record in my bam file: A00187:605:H5LWFDSX2:1:1101:1000:19132:UMI_AGTCT_TGTCT 1171 11 78380342 60 145M = 78380199 -288 GCTCCATGGCAAAGAGGTGTCCTTGCAAGTCGTAGTAGAGGGAGGTGATCTCAGAGCTGGAGTGGTTGTACAGGTGGGTGACCTTGGTGGGGTTGGTCAGGTCTGCATAGAAGAACTGCAGGTGGTGGCTGTGGCTGCTCTTGCT FFFFF:::FFFF:F,FF:FFFFF,FFF,FFFFFFFFFFFFF:FFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:F:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF MC:Z:145M MD:Z:145 PG:Z:MarkDuplicates RG:Z:TUMOR NM:i:0 MQ:i:60 AS:i:145 XS:i:0

Thanks a lot

nh13 commented 1 year ago

@LeilyR are the UMIs in the read name? For example, there are two UMIs for the above read AGTCT and TGTCT so the output RX tag you want is RX:Z:AGTCT-TGTCT.

I think you could use CopyUmiFromReadName but your output RX tag would be RX:Z: UMI_AGTCT_TGTCT. See the following from the docs linked above:

The read name is split on : characters with the last field is assumed to be the UMI sequence. The UMI will be copied to the RX tag as per the SAM specification. If any read does not have a UMI composed of valid bases (ACGTN), the program will report the error and fail.

I think then you'd have to be careful how you want to use the RX tag later. For example, it wont work as input to UMI grouping for duplex consensus reads.

I think you could make a quick custom tool in Python in pysam in not much effort though.

LeilyR commented 1 year ago

Hi Nils, They are indeed in the name. So if I got you right, I can use CopyUmiFromReadName first to make RX tags and then use some homemade script to replace the with - and get rid of UMI . Then I could use it afterwards to run groupping tool right?

nh13 commented 1 year ago

If it was me, I would just have the script do it, since setting tags and read names on with pysam is pretty easy to do it in one step.

LeilyR commented 1 year ago

You are right, thanks a lot for the great tip!