genetronhealth / uvc

UVC, a very accurate small-variant caller (https://doi.org/10.1093/bib/bbab458)
BSD 3-Clause "New" or "Revised" License

Preprocessing BAMS to get originalName#UMI format #11

Open BrettLiddell opened 6 months ago

BrettLiddell commented 6 months ago

Hello,

To get my reads into the originalName#UMI format in a BAM file, I am running:

1) picard FastqToSam
2) fgbio ExtractUmisFromBam (to get reads in originalName#UMI format)
3) picard SamToFastq
4) bwa (for alignment)

However, when I run fgbio ExtractUmisFromBam with --annotate-read-names set to true, the UMI is appended to the QNAME with a + instead of a #. I can reformat the BAMs myself, but are there alternative pre-processing steps or tools that should be used before running UVC?
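For reference, the manual reformatting mentioned above could look roughly like the sketch below. It works on SAM text lines and swaps the last "+" in the QNAME for "#". The function name `rename_umi_separator` and the sample read are made up for illustration, and the sketch assumes a single (non-duplex) UMI that does not itself contain "+".

```python
def rename_umi_separator(sam_line: str, old_sep: str = "+", new_sep: str = "#") -> str:
    """Replace the last old_sep in the QNAME (first SAM column) with new_sep.

    Assumption: the UMI is the text after the last old_sep in the read name
    and does not contain old_sep itself (i.e. not a duplex UMI joined by "+").
    """
    # SAM header lines start with "@" and carry no read name; pass them through.
    if sam_line.startswith("@"):
        return sam_line
    # QNAME is the first tab-separated field of an alignment line.
    qname, tab, rest = sam_line.partition("\t")
    head, found, umi = qname.rpartition(old_sep)
    if not found:
        # No separator in the read name: leave the line untouched.
        return sam_line
    return f"{head}{new_sep}{umi}{tab}{rest}"


# Hypothetical alignment line, truncated to a few columns for brevity.
sample = "read1+ACGTAC\t0\tchr1\t100\n"
converted = rename_umi_separator(sample)
```

The function could be applied to each line of SAM text (for example, the output of `samtools view -h`) before converting back to BAM.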

genetronhealth commented 6 months ago

Hi @BrettLiddell

I added a new command-line option that implements your feature request. With the --dedup-barcode-begin-char option, you can specify the character that marks the beginning of the UMI sequence in read names. For example, in your case, "--dedup-barcode-begin-char +" should do the job.

Please be aware that, by default, "+" is the character that separates the two parts of a duplex UMI, so --dedup-barcode-duplex-sep-char should be set accordingly if duplex UMIs are used.

BrettLiddell commented 6 months ago

Thank you for implementing that as a feature! I'll try it out soon.

genetronhealth commented 6 months ago

Alright, if you have any other questions, please let me know.

genetronhealth commented 6 months ago

Hi @BrettLiddell

Do you have any other questions about this issue? If not, I will close it soon.