RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License

Append fastq comment to SAM output like bwa-mem -C option #21

Closed karlkashofer closed 2 years ago

karlkashofer commented 2 years ago

Hi! I need to preprocess my raw reads to extract molecular barcodes. The preprocessing puts them into the FASTQ ID line like this:

@K00336:237:HL2CKBBXY:1:1101:1539:1349 BC:Z:ATGGTAGC+NGATCGAC ZA:Z:NNNNN ZB:Z:NNNNN RX:Z:NNN-NNN QX:Z:$$$ $$$
GAGGCCCTTTGAATGTAATGAATGTGGGAAATCTTTTGGCAGGAAGTAACAACTCATCCTACATACAAGAACACACACTGGNGANANACC
+

Running this FASTQ through Arioc leads to an incorrect SAM record, with the comment left in the read name field:

K00336:237:HL2CKBBXY:1:1102:26545:1666 BC:Z:ATGGTAGC+CGATCGAC ZA:Z:NNNNN ZB:Z:CTAT RX:Z:NNN-CTA QX:Z:$$$ JFJ 83 chr11 110579883 60 90= = 110579757 -216 CATGGTGTCCTTCTTTGTATAGGCTGGGCGGCTGCAAGCCTGCCCTGATGAGGGACCGGGCATTCCGGAAACATGGCTGGCATTGCTAAA JJJJJJJFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:180 NM:i:0 MD:Z:90 YT:Z:CP MQ:i:60 Na:i:1 Nb:i:1 RG:Z:RG c3:i:53 YS:i:182

BWA has the -C option, which its documentation describes as follows:

"Append FASTA/Q comment to SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform the SAM spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output."

Is there any way to make Arioc behave like bwa-mem with the -C option, i.e. append the FASTQ comment to the end of the SAM line?
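The requested behavior can be sketched in a few lines of Python. This is only an illustration of the idea described in the BWA text quoted above (the header is split at the first whitespace and the remainder is appended verbatim as extra SAM fields); it is not code from BWA or Arioc, and the function names `split_defline` and `append_comment` are hypothetical.

```python
from typing import Tuple


def split_defline(defline: str) -> Tuple[str, str]:
    """Split a FASTQ header into (read name, comment) at the first whitespace."""
    text = defline[1:] if defline.startswith("@") else defline
    parts = text.split(None, 1)  # at most one split, on any whitespace run
    name = parts[0]
    comment = parts[1].rstrip() if len(parts) > 1 else ""
    return name, comment


def append_comment(sam_line: str, comment: str) -> str:
    """Append the comment verbatim as trailing tab-separated SAM field(s)."""
    sam_line = sam_line.rstrip("\n")
    return f"{sam_line}\t{comment}" if comment else sam_line


if __name__ == "__main__":
    name, comment = split_defline("@read1 BC:Z:CGTAC")
    print(append_comment(f"{name}\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tJJJJ", comment))
```

Because the comment is copied through unchanged, it only yields valid SAM if it is already formatted as TAG:TYPE:VALUE fields, which is the caveat the BWA text raises.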

RWilton commented 2 years ago

Yes, this looks doable.

Can you please email me directly a FASTQ file containing a few reads whose deflines are formatted with the information you want to copy through to SAM output? Please accompany this with a corresponding SAM file containing exactly the information you expect to see.

karlkashofer commented 2 years ago

Dear Richard!

I successfully tested the new SAMtags feature on Agilent HS XT2 molecular barcoded libraries. Using the AGent Trimmer to generate deflines with the MolBar information, then aligning with Arioc and then deduplicating with AGent Locatit now works. I used this dataIn definition:

<dataIn sequenceType="Q" QNAME="(*)\t*" SAMtags="*\t(*)">
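Read literally, the two patterns seem to split each defline at its first tab: the part before the tab becomes the QNAME, and the part after it is emitted as SAM tags. The Python sketch below mirrors that reading only. It is an interpretation of the attribute values shown above, not Arioc's documented pattern semantics, it assumes the trimmer separates the read name and the tags with tab characters, and `split_tabbed_defline` is a hypothetical helper name.

```python
from typing import List, Tuple


def split_tabbed_defline(defline: str) -> Tuple[str, List[str]]:
    """Return (QNAME, SAM tags) from a defline of the form name<TAB>tag<TAB>tag..."""
    text = defline.rstrip("\r\n")
    if text.startswith("@"):
        text = text[1:]
    # Assumed meaning of QNAME="(*)\t*": keep everything before the first tab.
    qname, _, rest = text.partition("\t")
    # Assumed meaning of SAMtags="*\t(*)": keep everything after the first tab.
    tags = rest.split("\t") if rest else []
    return qname, tags


if __name__ == "__main__":
    example = "@K00336:237:HL2CKBBXY:1:1101:1539:1349\tBC:Z:ATGGTAGC+NGATCGAC\tZA:Z:NNNNN"
    qname, tags = split_tabbed_defline(example)
    print(qname)
    print(tags)
```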

Together with the suppressRaggSQ feature, I now have all the tools in place for a more detailed analysis of Arioc in our exome workflow.

Thank you very much!

RWilton commented 2 years ago

Thank you for confirming that this new functionality does what you expect it to do!