CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
472 stars 188 forks source link

UMIs lost after STAR alignment #617

Closed asadatae closed 4 months ago

asadatae commented 7 months ago

Hi,

Thank you so much for all the resources and help provided in this community!

My PE150 bulk RNAseq read2 has an 8nt UMI at each fragment start. I first extract the UMIs from read2 using umi_tools with the following parameter:

--bc-pattern=NNNNNNNNXXX

Then I trim adapters from both read1 and read2 with trim_galore using auto detection and remove orphan reads before STAR alignment. Each intermediate file retains the 8nt UMI in the name of the fragment like so:

@A00738:734:HHVHMDSX7:1:1101:1217:1000_CACACATA 2:N:0:ATATGGATAT+TAATACAGGT ACCGGGCGACTCTTAGCGGTGGATCACTCGGCTCGTGCGTCGATGAAGAACGCAGCTAGCTGCGAGAATTAATGTGAATTGCAGGACACATTGATCATCGACACTTCGAACGCACTTGCGGCCCCGGG + FFFFFFFFFFFF:FFFFF:FFFF:FFFFFFFFFFFFFFFFFF,,F:FFF,FFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

But when I inspect the STAR output BAM file, the UMIs are lost:

A00738:734:HHVHMDSX7:1:1128:18285:10394 163 chr1 10001 0 26S109M6S = 10022 227 ACCGGCCCTATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAGATAGG FFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:F,:FFF:FFFF::FFFFF::FFF,FFFFF,FF:FFF:FF,::FF,FFFFF:F:,:FFFF,:::F:,,FF::,,:F NH:i:8 HI:i:1 AS:i:234 nM:i:1

So when I perform umi_tools dedup, it outputs this error, which I suspect is because the UMIs are not there in the BAM file?: AssertionError: not all umis are the same length(!): 38 - 39

Here are my STAR paremeters:

STAR --runThreadN 8 --genomeDir hg38_star_index --readFilesIn read1 read2 --outFileNamePrefix sample1 --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --sjdbGTFfile /gencode.v44.chr_patch_hapl_scaff.annotation.gtf

What should I change to retain the UMIs in the BAM alignment file? Any insight would be highly appreciated!

Here's the instructions from the library prep kit I'm using:

Sequencing Analysis Guidelines • Read 1 matches the antisense sequence of the input RNA. • If you are performing paired-end sequencing, Read 2 will correspond to the sense strand. • First eight cycles of Read 2 belong to UMIs followed by 3 nucleotides of UMI-linker and 3 nucleotides derived from the kit • Trim 8 nt UMIs + 3 nt UMI linker + 3 nt from Pico v3 SMART UMI Adapter from Read2 prior to mapping.

IanSudbery commented 7 months ago

What is your command line for trimming the UMI?

asadatae commented 7 months ago

This is what I use:

umi_tools extract --bc-pattern=NNNNNNNNXXX, --stdin fastq2, --stdout umi_extracted_fastq2

IanSudbery commented 7 months ago

When you extract the UMIs from your read2s you need to add those UMIs to the read 1 header, as well as the read2 header:

umi_tools extract --bc-pattern=NNNNNNNNXXX  --stdin fastq2 --stdout umi_extracted_fastq2 --read2in=fastq1 --read2out=umi_extracted_fastq1
asadatae commented 7 months ago

Thank you! That did the trick!