Closed kvn95ss closed 7 months ago
Hay,
I extracted the UMIs using the previous command, then performed adapter trimming with trim_galore. I wanted to remove polyA so I enabled the option --polyA
I tried using dedup, but was met with this error -
AssertionError: not all umis are the same length(!): 7 - 23
However, when I check the read headers with samtools, I can see that the UMIs are not variable in length. For example, here are a few lines from the bam file.
42:A:A00621:301:HTYHLDRXX:1:2103:8377:16031_AAGCATCA_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:42 0 chr1 629338 3 3S37M * 0 0 GGGCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:2 HI:i:1 AS:i:36 nM:i:0
42:A:A00621:301:HTYHLDRXX:1:2101:26892:28401_GATAACAA_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:42 0 chr1 631071 3 3S36M1S * 0 0 GGGCTGATGTTCGCCGACCGTTGACTATTCTCTACAAACC F:FFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFF NH:i:2 HI:i:1 AS:i:35 nM:i:0
26:A:A00621:301:HTYHLDRXX:1:2163:7545:6386_GCTCGGAA_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:26 0 chr1 631071 3 3S52M1S * 0 0 GGGCTGATGTTCGCCGACCGTTGACTATTCTCTACAAACCACAAAGACATTGGAAC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF NH:i:2 HI:i:1 AS:i:51 nM:i:0
40:A:A00621:301:HTYHLDRXX:1:2231:12147:35368_ACGGCCAT_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:40 0 chr1 634376 3 3S38M1S * 0 0 GGGATGACCCACCAATCACATGCCTATCATATAGTAAAACCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF NH:i:2 HI:i:1 AS:i:37 nM:i:0
27:A:A00621:301:HTYHLDRXX:1:2163:12147:1736_AGGGCATG_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:27 256 chr1 1869055 0 54M1S * 0 0 GGGCACCTGTAGTCCCAGCTACCTGGGAGGCTGAGGCAGGAGAATGGCATGAACC FFFFFFF,F,F:FFFFFFFFFFFFF::FFFFFF::F:FF,FFFFFF:FFF:FFFF NH:i:6 HI:i:2 AS:i:53 nM:i:0
18:A:A00621:301:HTYHLDRXX:1:2136:3531:32878_AGTGGAAG_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:18 0 chr1 1897607 255 2S62M * 0 0 GGGGGAGGCAGAGGTTGCGGTGAGCTGACATTGCGCTATTGCACTCCAGCCTGGGCAACAAGAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:61 nM:i:0
15:A:A00621:301:HTYHLDRXX:1:2154:3378:35055_AGTGGAAG_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:15 0 chr1 1897607 255 2S65M * 0 0 GGGGGAGGCAGAGGTTGCGGTGAGCTGACATTGCGCTATTGCACTCCAGCCTGGGCAACAAGAGTGA :FFFFFFFFF:FFF:FFFFFFFFFFFFFFFFF:F,FFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:63 nM:i:0
36:A:A00621:301:HTYHLDRXX:1:2107:22354:3161_CCGTCCCC_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:36 272 chr1 2579262 0 5S33M8S * 0 0 TCAGCAGAATAAGATTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCC F,::,,,,FFF,,,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:5 HI:i:5 AS:i:32 nM:i:0
I tried to check all the reads in the bam file with the following commands - samtools view aligned.bam|cut -f1 |cut -f2 -d'_'|less
, essentially using _
as a delimiter and selecting the second column. All the UMIs I could see were 8 bp in length, not variable like the error indicates. Any suggestions?
UMI-tools requires the UMI situated after the last _
in first field of the rename.
I think your readnames are starting off as:
42:A:A00621:301:HTYHLDRXX:1:2103:8377:16031_AAGCATCA 1:N:0:CTCTCTAC+CTCTCTAT
etc
That is, there are two fields, sperated by a space and the UMI is in the correct location at the end of the first. Either trimgalore or your aligner is combining the two fields to give you:
42:A:A00621:301:HTYHLDRXX:1:2103:8377:16031_AAGCATCA_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:42
UMI-tools would read the UMI of this read at PolyA:42
.
IS it possible to run trimmgalore before running umi-tools extract?
Hello,
In our previous run we ran trim_galore first, then umi-tools extract. However, there were many empty reads (Possibly due to the reads being trimmed to UMIs+polyA/G), which caused issues during alignment.
So here's a little break down -
Method 1 Trim_galore with polyA UMI_extract STAR alignment UMI_dedup
Method 2 UMI_extract Trim_galore with polyA STAR alignment UMI_dedup <- error
I want to know if Method 1 would be equivalent to method 2, if yes then we will continue analysis with the data. In nf-core/rna-seq
the trimming happens after UMI extraction, so I thought to model the pipeline in a similar fashion.
Hi @kvn95ss. Sorry for the very slow reply. From your edited original comment, I understand you have resolved this issue now. Is that correct?
This looks very peculiar. I'm processing the data in Rackham server, so the UMI_tools version is 1.0.0. The data was generated via SmartSeq-3 protocol - https://teichlab.github.io/scg_lib_structs/methods_html/SMART-seq_family.html#SMART-seq3, so the barcode is ATTGCGCAATG[8bp UMI] so I used this string pattern -
XXXXXXXXXXXNNNNNNNN
During my last run almost a month ago, I processed some samples with the below command -
umi_tools extract --stdin=input.fastq.gz --bc-pattern=XXXXXXXXXXXNNNNNNNN --log input.log --stdout=input_umi.fastq.gz
However, the last time I ran this, the input data had its adapter trimmed. Could this be the reason the pattern didn't work?
EDIT - I tried this regex pattern
--bc-pattern="(?P<discard_1>ATTGCGCAATG)(?P<umi_1>.{8})"
and was able to extract UMIs. My only guess why it didn't work before could be because of adapter trimming? As that was the only difference between the inputs.