CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
472 stars 188 forks source link

umi_tools not "extracting" any barcodes, while the same command had worked before #612

Closed kvn95ss closed 2 months ago

kvn95ss commented 7 months ago

This looks very peculiar. I'm processing the data in Rackham server, so the UMI_tools version is 1.0.0. The data was generated via SmartSeq-3 protocol - https://teichlab.github.io/scg_lib_structs/methods_html/SMART-seq_family.html#SMART-seq3, so the barcode is ATTGCGCAATG[8bp UMI] so I used this string pattern - XXXXXXXXXXXNNNNNNNN

During my last run almost a month ago, I processed some samples with the below command - umi_tools extract --stdin=input.fastq.gz --bc-pattern=XXXXXXXXXXXNNNNNNNN --log input.log --stdout=input_umi.fastq.gz

However, the last time I ran this, the input data had its adapter trimmed. Could this be the reason the pattern didn't work?

EDIT - I tried this regex pattern --bc-pattern="(?P<discard_1>ATTGCGCAATG)(?P<umi_1>.{8})" and was able to extract UMIs. My only guess why it didn't work before could be because of adapter trimming? As that was the only difference between the inputs.

kvn95ss commented 7 months ago

Hay,

I extracted the UMIs using the previous command, then performed adapter trimming with trim_galore. I wanted to remove polyA so I enabled the option --polyA

I tried using dedup, but was met with this error -

AssertionError: not all umis are the same length(!): 7 - 23

However, when I check the read headers with samtools, I can see that the UMIs are not variable in length. For example, here are a few lines from the bam file.

42:A:A00621:301:HTYHLDRXX:1:2103:8377:16031_AAGCATCA_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:42   0       chr1    629338  3       3S37M   *       0       0       GGGCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCAT          FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF        NH:i:2  HI:i:1  AS:i:36 nM:i:0
42:A:A00621:301:HTYHLDRXX:1:2101:26892:28401_GATAACAA_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:42  0       chr1    631071  3       3S36M1S *       0       0       GGGCTGATGTTCGCCGACCGTTGACTATTCTCTACAAACC          F:FFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFF        NH:i:2  HI:i:1  AS:i:35 nM:i:0
26:A:A00621:301:HTYHLDRXX:1:2163:7545:6386_GCTCGGAA_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:26    0       chr1    631071  3       3S52M1S *       0       0       GGGCTGATGTTCGCCGACCGTTGACTATTCTCTACAAACCACAAAGACATTGGAAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF        NH:i:2  HI:i:1  AS:i:51 nM:i:0
40:A:A00621:301:HTYHLDRXX:1:2231:12147:35368_ACGGCCAT_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:40  0       chr1    634376  3       3S38M1S *       0       0       GGGATGACCCACCAATCACATGCCTATCATATAGTAAAACCC        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF      NH:i:2  HI:i:1  AS:i:37 nM:i:0
27:A:A00621:301:HTYHLDRXX:1:2163:12147:1736_AGGGCATG_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:27   256     chr1    1869055 0       54M1S   *       0       0       GGGCACCTGTAGTCCCAGCTACCTGGGAGGCTGAGGCAGGAGAATGGCATGAACC   FFFFFFF,F,F:FFFFFFFFFFFFF::FFFFFF::F:FF,FFFFFF:FFF:FFFF NH:i:6  HI:i:2  AS:i:53 nM:i:0
18:A:A00621:301:HTYHLDRXX:1:2136:3531:32878_AGTGGAAG_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:18   0       chr1    1897607 255     2S62M   *       0       0       GGGGGAGGCAGAGGTTGCGGTGAGCTGACATTGCGCTATTGCACTCCAGCCTGGGCAACAAGAG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF        NH:i:1  HI:i:1  AS:i:61 nM:i:0
15:A:A00621:301:HTYHLDRXX:1:2154:3378:35055_AGTGGAAG_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:15   0       chr1    1897607 255     2S65M   *       0       0       GGGGGAGGCAGAGGTTGCGGTGAGCTGACATTGCGCTATTGCACTCCAGCCTGGGCAACAAGAGTGA       :FFFFFFFFF:FFF:FFFFFFFFFFFFFFFFF:F,FFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFF     NH:i:1  HI:i:1  AS:i:63 nM:i:0
36:A:A00621:301:HTYHLDRXX:1:2107:22354:3161_CCGTCCCC_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:36   272     chr1    2579262 0       5S33M8S *       0       0       TCAGCAGAATAAGATTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCC    F,::,,,,FFF,,,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NH:i:5  HI:i:5  AS:i:32 nM:i:0

I tried to check all the reads in the bam file with the following commands - samtools view aligned.bam|cut -f1 |cut -f2 -d'_'|less, essentially using _ as a delimiter and selecting the second column. All the UMIs I could see were 8 bp in length, not variable like the error indicates. Any suggestions?

IanSudbery commented 7 months ago

UMI-tools requires the UMI situated after the last _ in first field of the rename.

I think your readnames are starting off as:

42:A:A00621:301:HTYHLDRXX:1:2103:8377:16031_AAGCATCA 1:N:0:CTCTCTAC+CTCTCTAT

etc

That is, there are two fields, sperated by a space and the UMI is in the correct location at the end of the first. Either trimgalore or your aligner is combining the two fields to give you:

42:A:A00621:301:HTYHLDRXX:1:2103:8377:16031_AAGCATCA_1:N:0:CTCTCTAC+CTCTCTAT_PolyA:42

UMI-tools would read the UMI of this read at PolyA:42.

IS it possible to run trimmgalore before running umi-tools extract?

kvn95ss commented 7 months ago

Hello,

In our previous run we ran trim_galore first, then umi-tools extract. However, there were many empty reads (Possibly due to the reads being trimmed to UMIs+polyA/G), which caused issues during alignment.

So here's a little break down -

Method 1 Trim_galore with polyA UMI_extract STAR alignment UMI_dedup

Method 2 UMI_extract Trim_galore with polyA STAR alignment UMI_dedup <- error

I want to know if Method 1 would be equivalent to method 2, if yes then we will continue analysis with the data. In nf-core/rna-seq the trimming happens after UMI extraction, so I thought to model the pipeline in a similar fashion.

TomSmithCGAT commented 4 months ago

Hi @kvn95ss. Sorry for the very slow reply. From your edited original comment, I understand you have resolved this issue now. Is that correct?