CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks source link

umi in 5 end of both R1 and R2 #477

Closed worker000000 closed 1 year ago

worker000000 commented 3 years ago

Dear professor. Thanks a lot for your powerful tool 1 In paired-end mode, it will ignore the UMI of the second read. so will it affect the accuracy of data, such as false positive variants just in one strand, why not use both umi, is there any inner reason,

<> <> 2 my umi is 5 base umi, it is in the 5 end of reads1 and reads2, the first base of umi is always low quality, so it needs to be removed, the last base of umi is a constant base(which is for T/A ligation)

I tried to use umi_tools extract like this umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz but the header for the mate read in reads1 and reads is like such <> @A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT @A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT <> which is not what I expected, in umi_tools, where C = cell barcode, N = umi, P = plate, X=read sequence, is there any error of my command <> <> image

<> <> 3 I want to know if umi_tools is suitable for umi in both reads, because tools like gencore , UMICollapse can not handle umi in both reads, which may arise false postive variant in one strand?

TomSmithCGAT commented 3 years ago

Hello.

In future, please be clear when you are cross posting. Especially if you dive straight mid way into another issue. @IanSudbery - This is a cross-post from: https://github.com/Daniel-Liu-c0deb0t/UMICollapse/issues/10

As per @Daniel-Liu-c0deb0t's replies, no professors here either...

I think the confusion above stems from the format of the header post extraction. For paired end reads, the barcodes (Cell or UMI) are concatenated. So in the above example, the cell is TT (read1) + CT (read2) and the UMI is ACC (read1) + TAA (read2).

However, I'm not sure that's what you want.You say:

my umi is 5 base umi, it is in the 5 end of reads1 and reads2, the first base of umi is always low quality, so it needs to be removed, the last base of umi is a constant base(which is for T/A ligation)

It sounds like you don't want to treat the first and last bases in the pattern as 'cell barcodes', as you have done, but rather bases to be skipped. In which case, your patterns should be XNNNX. See here for more details on barcode patterns.

IanSudbery commented 3 years ago

UMI-tools should work perfectly fine with your read design, and your command is correct as far as I'm aware. The only thing I would probably do is discard your first and last bases rather than make them the CB. So, both barcodes would be "XNNNX".

What would you expect from your command if it was doing what you thought it would?

TomSmithCGAT commented 3 years ago

+1 to Ian for succinctness. -1 for tardiness :wink:

worker000000 commented 3 years ago

@IanSudbery @TomSmithCGAT Thanks a lot, I am sorry for post in 2 repo. my data is for tumor ngs dna sequence, not single cell after using tools like fastp, I should have remove adapter in my reads, so next step is to extract umi, I want to exttract the first 5 base(or to to remove from reads first), then the inner three base is treated as umi to be added into the header, this is what I want. <>

<> origin reads1 @A00582:632:H7F23DSX2:3:1101:4399:1251 1:N:0:GCAGCTGT+GCTCTAGT TACCTCGCCATCCTGTCCAATGAGCATGGCTCCTACAGGTACACGGAGTTCCTGACGGGCCTGGGCCGGCTCATCGAGCTGAAGGACTGCCAGCCGGACAAGGTGTACCTGGGAGGCCTGGACGTGTGTGGTGAGGACGGCCAGTTCACC

origin reads2 @A00582:632:H7F23DSX2:3:1101:4399:1251 2:N:0:GCAGCTGT+GCTCTAGT CTAATGGGTAGGCGCCAGGCCGTACCTTGCATGATGTCATCGTGCCAGCAGTAGGTGAACTGGCCGTCCTCACCACACACGTCCAGGCCTCCCAGGTACACCTTGTCCGGCTGGCAGTCCTTCAGCTCGATGAGCCGGCCCAGGCCCGTC

# # I tried to use umi_tools extract like this umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz but the header for the mate read in reads1 and reads is like such <> @A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT @A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT

worker000000 commented 3 years ago

I want to know if umi_tools is suitable for umi in both reads, because tools like gencore , UMICollapse can not handle umi in both reads, which may arise false postive variant in one strand? Thanks a lot

IanSudbery commented 3 years ago

Yes UMI-tools can handle UMIs in both reads.

I'm sorry, I still not quite sure what the problem is:

The command you have run HAS taken the 3 base UMI from each read and added it to the read header. image

worker000000 commented 3 years ago

Thanks a lot because X is sequence, whci maybe T(template) like command ExtractUmisFromBam, so I am a little confused abot C and X what is TTCT, I used to treat TTCT as umi from reads1, and base after seperator _ as umi from reads2, when umitools do umi consensus, does it just use base after seperator , does the bases before seperator _ has any useage?

IanSudbery commented 3 years ago

Bases identified by "C" in the barcode are treated as Cell Barcode. Bases identified as N are treated as UMI, bases identified as X are treated as neither and are left on the read. If a cell barcode is not specified the header will be:

barcode = XNNNX
@header
XNNNXATGGATAGAGATACAG

becomes

@header_NNN
XXATGGATAGAGATACAG

With Cs in the pattern, the following happens:

barcode = CNNNC
@header
CNNNCATGGATAGAGATACAG

becomes

@header_CC_NNN
ATGGATAGAGATACAG

I'm only showing one read, but the same logic applies to pairs.

UMI-tools only uses the last part of the header (NNN) in the UMI-deduplication process. The CC will be ignored in deduplication unless UMI-tools is run with the --per-cell flag.

worker000000 commented 3 years ago

Thanks a lot, so both CNNNC or XNNNX works the same for me, am I right?

https://github.com/CGATOxford/UMI-tools/blob/master/doc/QUICK_START.md

1 I have another important question, since error corection is an important part of umi, so function group in umi_tools is a seperated part for this? or it is wrapped in function dedup?

or should do as following extrract , aligin, sort, group, dedup?

2 in the link, you used bowtie, is it a better choice than bowtie2 or bwa for dna sequence?

IanSudbery commented 3 years ago
  1. If you run dedup, you don't need to run group - dedup does everything group does, and then selects one UMI from each UMI group.
  2. The aligner makes no difference to UMI-tools, use whichever works best for your application - bowtie2 and bwa should both work fine.
worker000000 commented 3 years ago

Thanks a lot 1 so command group is just for the stastic of read groups and read identifiers? 2 I am here still want to confrim with you, both CNNNC or XNNNX works the same for me, am I right?

IanSudbery commented 3 years ago
  1. People use the group command when they want to do other things with the reads downstream than just deduplicate them. For example, perhaps they want to calculate consensus sequences of the reads, or look at allelic ratios. The dedup command just outputs one read per UMI group, the group command outputs all reads in the orignal BAM file, with corrected UMI sequences, so that the user can do whatever they want.
  2. XNNNX will return the XX to the ends of your reads. If you use that you will either have to configure your aligner to allow that, or trim them using a trimmer.
worker000000 commented 3 years ago

Thanks a lot, If I do group, and sent the output bam to dedup, will this get unexpected result?

IanSudbery commented 3 years ago

If you send the output of group to dedup, you will get exactly the same result as if you sent the input directly to dedup without using group.

On Wed, 16 Jun 2021 at 17:22, worker000000 @.***> wrote:

Thanks a lot, If I do group, and sent the output bam to dedup, will this get unexpected result?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/477#issuecomment-862520841, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJELDSS4QPGOVWHT2I67UTTTDFUTANCNFSM46ZQGF4A .

worker000000 commented 3 years ago

thanks a lot. have you ever compare single umi and pair end umi result, becasue tools like gencore or umicollapse does not hanle pair-end data, but it is fast enough, umi-tools can do pair-end sequence, but not fast as them, people sometimes come with trade-offs, hard to choose

IanSudbery commented 3 years ago

We've never done any rigorous benchmarking, but it stands to reason that longer UMI sequences = more accuracy. I'm not entirely sure what the reasoning is for one 3bp UMI on each read being different from a 6nt UMI on just one of them though.

BTW, I wouldn't have thought you'll have much trouble, run-time wise with a 6nt UMI - I suspect that even if all UMIs are used, UMI-tools still shouldn't struggle because there are only 4096 possible UMIs.

worker000000 commented 3 years ago

Thanks a lot for your kind and fast reply. I am here have another question about the arfument。 because --read2-in just described as Filename for read pairs, but not saying specific for reads2. # there are 6 arguments --bc-pattern, --bc-pattern2, --read2-in, --read2-out, -I, -S

does it has a One to one match? --bc-pattern, -I, -S for reads1 --bc-pattern2, --read2-in, --read2-out for reads2

2 if there is only single umi in reads1 or reads2, does it mean I just need to clarify one of the --bc-pattern, for example, if umi just in reads1,

umi_tools extract -I $fq1 --bc-pattern=CNNNC --read2-in=$fq2 --stdout=${sampleID}.R1.umi.fq.gz --read2-out=${sampleID}.R2.umi.fq.gz

if umi just in reads2,

umi_tools extract -I $fq1 --bc-pattern2=CNNNC --read2-in=$fq2 --stdout=${sampleID}.R1.umi.fq.gz --read2-out=${sampleID}.R2.umi.fq.gz

does the deafault value of --bc-pattern2 --bc-pattern is NONE?

worker000000 commented 3 years ago

1 should I trim adapter before I deal with umi(like tools trimmomatic or fastp or cutadpters), I guess it is a must, but I want to confirm it with you,

# 2 tools like umi_tools just remove the reads not needed intead of just adding duplicated tag , like tools samtools markdup or picard markdup?

worker000000 commented 3 years ago

because I also use ngs data for cnv calling, do you think umi data is needed for cnv(for example the dedup function in umitools), I think I just need to extract the umi data, but not needed to do consensus reads, how do you think ?

IanSudbery commented 3 years ago

I agree

On Thu, 17 Jun 2021 at 16:37, worker000000 @.***> wrote:

because I also use ngs data for cnv calling, do you think umi data is needed for cnv(for example the dedup function in umitools), I think I just need to extract the umi data, but not needed to do consensus reads, how do you think ?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/477#issuecomment-863345087, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJELDQHHIYTZFMV6SUUBYLTTIJCRANCNFSM46ZQGF4A .

worker000000 commented 3 years ago

@IanSudbery Thanks a lot why you agree that cnv is not needed to do consensus reads # # can you share your opion about my question

1 should I trim adapter before I deal with umi(like tools trimmomatic or fastp or cutadpters), I guess it is a must, but I want to confirm it with you,

2 tools like umi_tools just remove the reads not needed intead of just adding duplicated tag , like tools samtools markdup or picard markdup?