Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License

Can UMICollapse be used for single UMI and duplex UMI data? #10

Open worker000000 opened 3 years ago

worker000000 commented 3 years ago

Dear professor, thanks a lot for making such a great tool. Can it be used for single UMI and duplex UMI data?

worker000000 commented 3 years ago

We usually have FASTQ data, and I saw that this tool supports both FASTQ and BAM modes:

--merge: method for identifying which UMI to keep out of every two UMIs. Either any, avgqual, or mapqual. Default: mapqual for SAM/BAM mode, avgqual for FASTQ mode.

Can you share a full command for this?

--paired: use paired-end mode, which deduplicates pairs of reads from a SAM/BAM file. The template length of each read pair, along with the alignment coordinate and UMI of the forwards read, are used to deduplicate read pairs. This is very memory intensive, and the input SAM/BAM files should be sorted. Default: false (single-end).

How should I prepare this BAM? Should I first extract the UMIs from the FASTQ, then run bwa mem and sort? Or should I just run bwa mem and sort?

Daniel-Liu-c0deb0t commented 3 years ago

FASTQ and BAM data are processed differently.

FASTQ data is deduplicated based on the entire read. This mode does not support paired-end reads. This mode is used to deduplicate data without having to align to a reference. Aligning first is time-consuming, but it may give better results.

Here is an example of the merge flag:

./umicollapse fastq -i input.fastq -o output.fastq --merge avgqual

However, this is redundant because by default avgqual is used for FASTQ mode. This means that the read with the highest average quality score is the only one that is output when collapsing a group of multiple reads.
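
As a rough illustration of what "highest average quality score" means, here is a minimal Python sketch. It is not UMICollapse's actual code: the function names are hypothetical, the records are simple (header, sequence, quality) tuples, and Phred+33 quality encoding with non-empty quality strings is assumed.

```python
def average_quality(quality_string, phred_offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - phred_offset for ch in quality_string]
    return sum(scores) / len(scores)


def pick_by_avgqual(reads):
    """Return the (header, sequence, quality) record with the highest mean quality."""
    return max(reads, key=lambda read: average_quality(read[2]))


# A group of reads that would be collapsed together (hypothetical data).
group = [
    ("@read_a", "ACGT", "IIII"),  # Phred 40 at every base
    ("@read_b", "ACGT", "II#I"),  # one Phred 2 base lowers the mean
]
best = pick_by_avgqual(group)
print(best[0])
```

Only the selected record would be written out for the group; the other reads in the group are dropped.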

For BAM mode, two major steps have to be done before running UMICollapse:

  1. Extract UMIs from the reads in FASTQ format and add them to the headers
  2. Align reads to get a BAM file

UMICollapse does not do these two steps. You should follow the instructions here: https://umi-tools.readthedocs.io/en/latest/QUICK_START.html

The only difference is using UMICollapse instead of UMI-tools. UMICollapse should be much faster than UMI-tools, but it should produce very similar results.

Let me know if you have any other questions. Also I'm not a professor.

worker000000 commented 3 years ago

Thanks a lot for your quick and helpful reply. I have some other questions:

  1. Can UMICollapse be used for single UMI and duplex UMI data?
  2. How does the autodetect mode work, and is it accurate enough?
  3. My data uses duplex UMIs (both reads have a UMI). The first base of the UMI has poor sequence quality, so we ignore it. The next three bases are my UMI, and the base after that is a T (for T/A ligation). Can I use autodetect mode?
  4. UMI-tools uses bowtie. Is it better than bowtie2 and bwa?

Daniel-Liu-c0deb0t commented 3 years ago

For the first 3 questions: UMICollapse can handle single UMIs. In paired-end mode, it will ignore the UMI of the second read.

How the UMI is preprocessed is not handled by UMICollapse. You will have to extract the UMIs from the reads, remove the first base, and put this cleaned up UMI in the read header before alignment. The UMI in the header is what is used by UMICollapse. The only thing that is autodetected is the length of this UMI in the header. UMI-tools provides a way to extract UMIs, ignore bases, and put them in the header, based on a certain pattern.
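
The preprocessing described above can be sketched in Python for a single FASTQ record. This is only an illustration under stated assumptions: the layout (1 ignored base, 3 UMI bases, 1 linker base) is taken from the question, the function name and defaults are hypothetical, and `_` is assumed as the separator between the read name and the UMI. In practice UMI-tools would do this step.

```python
def move_umi_to_header(header, sequence, quality,
                       skip=1, umi_len=3, linker_len=1, sep="_"):
    """Strip the UMI prefix from a read and append the UMI to the read name.

    Assumed layout of the read start: `skip` ignored bases, then `umi_len`
    UMI bases, then `linker_len` constant bases (e.g. the T for T/A ligation).
    """
    prefix_len = skip + umi_len + linker_len
    if len(sequence) < prefix_len:
        raise ValueError("read is shorter than the UMI prefix")
    umi = sequence[skip:skip + umi_len]
    # Keep any FASTQ comment (text after the first space) after the UMI.
    name, _, comment = header.partition(" ")
    new_header = name + sep + umi + (" " + comment if comment else "")
    # Trim the prefix from both sequence and quality so they stay in sync.
    return new_header, sequence[prefix_len:], quality[prefix_len:]


record = move_umi_to_header("@read1 1:N:0:ACGT", "NACGTGGGGCCCC", "#IIIIIIIIIIII")
print(record)
```

The trimmed read is what would go to the aligner, and the UMI in the header is what a deduplication tool would later read.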

For the fourth question, newer tools are probably better, but I'm not sure.

worker000000 commented 3 years ago

Thanks a lot for your kind and fast reply.

  1. "In paired-end mode, it will ignore the UMI of the second read." Will this affect the accuracy of the data, for example by producing false-positive variants on just one strand? Why not use both UMIs? Is there an underlying reason?
  2. My UMI is 5 bases long and sits at the 5' end of both read 1 and read 2. The first base of the UMI is always low quality, so it needs to be removed, and the last base is a constant base (for T/A ligation).

I tried to use umi_tools extract like this:

umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz

but the headers for the mate reads in read 1 and read 2 look like this:

@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT

This is not what I expected. In umi_tools, C = cell barcode, N = UMI, P = plate, X = read sequence. Is there an error in my command?

Daniel-Liu-c0deb0t commented 3 years ago

It seems that 2 is answered by the excellent maintainers behind UMI-tools.

For 1, UMICollapse was originally created as a proof of concept for better algorithms for deduplicating many, many UMIs. This meant that not all features were implemented, only the most important ones with single-end sequences. Later, more features were added due to user requests, but it is still not as feature-complete as UMI-tools, which has existed for a long time. I would recommend UMICollapse only for cases where users encounter issues with other tools on massive datasets. I agree with the UMI-tools maintainers that with only 6bp UMIs, there wouldn't be a lot of UMIs to deduplicate.

For your case, if you really wanted to use UMICollapse, there is a workaround where you extract the UMIs from both reads and place them in the header of the first read, then deduplicate using paired end mode.

worker000000 commented 3 years ago

Thanks a lot. So I need to remove the UMI in read 2, is that right?

Daniel-Liu-c0deb0t commented 3 years ago

Ideally, you would remove the UMI from read2 and concatenate it to the UMI of read1 (to form a 6bp UMI) and place this UMI in the read1 FASTQ headers.

worker000000 commented 3 years ago

Is there any tool to remove this efficiently? Thanks a lot.

Daniel-Liu-c0deb0t commented 3 years ago

Perhaps you can do it with UMI-tools? They have a way of extracting UMIs from read1 and read2 and putting them in the respective headers.

If you want to concatenate the UMIs and put them in the read1 header, then you may have to write a simple script to do it. I don't think UMI-tools can handle that.
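
A minimal Python sketch of such a script's core step is below. It assumes, hypothetically, that each mate's header already carries that read's own UMI as the last `_`-separated field of the read name; the function names are illustrative only, and a real script would still need to loop over both FASTQ files in step.

```python
def split_umi(header, sep="_"):
    """Split a FASTQ header into (read name without UMI, UMI, comment)."""
    name, _, comment = header.partition(" ")
    base, found, umi = name.rpartition(sep)
    if not found:
        raise ValueError("no UMI separator in header: " + header)
    return base, umi, comment


def merge_umis_into_read1(header1, header2, sep="_"):
    """Build a read1 header whose UMI is the read1 UMI followed by the read2 UMI."""
    base1, umi1, comment1 = split_umi(header1, sep)
    base2, umi2, _ = split_umi(header2, sep)
    if base1 != base2:
        raise ValueError("headers do not belong to the same read pair")
    merged = base1 + sep + umi1 + umi2
    return merged + (" " + comment1 if comment1 else "")


# Two 3bp UMIs become one 6bp UMI in the read1 header (hypothetical headers).
new_header1 = merge_umis_into_read1("@r1_ACG 1:N:0:X", "@r1_TTA 2:N:0:X")
print(new_header1)
```

The mate check guards against the two FASTQ files being out of sync, which would otherwise silently attach the wrong UMI to a read.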

worker000000 commented 3 years ago

Thanks a lot. Do you mean doing the following?

umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I 28_1.fq.gz -S R1_TMP_umitools.fq.gz --read2-in=28_2.fq.gz --read2-out=R2_TMP_umitools.fq.gz

# delete cell_code and umi from fq2
zless R2_TMP_umitools.fq.gz | sed -r 's#(@[^_]+)_[^ ]+( 2:N:0)#\1\2#' | pigz - > n2.fq.gz

# bwa and samtools

./umicollapse bam -i paired_example.bam -o dedup_paired_example.bam --umi-sep _ --paired --two-pass

worker000000 commented 3 years ago

Could you take a look at my issue here? https://github.com/CGATOxford/UMI-tools/issues/477 Thanks a lot.

Daniel-Liu-c0deb0t commented 3 years ago

Are you removing the UMIs in the FASTQ headers for read2? You do not need to do that. You only need to extract the UMI from the read sequences so it does not interfere with alignment (this was what I meant by "remove" in my previous comments; the UMIs need to be removed from the sequences, but not the headers). UMICollapse simply ignores the UMIs that are in the headers of the read2 FASTQ files, so there is no need to remove them. Sorry, I was not very clear on this.

I hate to say this but I can't write your pipeline for you. I can only provide help related to this tool, so if you have more general concerns I suggest asking on biostars or something.

worker000000 commented 3 years ago

Thanks a lot. So you mean I should run it in paired-end mode. You said that only the UMI of read 1 is used. After many erroneous reads are removed from read 1, how are their mate reads in read 2 treated? I am curious about this.

Daniel-Liu-c0deb0t commented 3 years ago

When you pass in the --paired flag, any read1 that is removed will cause its corresponding read2 to be removed too. (Same behavior as UMI-tools)
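
To make this behavior concrete, here is a simplified Python sketch. It is not UMICollapse's implementation: it only groups pairs whose read1 coordinate, template length, and UMI match exactly (no tolerance for UMI sequencing errors), keeps the pair with the highest mapping quality, and uses hypothetical names and data throughout.

```python
from collections import namedtuple

# One entry per read pair, described by its read1 alignment (hypothetical fields).
Pair = namedtuple("Pair", "name pos tlen umi mapq")


def dedup_pairs(pairs):
    """Keep one pair per (read1 position, template length, UMI) group.

    Within a group the pair with the highest mapping quality wins.
    Returns the set of read names that survive.
    """
    best = {}
    for pair in pairs:
        key = (pair.pos, pair.tlen, pair.umi)
        if key not in best or pair.mapq > best[key].mapq:
            best[key] = pair
    return set(p.name for p in best.values())


def filter_records(records, kept_names):
    """Drop every record, read1 or read2, whose read name was not kept."""
    return [rec for rec in records if rec[0] in kept_names]


pairs = [
    Pair("a", 100, 250, "ACG", 60),
    Pair("b", 100, 250, "ACG", 30),  # duplicate of "a" with lower MAPQ
    Pair("c", 100, 250, "TTA", 60),  # different UMI, so not a duplicate
]
records = [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 1), ("c", 2)]
kept = dedup_pairs(pairs)
print(filter_records(records, kept))
```

Because filtering is done by read name, dropping read1 of pair "b" drops its read2 as well, so the output never contains an orphaned mate.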

worker000000 commented 3 years ago

Thanks a lot for your helpful answer.