Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License

fastq mode output #6

Closed · coffeebond closed this issue 3 years ago

coffeebond commented 3 years ago

Hi,

Thanks for sharing this program.

I have been using umi_tools (https://github.com/CGATOxford/UMI-tools) from Smith et al. (2017) for collapsing my barcode sequences. The problem is that I have a lot of unique barcodes (each 20 nt long, ~4,000,000 total) to collapse, and their program is very slow. I used their UMIClusterer API and tried it with ~400,000 unique barcodes; the runtime was about 1 hour. The runtime increased dramatically to 10 hours when I ran it with ~1,000,000 unique barcodes.
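
For reference, my usage looks roughly like this (a minimal sketch; the counts dict is actually built from my fastq file):

```python
from umi_tools import UMIClusterer

# barcode -> read count; UMIClusterer expects bytes keys
counts = {
    b"ACGTACGTACGTACGTACGT": 120,
    b"ACGTACGTACGTACGTACGA": 4,
}

clusterer = UMIClusterer(cluster_method="directional")
# returns a list of clusters; each cluster is a list of barcodes,
# with the representative (highest-count) barcode first
clusters = clusterer(counts, threshold=1)
```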

I tried your program and it's significantly faster. However, I couldn't get the right information from the output. My input file is a fastq file. I want to keep all the reads in the output and know which read belongs to which clustered core sequence. I think the current version only outputs the clustered core sequences and throws away the reads that were collapsed onto them. Is there an option in the program that would let me achieve this?

Daniel-Liu-c0deb0t commented 3 years ago

Currently, there isn't an option to do this. However, if you tell me exactly what you need, I can look into it and possibly implement it in the tool. You are running UMICollapse with the fastq option on a raw fastq file, right? So you want some sort of tag appended to the header of each fastq record, with an ID that identifies reads in the same cluster?

coffeebond commented 3 years ago

Thanks for your reply.

Yes, I am using the fastq option. And that's right, a tag indicating the cluster that each read belongs to would be very helpful. How about using the core sequence of each cluster (the one with the most counts) as the tag?

I assume the program collapses reads with the exact same sequence first and counts each unique sequence before clustering, right? Could you report that count as well?

Basically, the information I am trying to get is:

  1. the unique sequence
  2. the read count of that unique sequence
  3. the primary sequence of the cluster that the unique sequence belongs to

I think the current version with the fastq option only outputs the primary sequence of each cluster.

Or maybe the program could take an alternative input, like the UMIClusterer API does. I think it would be a text file with the unique sequences in one column and the count of each unique sequence in a second column. This information can easily be obtained from the raw fastq file.
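
For example, something like this (a minimal sketch; filenames are placeholders):

```python
from collections import Counter

counts = Counter()
with open("reads.fastq") as f:
    for i, line in enumerate(f):
        if i % 4 == 1:  # the sequence line of each 4-line fastq record
            counts[line.strip()] += 1

with open("barcode_counts.tsv", "w") as out:
    for seq, n in counts.most_common():
        out.write(f"{seq}\t{n}\n")
```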

Hopefully, this is not too much to ask. Let me know what you think. Thank you.

Daniel-Liu-c0deb0t commented 3 years ago

I think this is doable, but it probably requires many changes across the codebase. I'll try to work on it in the next few days.

What do you think about assigning a unique numerical ID to each cluster, so that all reads in a cluster share the same ID? This would potentially be easier to implement and less ambiguous.

coffeebond commented 3 years ago

Thanks!

I think assigning a unique ID to each cluster would also work. I can go through the whole fastq dataset again, collapse identical sequences, and group them by cluster ID. Then the sequence with the highest read count in each group would be the primary sequence of its cluster.
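
Roughly the post-processing I have in mind (a sketch that assumes the cluster ID ends up in the header as a field like `cluster_id=N`; the tag name is just a placeholder for whatever you implement):

```python
from collections import Counter, defaultdict

per_cluster = defaultdict(Counter)  # cluster ID -> sequence -> read count

with open("tagged.fastq") as f:
    for i, line in enumerate(f):
        if i % 4 == 0:  # header line; "cluster_id=" is a hypothetical tag
            cid = next(t.split("=", 1)[1] for t in line.split()
                       if t.startswith("cluster_id="))
        elif i % 4 == 1:  # sequence line of the same record
            per_cluster[cid][line.strip()] += 1

# the most frequent sequence in each cluster is the primary sequence
for cid, ctr in per_cluster.items():
    primary, count = ctr.most_common(1)[0]
    print(cid, primary, count, sep="\t")
```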

Daniel-Liu-c0deb0t commented 3 years ago

I am making good progress on implementing this in the code. I am including cluster ID, cluster size, and the number of reads with the exact same UMI in the FASTQ header. Testing still needs to be done, though. I'm also looking into supporting this feature for SAM/BAM inputs.

coffeebond commented 3 years ago

Sounds good. I think this is going to be very helpful.

Daniel-Liu-c0deb0t commented 3 years ago

I think I've finished adding this feature. To enable tagging reads instead of collapsing them, use the --tag argument. There are more details in the README.md. I haven't thoroughly tested everything, though. If you can, please run the latest .jar file on your dataset and let me know how it goes.
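
For a raw fastq file, the command should look something like `java -jar umicollapse.jar fastq -i input.fastq -o output.fastq --tag` (check the README for the exact options).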

coffeebond commented 3 years ago

Thanks! I tried it, but it seems that the new version requires Java 14 instead of Java 11 as before. Is there a reason you changed this? Would it be possible to change it back to Java 11?

Daniel-Liu-c0deb0t commented 3 years ago

Ah my bad. I didn't realize I had an updated version of Java. I've added commands that force the compiler to target Java 11.

coffeebond commented 3 years ago

So I tested the new version with one of my datasets, which has 30,973,950 total reads and 4,888,620 unique barcodes. I think it worked pretty well: it took only 285 seconds to finish, and it gave 1,427,265 clusters.

As a comparison, it took umi_tools 477,381 seconds (about 5.5 days) to cluster the same dataset, which gave me 1,425,307 clusters. I haven't systematically analyzed the differences yet, but the numbers are close.

The speed is really impressive! Thanks for the good work.

Daniel-Liu-c0deb0t commented 3 years ago

Very glad to see that it works well. Let me know if you analyze the differences. I've tried to reverse-engineer all the little nuances of UMI-tools in order to get the same results, but it seems that there are always edge cases and other difficulties. Any differences you spot could be indicative of larger bugs, which is why I am interested.

By the way, what's your workflow? UMI-tools requires alignment first, before collapsing aligned reads based on their UMIs. However, you mentioned that you are using UMICollapse's fastq mode, which does not do alignment and instead collapses directly on the read sequence, right? If that is the case, then it makes sense that there are differences in the output. Additionally, in this case you may want to increase the number of allowed errors, since instead of the usual 10-20 bp UMIs, we are grouping longer sequences.

coffeebond commented 3 years ago

I will definitely let you know once I figure out what the differences are.

I am directly sequencing the barcodes, which are part of a reporter mRNA, so you can consider the whole dataset as mapping to a single location. For UMI-tools, I used their UMIClusterer API, which takes in UMIs that I counted from the input fastq file.

I'm a little hesitant to increase the allowed errors because the barcodes are not pre-defined (they were cloned as random sequences), so I have no idea what the real minimal Hamming distance is between any two different barcodes. Having said that, I think the chance of any two barcodes falling within a Hamming distance of 2 of each other, among ~1 million barcodes sampled from 4^20 possible variants, should be really small.
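
A quick back-of-the-envelope check, assuming the barcodes are uniformly random:

```python
from math import comb

n = 1_000_000   # observed unique barcodes
L = 20          # barcode length
space = 4 ** L  # ~1.1e12 possible sequences

def neighbors(d):
    # number of sequences within Hamming distance d of a fixed barcode
    return sum(comb(L, k) * 3 ** k for k in range(d + 1))

pairs = comb(n, 2)
for d in (1, 2):
    print(d, round(pairs * neighbors(d) / space))
# ~28 pairs within distance 1 and ~800 within distance 2, i.e. only a
# tiny fraction of the 1,000,000 barcodes would collide by chance
```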

Daniel-Liu-c0deb0t commented 3 years ago

Ah I see. Yeah, your case is pretty much the "best case scenario" for UMICollapse: you basically don't rely on mapping information, so the bottleneck of any UMI deduplication software would be in grouping similar UMIs, which is heavily optimized in UMICollapse. If the reads mapped to different locations, then grouping would be much easier, since UMIs from different alignment coordinates do not need to be compared at all (sketched below). Since your barcodes are only 20 bp long, it likely isn't wise to allow more than 1 error.
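
To illustrate, a simplified sketch (not the actual implementation) of how aligned-mode deduplication can partition the work by coordinate:

```python
from collections import defaultdict

def dedup_by_coordinate(reads, cluster_fn):
    """reads: iterable of (chrom, pos, umi); cluster_fn groups similar UMIs."""
    buckets = defaultdict(list)
    for chrom, pos, umi in reads:
        # UMIs at different alignment coordinates are never compared
        buckets[(chrom, pos)].append(umi)
    # the expensive clustering is paid per bucket, not over the whole dataset
    return {coord: cluster_fn(umis) for coord, umis in buckets.items()}
```

In your single-locus case everything lands in one bucket, so the entire cost is in grouping the ~5 million unique sequences against each other.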

AbeKh commented 6 months ago

@coffeebond I am attempting something similar. Do you have any updates on the differences in approach?

coffeebond commented 6 months ago

> @coffeebond I am attempting something similar. Do you have any updates on the differences in approach?

I gave up on UMI-tools because it was way too slow for my data. UMICollapse works well for me.