CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

dedup running very slow #583

Closed: YichaoOU closed this issue 1 year ago

YichaoOU commented 1 year ago

Hi,

I have a BAM file with 16 million PE reads and a 10 bp UMI. The program has now been running for over 24 hours and, looking at the log file, it has only parsed 9M reads, so it looks like it will take more than two days. The time to reach 1M, 2M, 3M, etc. reads is also not linear: 1M took 3 min, 2M took 14 min, and 3M took 2 hours.

I'm wondering:

  1. Why is the processing time not linear? Since UMI deduplication works per position, and at each position we deduplicate the group of reads found there, shouldn't the total time be roughly linear in the number of reads?
  2. Would a simple parallelization be to split the BAM by chromosome, run dedup on each chromosome, and then merge the results?

Thanks, Yichao

2023-03-01 12:05:11,428 INFO Written out 100000 reads
2023-03-01 12:07:53,244 INFO Written out 200000 reads
2023-03-01 12:07:54,914 INFO Written out 300000 reads
2023-03-01 12:08:01,838 INFO Written out 400000 reads
2023-03-01 12:08:05,549 INFO Parsed 1000000 input reads
2023-03-01 12:08:07,374 INFO Written out 500000 reads
2023-03-01 12:08:12,339 INFO Written out 600000 reads
2023-03-01 12:08:34,034 INFO Written out 700000 reads
2023-03-01 12:22:46,498 INFO Parsed 2000000 input reads
2023-03-01 13:06:40,268 INFO Written out 800000 reads
2023-03-01 13:59:55,246 INFO Written out 900000 reads
2023-03-01 14:11:00,813 INFO Parsed 3000000 input reads
2023-03-01 14:20:44,228 INFO Written out 1000000 reads
2023-03-01 14:20:45,882 INFO Written out 1100000 reads
2023-03-01 14:21:39,660 INFO Written out 1200000 reads
2023-03-01 14:21:39,945 INFO Parsed 4000000 input reads
2023-03-01 16:29:57,880 INFO Written out 1300000 reads
2023-03-01 16:45:23,226 INFO Parsed 5000000 input reads
2023-03-01 16:52:14,563 INFO Written out 1400000 reads
2023-03-01 16:52:26,094 INFO Parsed 6000000 input reads
2023-03-01 20:45:01,986 INFO Written out 1500000 reads
2023-03-02 00:25:40,632 INFO Parsed 7000000 input reads
2023-03-02 04:26:38,440 INFO Written out 1600000 reads
2023-03-02 04:26:56,290 INFO Written out 1700000 reads
2023-03-02 04:26:59,858 INFO Parsed 8000000 input reads
2023-03-02 08:42:50,177 INFO Written out 1800000 reads
2023-03-02 08:42:50,768 INFO Parsed 9000000 input reads
IanSudbery commented 1 year ago

Hi,

dedup is linear in the number of positions it considers, but non-linear in the number of UMIs per position. The first thing dedup does after collecting all the UMIs at a position is to build an adjacency matrix based on the edit distance between the UMIs found at that position. The naïve implementation of this is quadratic in the number of UMIs; we have implemented some tricks to bring that down in the average case, but it is still substantially slower than linear. The next step is to turn that adjacency matrix into an edge-list graph representation, break the graph into connected components, and deconvolve the likely explanatory UMIs from that set of connected components. This step is also slower than linear, but rather than scaling with the number of UMIs (nodes), it scales with the number of edges in the graph. The average-case performance is not too bad, but in the worst case the time (and memory) requirements can explode. This tends to happen where a single position is roughly 30% saturated with UMIs (about 300,000 UMIs at a single position in your case), but where this actually happens depends on the precise distribution of UMIs in UMI space.
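
To illustrate the shape of the problem (this is just a sketch, not the actual UMI-tools code, which also weights UMIs by count and uses the directional method by default; networkx and the Hamming distance here are stand-ins):

```python
# Illustrative sketch only -- not the UMI-tools implementation.
# Shows why the per-position cost grows quadratically: every pair of UMIs
# observed at a position is compared to build the adjacency graph.
import itertools
import networkx as nx

def hamming(a, b):
    """Mismatch count between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def connected_umi_groups(umis, threshold=1):
    """Group the UMIs seen at one position into connected components,
    connecting any two UMIs that are within `threshold` mismatches."""
    graph = nx.Graph()
    graph.add_nodes_from(umis)
    # O(n^2) pairwise comparisons -- the step that dominates at positions
    # with very many distinct UMIs.
    for u, v in itertools.combinations(umis, 2):
        if hamming(u, v) <= threshold:
            graph.add_edge(u, v)
    return list(nx.connected_components(graph))

# e.g. connected_umi_groups(["AACGTTAACG", "AACGTAAACG", "TTTTTTTTTT"])
# -> two groups: the first two UMIs collapse, the third stays separate.
```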

Because the number of UMIs per position is often uneven, run time tends to be dominated by one or two difficult-to-solve positions. This limits the benefit you get from naïve parallelisation (e.g. on a per-chromosome basis), although it might help if you had, say, many difficult positions spread across chromosomes. The other thing to watch out for is that when time requirements expand, so do memory requirements, and with naïve parallelisation memory usage is proportional to the number of processes running. A rough sketch of that approach is shown below.
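
If you do want to try it, a per-chromosome split could look roughly like the following (purely a sketch, not an official UMI-tools workflow; the chromosome list, file names, and worker count are placeholders, and it assumes a coordinate-sorted, indexed BAM with samtools and umi_tools on the PATH):

```python
# Rough sketch of naive per-chromosome parallelisation.
import subprocess
from concurrent.futures import ProcessPoolExecutor

CHROMS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]  # placeholder names

def dedup_one_chrom(chrom, in_bam="input.bam"):
    """Extract one chromosome, run dedup on it, return the output BAM name."""
    chrom_bam = f"{chrom}.bam"
    dedup_bam = f"{chrom}.dedup.bam"
    subprocess.run(["samtools", "view", "-b", "-o", chrom_bam, in_bam, chrom], check=True)
    subprocess.run(["samtools", "index", chrom_bam], check=True)
    subprocess.run(["umi_tools", "dedup", "--paired",
                    "-I", chrom_bam, "-S", dedup_bam], check=True)
    return dedup_bam

if __name__ == "__main__":
    # Peak memory scales with the number of concurrent workers, and one
    # hard position can still dominate the overall wall time.
    with ProcessPoolExecutor(max_workers=4) as pool:
        parts = list(pool.map(dedup_one_chrom, CHROMS))
    subprocess.run(["samtools", "merge", "-f", "dedup.merged.bam", *parts], check=True)
```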

There are a number of other reasons why a run might be slow. Firstly, make sure you are not trying to run the whole genome through with --output-stats; these days we generally recommend using it only on a subset of the genome, as it makes things slow and memory hungry because it computes null distributions via simulation.
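
For example, if you do want the stats, you could restrict them to a single chromosome along these lines (again just a sketch; the chromosome, file names, and stats prefix are placeholders):

```python
# Illustrative only: compute --output-stats on one chromosome rather than
# the whole genome. Assumes samtools and umi_tools are on the PATH.
import subprocess

subprocess.run(["samtools", "view", "-b", "-o", "chr19.bam", "input.bam", "chr19"], check=True)
subprocess.run(["samtools", "index", "chr19.bam"], check=True)
subprocess.run(["umi_tools", "dedup", "--paired", "--output-stats=chr19_stats",
                "-I", "chr19.bam", "-S", "chr19.dedup.bam"], check=True)
```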

Secondly, the latest release should have improved performance where you have a large number of contigs and are running in paired mode (this should only really matter if you have thousands of contigs, and I would expect things to scale linearly in that case).

Finally, check that you are not running out of memory. The log you posted suggests that your run is slowing down over time; I would check that you are not exhausting memory and thrashing the disk, particularly if you are running this on a system with swap enabled (like a desktop or laptop).

YichaoOU commented 1 year ago

Thanks!

I didn't use --output-stats, as the documentation already warns about it.

I previously used 1.1.2 and now use 1.1.4, but the time to parse the first 4M reads is basically the same.

Memory is not an issue, since I'm running it on an HPC node with plenty of memory.

The run finished successfully last night, so it took about 36 hours for my dataset.

Thanks, Yichao