Closed: YichaoOU closed this issue 1 year ago
Hi,
dedup is linear in the number of positions it considers, but non-linear in the number of UMIs per position. The first thing dedup does after collecting all the UMIs at a position is build an adjacency matrix based on the edit distance between UMIs found at the same position. The naïve implementation of this is quadratic in the number of UMIs. We have implemented some tricks to bring that down in the average case, but it's still substantially slower than linear.

The next step is to turn that adjacency matrix into an edge-list graph representation, break the graph into connected components, and deconvolve the likely explanatory UMIs from that set of connected components. This step is also slower than linear, but rather than scaling with the number of UMIs (nodes), it scales with the number of edges in the graph. The average-case performance is not too bad, but in the worst case, time (and memory) requirements can explode. This tends to happen where a single position is approximately 30% saturated with UMIs (300,000 UMIs at a single position in your case), but where this actually happens will depend on the precise distribution of UMIs in UMI space.
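The adjacency and connected-components steps described above can be sketched roughly like this. This is a simplified illustration, not UMI-tools' actual implementation: the one-mismatch threshold, the use of Hamming distance (valid because UMIs at a position are fixed-length), and the search routine are all assumptions for the example.

```python
from itertools import combinations

def hamming(a, b):
    # UMIs at one position are the same length, so Hamming distance
    # serves as the edit distance here.
    return sum(x != y for x, y in zip(a, b))

def connected_components(umis, threshold=1):
    # Naive adjacency step: O(n^2) pairwise comparisons, connecting
    # UMIs that are within `threshold` mismatches of each other.
    adj = {u: set() for u in umis}
    for a, b in combinations(umis, 2):
        if hamming(a, b) <= threshold:
            adj[a].add(b)
            adj[b].add(a)
    # Break the edge-list graph into connected components with a
    # simple graph traversal; each component is one candidate group
    # of reads explained by a single "true" UMI.
    seen, components = set(), []
    for u in umis:
        if u in seen:
            continue
        comp, stack = set(), [u]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)
    return components
```

The quadratic pairwise loop is exactly why a position with hundreds of thousands of UMIs dominates the run time, and a densely saturated position also produces many edges, which is what blows up the component-resolution step.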
Because the number of UMIs per position is often uneven, run time tends to be dominated by one or two difficult-to-solve positions. This limits the benefits gained from naïve parallelisation (e.g. on a per-chromosome basis), but obviously it might help if you had, for example, many difficult positions spread across chromosomes. The other thing to look out for is that when time requirements expand, so do memory requirements, and with naïve parallelisation, memory usage is proportional to the number of processes running.
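Per-chromosome naïve parallelisation could be sketched along these lines. This is a hypothetical illustration, not a supported workflow: it assumes samtools and GNU parallel are installed, that `in.bam` is indexed, and all file names are made up. Keep in mind the memory caveat above, since each worker carries its own footprint.

```shell
# List contigs from the index, then dedup each one in its own process.
# -j 4 caps the number of concurrent workers (and hence peak memory).
samtools idxstats in.bam | cut -f1 | grep -v '^\*' |
  parallel -j 4 '
    samtools view -b in.bam {} > {}.bam &&
    samtools index {}.bam &&
    umi_tools dedup -I {}.bam -S dedup.{}.bam
  '
```

The per-contig outputs would then need to be combined afterwards, e.g. with `samtools merge`.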
There are a number of other reasons why a run might be slow. Firstly, make sure that you are not trying to run a whole genome through with --output-stats; we generally only recommend using it on a subset of the genome these days, as it makes things slow and memory-hungry by computing null distributions via simulation.
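Restricting the stats run to a subset could look something like the following. This is an illustrative sketch only: it assumes samtools is available, the BAM is indexed, and the file names and the choice of chr1 are placeholders.

```shell
# Extract a single chromosome, then collect dedup stats on just that
# subset instead of the whole genome.
samtools view -b in.bam chr1 > chr1.bam
samtools index chr1.bam
umi_tools dedup -I chr1.bam -S chr1.dedup.bam --output-stats=chr1_stats
```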
Secondly, the latest release should have improved performance where you have a large number of contigs and are running in paired mode (this should only really have an effect if you have thousands of contigs, and I would expect things to scale linearly in that case).
Finally, check you are not running out of memory. The log you posted seems to suggest that your run is slowing down over time; I would be tempted to check that you are not running out of memory and thrashing the disk, if you are running this on a system with a swap file enabled (like a desktop or laptop).
Thanks!
I didn't use --output-stats, as was advised in the documentation.
I previously used 1.1.2 and now use 1.1.4, but the run time is basically the same for parsing the first 4M reads.
Memory is not an issue, since I'm running it on an HPC system with ample memory.
The run finished successfully last night, so it took about 36 hours for my dataset.
Thanks, Yichao
Hi,
I have a BAM file with 16 million PE reads; the UMI is 10 bp. The program has now been running for over 24 hours, and looking at the log file, it has only parsed 9M reads, so it seems it will take more than 2 days. The time to process 1M, 2M, 3M, etc. reads is not linear: 1M took 3 min, 2M took 14 min, and 3M took 2 hours.
I'm wondering:
Thanks, Yichao