dedup dir-adjacency too slow/does not complete

royfrancis commented 8 years ago

I started with a 12 GB BAM (77 million reads) using 3 cores and 24 GB RAM. It did not complete in 2.5 hours, or even come close: the output BAM after 2.5 hours was 2.3 MB. I then extracted one chromosome to create a 314 MB BAM. Dedup ran on this small BAM for 1 hour and was still incomplete, leaving a truncated 12.4 MB BAM. I ran qualimap bamqc on both files (the single-chromosome BAM before dedup and the truncated BAM after dedup).

metric	before dedup	after dedup
reads	5,104,918	197,280
dup rate	38.74%	2.27%

So, I suppose the deduplication must be working.

Extrapolating from this run time, my 12 GB BAM would take a lot more than 39 hours. Is this an expected run time? I was wondering whether I am doing something wrong or whether some settings or switches are not optimal.
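The 39-hour figure is consistent with scaling the single-chromosome run by file size. The sketch below reproduces that arithmetic under two assumptions that are not stated in the post: run time grows linearly with BAM size, and 1 GB is counted as 1024 MB. Because the 1-hour run never finished, the result is only a lower bound.

```shell
# Sizes taken from the post; the MB conversion assumes 1 GB = 1024 MB.
full_mb=$((12 * 1024))   # 12 GB full BAM
chrom_mb=314             # single-chromosome BAM
chrom_hours=1            # lower bound: the run was still incomplete after 1 hour

# Integer division; assumes run time scales linearly with file size.
min_hours=$(( full_mb * chrom_hours / chrom_mb ))
echo "Estimated lower bound: ${min_hours} hours"
```

If run time grows faster than linearly with the number of reads per position, the real figure would be higher still.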

This is my script:

umi_tools dedup \
--method="directional-adjacency" \
--paired \
--output-stats="file-dedup.stats" \
-v 3 \
-E "file-dedup.error" \
-L "file-dedup.log" \
-I "file.bam" \
-S "file-dedup.bam"
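For reference, a single-chromosome test file like the one described above could be produced roughly as follows. This is a minimal sketch, assuming samtools is on the PATH and the input BAM is coordinate-sorted. The function name `extract_region`, the region `chr21`, and the output file name are hypothetical; the post does not say which chromosome was extracted.

```shell
# Extract one region from a coordinate-sorted BAM into its own indexed BAM.
# Hypothetical helper: the name and argument order are illustrative only.
extract_region() {
    in_bam="$1"
    region="$2"
    out_bam="$3"
    # Region queries need an index on the input file.
    samtools index "$in_bam" &&
    samtools view -b -o "$out_bam" "$in_bam" "$region" &&
    samtools index "$out_bam"
}

# Usage (region and output name are assumptions):
# extract_region file.bam chr21 file.chr21.bam
```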

crutching commented 7 years ago

Adding my own experiences. I have been assessing this collection of tools for use in one of our processing pipelines. Using the 0.4.3 release, I am running a paired-end analysis on a 36-million-line BAM produced from amplicon alignments. One run covers only chromosome 21, and the other covers the whole file. The chr21 run looked to be done in less than a minute, but it has since been running for over 2 hours on the unmatched mates stage (~3500 unmatched). The other run claims 110k unmatched mates, and took maybe 30 minutes to make the first pass through the file.

Pulling the gather_mates branch seems to improve the situation dramatically. I am now seeing no more than a minute spent on the finding mates step, at least when run on the full data set. When running on the single chromosome, I see the following error:

UnboundLocalError: local variable 'gene_tag' referenced before assignment

Not a big deal: it doesn't look like I will need to scatter-gather the dedup step now that the runtime is more reasonable, but I thought it was worth bringing up.

Last thing, not completely related to the thread. I generally see entries in the final output that are not completely sorted. Up until this point, I have had to sort before and after dedup so that the next tool in the pipeline won't break. Is this something that you already know about?

Thanks for all the work on this tool!

IanSudbery commented 7 years ago

Thanks for your input. We'll certainly investigate your unbound error.

We're aware of the sort issue: the tool outputs reads in read.start position order rather than alignment start position order (e.g. the 3' end of reads on the reverse strand). We need to think about how to deal with this without having to write out and then sort.
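Until the output order is addressed, the workaround mentioned earlier in the thread (sorting again after dedup) can be sketched as below. This assumes samtools is on the PATH; the function name `resort_bam` and the output file name are hypothetical, while `file-dedup.bam` is the output name from the script at the top of the thread.

```shell
# Coordinate-sort the dedup output and index it for downstream tools.
# Hypothetical helper: the name and argument order are illustrative only.
resort_bam() {
    in_bam="$1"
    out_bam="$2"
    samtools sort -o "$out_bam" "$in_bam" && samtools index "$out_bam"
}

# Usage (the sorted output name is an assumption):
# resort_bam file-dedup.bam file-dedup.sorted.bam
```

This is exactly the write-out-then-sort step the comment above hopes to avoid inside the tool, so it costs an extra pass over the output.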
