CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

dedup dir-adjacency too slow/does not complete #31

Closed royfrancis closed 7 years ago

royfrancis commented 8 years ago

I started with a 12 GB BAM (77 million reads) using 3 cores and 24 GB RAM. The run did not complete, or even come close, in 2.5 hours; the output BAM after 2.5 hours was 2.3 MB. I then extracted one chromosome to create a 314 MB BAM. Dedup ran on this small BAM for 1 hour and was still incomplete, leaving a truncated BAM file of 12.4 MB. I ran qualimap bamqc on both files (the one-chromosome BAM before dedup and after dedup).

| file | before dedup | after dedup |
| --- | --- | --- |
| reads | 5,104,918 | 197,280 |
| dup rate | 38.74% | 2.27% |

So, I suppose the deduplication must be working.

Extrapolating from this run time, my 12 GB BAM would take a lot more than 39 hours. Is this an expected run time? I was wondering whether I am doing something wrong or whether some settings/switches are not optimal.
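
For reference, the 39-hour figure can be reproduced with a rough size-based extrapolation. This is only a sketch of the arithmetic: it assumes run time scales linearly with BAM size (which may not hold) and uses binary units for the 12 GB file.

```python
# Rough extrapolation of dedup run time from BAM size.
# Assumption: run time scales linearly with file size.
full_bam_mb = 12 * 1024   # 12 GB full BAM, expressed in MB
subset_bam_mb = 314       # single-chromosome BAM, in MB
subset_hours = 1.0        # lower bound: the subset run was still incomplete

scale = full_bam_mb / subset_bam_mb
estimated_hours = scale * subset_hours
print(f"scale factor ~{scale:.1f}, estimate > {estimated_hours:.0f} hours")
```

Because the one-hour subset run had not finished, the result is a lower bound rather than an estimate of the actual run time.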

This is my script:

umi_tools dedup \
--method="directional-adjacency" \
--paired \
--output-stats="file-dedup.stats" \
-v 3 \
-E "file-dedup.error" \
-L "file-dedup.log" \
-I "file.bam" \
-S "file-dedup.bam"

crutching commented 7 years ago

Adding my own experiences. I have been assessing this collection of tools for use in one of our processing pipelines. Using the 0.4.3 release, I am running a paired-end analysis on a 36-million-line BAM produced from amplicon alignments. One run is restricted to chromosome 21, and the other covers the whole file. The chr21 run appeared to finish its first pass in less than a minute, but has since been running for over 2 hours on the unmatched mates stage (~3500 unmatched). The other run reports 110k unmatched mates, and took maybe 30 min to make the first pass through the file.

Pulling the gather_mates branch seems to improve the situation dramatically. I am now seeing no more than a minute spent on the finding mates step, at least when run on the full data set. When running on the single chromosome, I see the following error:

UnboundLocalError: local variable 'gene_tag' referenced before assignment

Not a big deal, it doesn't look like I will need to scatter gather the dedup step now that the runtime is more reasonable, but I thought it was worth bringing up.
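
For readers unfamiliar with this class of error, here is a minimal, generic reproduction. It is not the UMI-tools code: the function names and the "XF" tag value are hypothetical, chosen only to show how a name assigned in one branch and read unconditionally raises UnboundLocalError.

```python
def describe_read(read_name, per_gene):
    # gene_tag is only bound when per_gene is true...
    if per_gene:
        gene_tag = "XF"  # hypothetical tag value, for illustration only
    # ...but it is read on every path, so per_gene=False fails here.
    return f"{read_name}:{gene_tag}"


def describe_read_fixed(read_name, per_gene):
    gene_tag = None  # bind the name on every code path
    if per_gene:
        gene_tag = "XF"
    return f"{read_name}:{gene_tag}"


try:
    describe_read("read1", per_gene=False)
except UnboundLocalError as err:
    # The exact wording of the message depends on the Python version.
    print(f"UnboundLocalError: {err}")

print(describe_read_fixed("read1", per_gene=False))
```

A code path exercised only by a rarely used option can carry this kind of bug unnoticed until someone runs that option.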

Last thing, not completely related to the thread. I generally see entries in the final output that are not completely sorted. Up until this point, I have had to sort before and after dedup so that the next tool in the pipeline won't break. Is this something that you already know about?

Thanks for all the work on this tool!

IanSudbery commented 7 years ago

Thanks for your input. We'll certainly investigate your unbound error.

We're aware of the sort issue: the tool outputs reads in read.start position order rather than alignment start position order. The two differ for, e.g., reads on the reverse strand, where the alignment start is the 3' end of the read. We need to think about how to deal with this without having to write out and then sort.
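
A small sketch of why the two orderings disagree. The Read class and the coordinates below are made up for illustration and are not taken from UMI-tools; the only assumption is that a read's start is its 5' end, which is the rightmost aligned coordinate for a reverse-strand read.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    aln_start: int  # leftmost aligned coordinate (used by coordinate sorting)
    aln_end: int    # rightmost aligned coordinate
    reverse: bool

    @property
    def read_start(self):
        # 5' end of the read: leftmost coordinate on the forward strand,
        # rightmost coordinate on the reverse strand.
        return self.aln_end if self.reverse else self.aln_start


reads = [
    Read("fwd_a", 100, 149, reverse=False),
    Read("rev_b", 60, 109, reverse=True),
    Read("fwd_c", 120, 169, reverse=False),
]

# Emit in read-start order, then check whether that is coordinate sorted.
by_read_start = sorted(reads, key=lambda r: r.read_start)
emitted_aln_starts = [r.aln_start for r in by_read_start]
is_coordinate_sorted = emitted_aln_starts == sorted(emitted_aln_starts)
print([r.name for r in by_read_start], emitted_aln_starts, is_coordinate_sorted)
```

Here the reverse-strand read is emitted second because its 5' end is at 109, even though its alignment starts at 60, so the output is not coordinate sorted.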

TomSmithCGAT commented 7 years ago

Hi jhl667. Thanks for mentioning the error with the --chrom option. It turns out this option was not covered by testing, so the bug was missed when it was introduced. The fix will be included in v0.4.4.

TomSmithCGAT commented 7 years ago

@IanSudbery I'm not sure how we can output in alignment start position order unless we cache all reads per contig, which could be very memory intensive. We might be able to reduce the memory usage somewhat with a sort of 'rolling cache', where we output reads from the cache once they are sufficiently upstream of the reads being read in. Even then, chimeric read pairs would cause problems unless we retrieve them with calls to get_mate(), which would have performance issues. If we did implement sorted output order, I think we would definitely want to make it optional and warn users about the memory and performance issues.
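
One way to picture the 'rolling cache' idea is the sketch below. It is not UMI-tools code: the function name, the tuple layout and the max_span parameter are assumptions made for illustration. It handles a single contig and ignores chimeric pairs, which is exactly the case flagged as problematic above.

```python
import heapq


def rolling_sort(reads, max_span):
    """Yield (aln_start, name) in alignment-start order.

    reads: iterable of (read_start, aln_start, name), ordered by read_start.
    max_span: assumed upper bound on read_start - aln_start (roughly the
        longest aligned read length); a hypothetical tuning parameter.
    """
    cache = []  # min-heap keyed on alignment start
    for read_start, aln_start, name in reads:
        if read_start - aln_start > max_span:
            raise ValueError(f"{name} spans more than max_span")
        heapq.heappush(cache, (aln_start, name))
        # Every later read has read_start >= this one, so its alignment
        # start is >= read_start - max_span. Cached reads at or below that
        # bound can no longer be overtaken and are safe to emit.
        safe_bound = read_start - max_span
        while cache and cache[0][0] <= safe_bound:
            yield heapq.heappop(cache)
    # Input exhausted: flush whatever remains, smallest first.
    while cache:
        yield heapq.heappop(cache)


reads = [
    (100, 100, "fwd_a"),
    (109, 60, "rev_b"),
    (120, 120, "fwd_c"),
    (200, 200, "fwd_d"),
]
ordered = list(rolling_sort(reads, max_span=50))
print(ordered)
```

Memory use is bounded by the number of reads within max_span of the current position rather than by the size of the contig, which is the trade-off the comment describes.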

IanSudbery commented 7 years ago

No, I think we would have to output to a temp file and then sort to the output file.

TomSmithCGAT commented 7 years ago

Ah OK. Yeah, I see no reason not to add this as an option. Perhaps make clear to the user that this is how it's going to work so they don't try to pipe the output, not that I think anyone is probably doing this?

crutching commented 7 years ago

@TomSmithCGAT @IanSudbery Ok, so just to be clear, the sensible thing to do in a workflow utilizing umi_tools dedup is to sort before and after? I haven't actually assessed speed differences between sorting or not before dedup, but I am assuming it would make a significant difference? Honestly, for me, including an extra sort step doesn't make that much of a difference, though I still like to keep things as optimized as possible.
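
The sort-before-and-after workaround described in this thread could be sketched as a list of commands. This is an illustration only, not an official recommendation: the helper name and the output file names are hypothetical, it assumes samtools 1.x is on the PATH, and it only builds the commands without running them.

```python
import shlex


def dedup_workflow_commands(in_bam, prefix):
    """Return commands for sort -> index -> dedup -> sort -> index."""
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    final_bam = f"{prefix}.dedup.sorted.bam"
    return [
        # Coordinate-sort and index the input before dedup.
        ["samtools", "sort", "-o", sorted_bam, in_bam],
        ["samtools", "index", sorted_bam],
        # Deduplicate, using the options from the script earlier in the thread.
        ["umi_tools", "dedup", "--paired", "-I", sorted_bam, "-S", dedup_bam],
        # Re-sort and index, since the dedup output may not be fully sorted.
        ["samtools", "sort", "-o", final_bam, dedup_bam],
        ["samtools", "index", final_bam],
    ]


commands = dedup_workflow_commands("file.bam", "file")
for command in commands:
    print(shlex.join(command))
```

To execute the steps, each list could be passed to subprocess.run(command, check=True) in order.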

TomSmithCGAT commented 7 years ago

Hi @jhl667. An option to sort the output will be included in version 0.5. See #120.

TomSmithCGAT commented 7 years ago

I'm closing this issue now as the run time issues seem to have been dealt with. Any future issues with run time can be discussed in a separate issue.