Closed royfrancis closed 7 years ago
Adding my own experiences. I have been assessing this collection of tools for use in one of our processing pipelines. Using the 0.4.3 release, I am running a paired-end analysis off of a 36mil line BAM produced from amplicon alignments. I am doing one run only on chromosome 21, and the other is for the whole file. The chr21 run looked to be done in less than a minute, but then has been running for over 2 hours on the unmatched mates stage (~3500 unmatched). The other run claims 110k unmatched mates, and took maybe 30 min to make the first pass through the file.
Pulling the gather_mates branch seems to improve the situation dramatically. I am now seeing no more than a minute spent on the finding mates step, at least when run on the full data set. When running on the single chromosome, I see the following error:
UnboundLocalError: local variable 'gene_tag' referenced before assignment
Not a big deal, it doesn't look like I will need to scatter gather the dedup step now that the runtime is more reasonable, but I thought it was worth bringing up.
Last thing, not completely related to the thread. I generally see entries in the final output that are not completely sorted. Up until this point, I have had to sort before and after dedup so that the next tool in the pipeline won't break. Is this something that you already know about?
Thanks for all the work on this tool!
Thanks for your input. We'll certainly investigate your unbound error.
We're aware of the sort issue: the problem is that the tool outputs reads in read start position order rather than alignment start position order (for reads on the reverse strand, the alignment start is the 3' end of the read). We need to think about how to deal with this without having to write out and then sort.
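To make the ordering problem concrete, here is a minimal sketch in plain Python. The `Read` tuple and the `read_start` helper are hypothetical names for illustration only, not UMI-tools or pysam objects. It shows how ordering by the 5' read start can disagree with coordinate (alignment start) order as soon as a reverse-strand read is involved.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical stand-in for an aligned read (not a pysam object)."""
    name: str
    aln_start: int    # leftmost aligned position, used by coordinate sorting
    aln_end: int      # rightmost aligned position
    is_reverse: bool


def read_start(read: Read) -> int:
    """5' position of the read: the alignment end for reverse-strand reads."""
    return read.aln_end if read.is_reverse else read.aln_start


reads = [
    Read("r1", 100, 150, False),
    Read("r2", 120, 170, True),   # reverse strand: 5' end is at 170
    Read("r3", 160, 210, False),
]

by_read_start = sorted(reads, key=read_start)
by_aln_start = sorted(reads, key=lambda r: r.aln_start)

print([r.name for r in by_read_start])  # r2 is emitted after r3
print([r.name for r in by_aln_start])   # coordinate order
```

In this toy input, the reverse-strand read `r2` is emitted after `r3` when ordering by read start, even though its alignment start is smaller, which is the kind of out-of-order entry a downstream tool that expects coordinate-sorted input would reject.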
Hi jhl667. Thanks for mentioning the error with the --chrom option. It turns out this option was not covered by testing, so this bug was missed when it was introduced. The fix will be included in v0.4.4.
@IanSudbery I'm not sure how we can output in alignment start position order, unless we cache all reads per contig, which could be very memory intensive. We might be able to reduce the memory usage somewhat with a sort of 'rolling cache', where we output reads from the cache once they are sufficiently upstream of the reads being read in. Even then, chimeric read pairs would cause problems, unless we retrieve chimeric read pairs with calls to get_mate(), which would have performance issues. If we did implement sorted output order, I think we would definitely want to make this optional and warn users about the memory and performance issues.
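The 'rolling cache' idea can be sketched with a min-heap. This is an illustration under stated assumptions, not how UMI-tools is implemented: the function name `rolling_sorted`, the `(read_start, aln_start, name)` tuples, and the `max_span` bound are all hypothetical. It assumes reads arrive in non-decreasing read start order, that every read satisfies `aln_start >= read_start - max_span`, and it ignores chimeric pairs and contig boundaries entirely.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def rolling_sorted(
    reads: Iterable[Tuple[int, int, str]], max_span: int
) -> Iterator[Tuple[int, str]]:
    """Re-order reads from read-start order into alignment-start order.

    reads: (read_start, aln_start, name) tuples, non-decreasing in read_start.
    max_span: assumed upper bound on read_start - aln_start for any read.
    Yields (aln_start, name) in non-decreasing aln_start order.
    """
    cache: List[Tuple[int, str]] = []  # min-heap keyed on aln_start
    for read_start, aln_start, name in reads:
        heapq.heappush(cache, (aln_start, name))
        # Any later read has aln_start >= read_start - max_span, so cached
        # reads at or below that position can no longer be overtaken.
        while cache and cache[0][0] <= read_start - max_span:
            yield heapq.heappop(cache)
    # Input exhausted: flush whatever is left, still in sorted order.
    while cache:
        yield heapq.heappop(cache)


incoming = [
    (100, 100, "r1"),
    (160, 160, "r3"),
    (170, 120, "r2"),  # reverse-strand read: aligned start is upstream
    (300, 300, "r4"),
]
ordered = list(rolling_sorted(incoming, max_span=60))
print(ordered)
```

The cache only ever holds reads within `max_span` of the current position, which is where the memory saving over caching a whole contig would come from. A read whose mate lies further away than `max_span` (or on another contig) would break the assumption, which matches the concern about chimeric pairs raised above.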
No, I think we would have to output to a temp file and then sort to the output file.
Ah OK. Yeah, I see no reason not to add this as an option. Perhaps make clear to the user that this is how it's going to work so they don't try to pipe the output, not that I think anyone is probably doing this?
@TomSmithCGAT @IanSudbery Ok, so just to be clear, the sensible thing to do in a workflow utilizing umi_tools dedup is to sort before and after? I haven't actually assessed speed differences between sorting or not before dedup, but I am assuming it would make a significant difference? Honestly, for me, including an extra sort step doesn't make that much of a difference, though I still like to keep things as optimized as possible.
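For reference, the "sort before and after" workflow described here could look roughly like the following sketch. The helper names `dedup_pipeline_commands` and `run_pipeline` and the intermediate file names are hypothetical; the commands use only `samtools sort -o`, `samtools index`, and `umi_tools dedup` with `-I`, `-S` and `--paired`, and any other options your data needs are left out.

```python
import subprocess
from typing import List


def dedup_pipeline_commands(in_bam: str, prefix: str) -> List[List[str]]:
    """Build (but do not run) a sort -> index -> dedup -> sort pipeline.

    File names derived from `prefix` are illustrative assumptions.
    """
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    final_bam = f"{prefix}.dedup.sorted.bam"
    return [
        # Coordinate-sort and index the input before deduplication.
        ["samtools", "sort", "-o", sorted_bam, in_bam],
        ["samtools", "index", sorted_bam],
        # Paired-end deduplication; output may not be fully sorted.
        ["umi_tools", "dedup", "--paired", "-I", sorted_bam, "-S", dedup_bam],
        # Re-sort so that downstream tools see coordinate-sorted input.
        ["samtools", "sort", "-o", final_bam, dedup_bam],
    ]


def run_pipeline(in_bam: str, prefix: str) -> None:
    """Run each step in order, stopping at the first failure."""
    for cmd in dedup_pipeline_commands(in_bam, prefix):
        subprocess.run(cmd, check=True)


commands = dedup_pipeline_commands("sample.bam", "sample")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_pipeline("sample.bam", "sample")` would execute the steps, assuming `samtools` and `umi_tools` are on the `PATH`.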
Hi @jhl667. An option to sort the output will be included in version 0.5. See #120.
I'm closing this issue now as the run time issues seem to have been dealt with. Any future issues with run time can be discussed in a separate issue.
I started with a 12 GB BAM (77 mil reads) using 3 cores and 24 GB RAM. It did not complete in 2.5 hours or even come close. The output BAM after 2.5 hours was 2.3 MB. Then I extracted one chromosome to create a 314 MB BAM. Dedup was run on this small BAM for 1 hour and it was still incomplete. A truncated BAM file of 12.4 MB was exported. I ran qualimap bamqc on both files (the 1-chromosome BAM before dedup and after dedup). So, I suppose the deduplication must be working.
Extrapolating from this run time, my 12 GB BAM would take a lot more than 39 hours. Would this be an expected run time? I was wondering if I was doing something wrong or if some settings/switches are not optimal.
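The 39-hour figure appears to come from scaling the 1-hour partial run by file size. A quick check, assuming run time grows linearly with BAM size (a simplification, since the time spent on unmatched mates need not scale that way) and taking 12 GB as 12 × 1024 MB:

```python
small_bam_mb = 314          # single-chromosome BAM
full_bam_mb = 12 * 1024     # full BAM, assuming 12 GB = 12 * 1024 MB
hours_on_small = 1.0        # run was still incomplete after this long

# Linear-scaling assumption: time is proportional to file size.
scale = full_bam_mb / small_bam_mb
lower_bound_hours = scale * hours_on_small

print(f"size ratio: {scale:.1f}x, lower bound: {lower_bound_hours:.1f} h")
```

Because the small run had not finished after an hour, this is only a lower bound on the estimate.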
This is my script: