CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

dedup dir-adjacency too slow/does not complete #31

Closed royfrancis closed 7 years ago

royfrancis commented 8 years ago

I started with a 12gb BAM (77 mil reads) using 3 cores and 24GB RAM. It did not complete in 2.5 hours or even close. Output BAM after 2.5 hours was 2.3MB. Then I extracted one chromosome to create a 314MB BAM. Dedup was run on this small BAM for 1 hour and it was still incomplete. A truncated BAM file of 12.4MB was exported. I ran qualimap bamqc on both files (1 chromosome bam before dedup and after dedup).

file        before dedup    after dedup
reads       5,104,918       197,280
dup rate    38.74%          2.27%

So, I suppose the deduplication must be working.

Extrapolating from this run time, my 12gb BAM would take a lot more than 39 hours. Would this be an expected run time? I was wondering if I was doing something wrong or whether some settings/switches are not optimal.

This is my script:

umi_tools dedup \
--method="directional-adjacency" \
--paired \
--output-stats="file-dedup.stats" \
-v 3 \
-E "file-dedup.error" \
-L "file-dedup.log" \
-I "file.bam" \
-S "file-dedup.bam"
TomSmithCGAT commented 8 years ago

Hi Roy,

I've not used qualimap bamqc, but from a quick read of the documentation it appears it detects duplicates using the start position. If you're using an aligner which can "clip" the reads, it may misestimate the duplication rate. Would you mind running Picard MarkDuplicates as well to compare the output?

Regarding the time taken, this is indeed very slow. Would you be OK to share the one-chromosome 314MB undeduplicated BAM with me? I can then see whether I get the same performance and, if so, where the bottleneck is.

royfrancis commented 8 years ago

Hi Tom, Here is the bam file for one chr. Paired-end .fq files were processed using umi_tools extract and mapped to Zebrafish Genome using STAR (default settings with gtf). I haven't used MarkDuplicates, but I can look into that.

royfrancis commented 8 years ago

MarkDuplicates does not seem to complete (no output bam or metrics report) with truncated BAM file.

TomSmithCGAT commented 8 years ago

Quick side note. You have lots of multimapped reads in your BAM.

samtools view file.bam|cut -f12|sort|uniq -c|sort gives:

  26598 NH:i:10
  28680 NH:i:9
  32512 NH:i:8
  34716 NH:i:7
  47274 NH:i:6
  65130 NH:i:5
 132028 NH:i:4
 284914 NH:i:3
 735984 NH:i:2
3717082 NH:i:1

As it stands, for a read which maps to 10 locations, each mapping location will be considered independently. Thus, the final output will contain between 0 and N alignments for each read, where N is the number of alignments (NH tag). I'm going to add an option to discount secondary and supplementary alignments, which you might want to consider.
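
In the meantime, a minimal pysam sketch of that kind of pre-filter (hypothetical file names; equivalent to samtools view -b -F 0x900), in case you want to drop secondary/supplementary alignments yourself before dedup:

```python
import pysam

# Keep only primary alignments, dropping secondary (0x100) and
# supplementary (0x800) records before deduplication.
with pysam.AlignmentFile("file.bam", "rb") as infile, \
     pysam.AlignmentFile("file.primary.bam", "wb", template=infile) as outfile:
    for read in infile:
        if read.is_secondary or read.is_supplementary:
            continue
        outfile.write(read)
```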

TomSmithCGAT commented 8 years ago

On to the real issue of the time taken.

This looks to be solely due to the time taken to find the paired end mates. You can test this yourself by dropping the --paired option. I don't have the profile stats yet because it's still not finished 2 hours later(!).

You have a relatively low duplication rate so most positions contain just a single read. The outputting therefore requires many calls to the pysam.AlignmentFile.mate() function, which is not an efficient way to retrieve the paired end read. Ideally, when dealing with paired reads, the BAM is sorted by read name. However, we need a coordinate-sorted BAM in order to perform the duplicate detection. It looks like we might need to implement some sort of caching to speed up the retrieval of the paired ends.

@AndreasHeger Do you have any suggestions for how to best speed up the paired end retrieval? Is caching the right way to go?

AndreasHeger commented 8 years ago
Hi Tom, mate is indeed best avoided in large datasets. Caching is better, but might be complicated to implement if you want to output a sorted file. It might be easier doing two passes on the bam. To conserve memory, you could do the double pass per chromosome or smaller region. Best, Andreas


IanSudbery commented 8 years ago

Thanks for that @AndreasHeger. The output is already non-sorted, so perhaps we should consider caching.

TomSmithCGAT commented 8 years ago

@IanSudbery

A very simplistic "two-pass" approach would be to run through the bamfile one chromosome at a time and store the reads as an ordered dictionary mapping read IDs to their paired reads, then re-pass through the keys to perform the dedup, emptying the dictionary after each chromosome. I'll test this out as an immediate solution.
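
A rough sketch of the idea, with hypothetical file names and a placeholder keep_read() standing in for the actual directional-adjacency decision:

```python
import pysam

def keep_read(read):
    # stand-in only: in UMI-tools this is where the directional-adjacency
    # clustering decides which read1s survive deduplication
    return True

# assumes an indexed, coordinate-sorted BAM so fetch() can walk one contig at a time
infile = pysam.AlignmentFile("file.bam", "rb")
outfile = pysam.AlignmentFile("file-dedup.bam", "wb", template=infile)

for contig in infile.references:
    # pass 1: cache every primary read2 on this contig, keyed by read name
    mates = {}
    for read in infile.fetch(contig):
        if read.is_read2 and not read.is_secondary and not read.is_supplementary:
            mates[read.query_name] = read

    # pass 2: dedup the read1s and emit each kept read1 together with its cached mate
    for read in infile.fetch(contig):
        if not read.is_read1 or read.is_secondary or read.is_supplementary:
            continue
        if keep_read(read):
            outfile.write(read)
            mate = mates.get(read.query_name)
            if mate is not None:   # chimeric pairs won't be in this per-contig cache
                outfile.write(mate)
    mates.clear()                  # empty the dictionary after each chromosome

outfile.close()
infile.close()
```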

Alternatively, we could cache as we go but this will be a little more complicated within our current workflow? I'll have a think about how to best perform the caching.

Either way, this would have to be done on a per-chromosome basis to avoid potentially huge memory requirements for chimeric read pairs. Perhaps we should fall back on the .mate() function for these read pairs?

What do you think?

IanSudbery commented 8 years ago

@TomSmithCGAT Two pass is definitely easier to implement, but we would need to do very careful memory profiling. When I was caching whole chromosomes before, memory usage was getting up into the 10s of GB for large files. But I was caching at the read bundle level (i.e. pre-cluster collapsing), so it might be better with caching after this.

I think falling back on the .mate() for chimeric reads is the way to go.

A dynamic cache that saved read2s in a dict, then threw them away or output them as their mates were processed, would probably be more memory efficient, but harder to do. Most reads would not spend long in the cache because their mate would be nearby. It might also be slower, though, because for every read processed you'd have to check whether its mate was in the cache, and add it to the cache if not.
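
A minimal sketch of that dynamic cache, assuming hypothetical file names and showing the pairing logic only (no dedup):

```python
import pysam

# park each alignment by name until its mate turns up, then write both
# and free the cache entry
cache = {}
with pysam.AlignmentFile("file.bam", "rb") as infile, \
     pysam.AlignmentFile("file-paired.bam", "wb", template=infile) as outfile:
    for read in infile:
        if read.is_secondary or read.is_supplementary or not read.is_paired:
            continue
        mate = cache.pop(read.query_name, None)
        if mate is None:
            cache[read.query_name] = read   # mate not seen yet
        else:
            outfile.write(mate)
            outfile.write(read)
# anything left in `cache` is a chimeric or orphaned pair and would need the
# slow .mate() fallback discussed above
```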

TomSmithCGAT commented 8 years ago

@IanSudbery Could you take a look at a 2pass option I've implemented in 35c776c.

I had to change things around a bit to allow the 2 pass option to store the read pairs on a per-contig basis. Note the new class "Writer", which takes the reads, passes them to the "ClusterAndReducer" functor and has methods to write out the resultant reads and stats.

I've tested it on the above single chromosome BAM (5M reads, 2M unique read pairs). The original code took 19 hours! With "--2pass" it takes 22 mins. The maximum Virtual Memory Size according to "ps" was 2.1 Gb.

IanSudbery commented 8 years ago

@TomSmithCGAT: A couple of points

1) What happens if there are multimappers?
2) Isn't it more memory efficient to dedup, save the read1s you want to output, and then go through looking for the read2s?
3) I think the execution path is getting a bit complicated.

I'm working on a suggestion.

TomSmithCGAT commented 8 years ago

@IanSudbery 1) I think this is a general area that needs more consideration. I'm not sure there is any way to sensibly deal with secondary/supp. alignments in the deduplication unless we keep another dictionary mapping primary to secondary/supp. alignments and output all alignments if the primary is retained. We wouldn't want to output secondary/supp. alignments in cases where the primary was removed, would we? As it stands, the secondary/supp. alignments are skipped.

2) You're right, that's a much better approach! E.g run the 2 pass the other way around - thanks!

3) I agree. I guess the execution path will change for your implementation based on point 2?

IanSudbery commented 8 years ago

RE: multimappers - I think the cleanest way to deal with multimappers is to say it's not our problem. The user should deal with them independently. If they want to leave them in, that's their decision - I think we should leave them in if they are there. The only correct alternative would be to consider all positions simultaneously. That's beyond our scope at the moment, I reckon.

I'm less sure about how to deal with chimeric alignments.

TomSmithCGAT commented 8 years ago

I agree this should be up to the user. On reflection perhaps the secondary/supp. skipping should be optional rather than enforced. I think we should mention multi-mapping in the documentation.

I think we should take a similar approach to chimeric reads - leave them in (pull out the paired end with .mate() as we discussed) and have an option to skip them entirely at the user's choosing. Out of interest, what TLEN do chimeric reads get in the BAM?

IanSudbery commented 8 years ago

I'd go further than that. I'd say that if they want to filter out secondary/supplementary alignments, let them do it themselves, in the way they see fit.

I'm not really sure how mate pairing works in the presence of supplementary alignments. Is the mate of the primary read1 the primary read2, or just the next one in the chain? I suspect this varies from mapper to mapper. The same could be true of TLEN.

Check out my proposal for the 2 pass pairing in bbbe020. It's a bit more complicated than I'd anticipated because it has to deal with the chimeric alignment problem, but I think it still represents less radical surgery to the execution path. I profiled using:

time -v umi_tools dedup -I in.bam -S /dev/null --paired -v5 [--2pass]

(I didn't add a separate option for 2pass, but made it how paired retrieval works)

It's about 25% slower than your solution (5 min 30 sec rather than 4 min 20 sec for the test BAM), but uses a quarter as much RAM (resident set 500 MB vs 1.8 GB).

TomSmithCGAT commented 8 years ago

@IanSudbery I like the simplicity and cleanness of re-defining the outfile object and its write and close methods for paired end reads. Nice!

I believe lines 959-964 need to be removed so that read pairs are only written once when using the --ignore-umi option.

I'm surprised it's any slower to be honest but the gain in memory efficiency more than outweighs this small difference.

TomSmithCGAT commented 8 years ago

Regarding secondary/supp. alignments in paired end data, STAR appears to output a proper pair of alignments, so each alignment1 has its own alignment2, even where the alignment1 is the same in both cases. I expect this will indeed vary for other mappers. Let's do as you say and leave it to the user to generate a suitable BAM, but include some more information in the documentation regarding multimapping.

IanSudbery commented 8 years ago

@TomSmithCGAT Thanks for catching that - I've sorted it and submitted a PR.

WRT secondary/supp - I was more worried about supplementary than secondary - i.e. where different parts of read1 are mapped to several locations, as are several parts of read2. I think STAR outputs these to a separate file, but I think BWA puts them all in the same file.

TomSmithCGAT commented 8 years ago

@royfrancis Could you update UMI-Tools to the latest version (v.0.2.0). This should speed up the de-duplication stage considerably.

royfrancis commented 8 years ago

Hi Tom, I just tried on the same 1 chromosome bam and it completed in 60 mins (2 cores+16gb RAM). For comparison, I filtered uniquely mapped reads into a new bam file and ran dedup again. This time it finished in 37 mins. The good news is that dedup reaches completion. I estimate my original bam file (unique filtered) to take about 16 hours. Btw, does this new version v.0.2.0 incorporate the changes in extract #26?

IanSudbery commented 8 years ago

Hi @royfrancis,

I don't suppose it's much help, but you are probably getting hammered by the stats computation: running your chr1 BAM on my machine takes 2 mins 30 without stats and 25 mins with stats. This is probably due to us not having found an efficient way to randomly sample reads from a BAM file.

royfrancis commented 8 years ago

I ran my full bam (unique filtered, 7gb) yesterday for 18 hours and it still failed. Is there some way to output the progress, so as to estimate how much of the bam file was completed?

IanSudbery commented 8 years ago

Hi Roy,

If you turn the verbosity up to 5 with -v5, a message should be printed to the log every 10,000 alignments processed.

IanSudbery commented 8 years ago

Hi @royfrancis, While we work on the speed and memory efficiency of the stats module, perhaps one solution would be to run dedup without stats on your complete data set, and just run the stats module on a one chromosome subset.

royfrancis commented 8 years ago

It still did not work for me after a 10hr run time. I am not sure what stats is, but I excluded --output-stats=, like so:

umi_tools dedup \
--method="directional-adjacency" \
--paired \
-v 5 \
-E "fn-error.txt" \
-L "fn-dedup-log.txt" \
-I "file.bam" \
-S "file-dedup.bam"
IanSudbery commented 8 years ago

Yes, exactly.

IanSudbery commented 8 years ago

Hi Roy, was there anything in the error or log file?

royfrancis commented 8 years ago

Here is the log file after running the full bam for 5 hours (3 cores, 12gb ram) without stats. The run was incomplete and timed out. A 1.5gb dedup bam file was produced. The error file was empty. log.txt

TomSmithCGAT commented 7 years ago

@royfrancis Did you make any more progress with this? Have you tried re-running with the latest version of UMI-tools?

royfrancis commented 7 years ago

Sorry. I haven't had the time to try it out. I am occupied with different work till March/April.

gpratt commented 7 years ago

To chime in, I'm using the most up to date version of UMI-tools, and I'm still getting hit by very slow computation for a number of my paired-end datasets; I'm up to 36 hours in a number of cases. I'm pretty busy at the moment too, but this looks like it's still an issue worth working on.

TomSmithCGAT commented 7 years ago

Thanks for the update @gpratt. How many reads do you have per sample? 36 hours sounds like too long unless you have an incredibly high depth and/or very high error rates. The run time should be linear with respect to the number of reads for samples with very low duplication but it will be quadratic with respect to the complexity of the networks for samples with higher duplication rates.

Would you be OK to share a subset of the data (e.g chr19) so we can find the bottleneck for your sample?

gpratt commented 7 years ago

Here is my full sample. After looking at the datasets that failed / processed slowly, I don't think it's an issue with the number of reads in the dataset; what I sent you has 12,935,516 reads, pre-duplicate removal. I've deduped many datasets with a higher sequencing depth.

I'm going to guess the problem is with binding locations. NSUN2, the RBP that I gave you, is a known tRNA binder; the majority of the reads map to just a few tRNAs in the genome. Maybe the bundle is getting too big to handle?

Just so you know, the bam file I gave you has a chromosome for each repetitive element in the human genome, in addition to the regular chromosomes. It's probably not an issue because I've processed a bunch of datasets using the same approach with no slowdowns, but it's still non-standard so worth thinking about.

Thanks for the help!

gpratt commented 7 years ago

I did a bit of digging myself to see if I could figure out what the problem is. For my issue the slowdown lies in breadth_first_search. The way it's implemented with the loop is really inefficient for large graphs. I'm pretty sure the slowdown has to do with the difference update code.

Good news is that switching to breadth_first_search_recursive gives a huge performance boost, and the boost is larger for larger adjacency lists.
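
For illustration only (this is not the UMI-tools code), a visited-set BFS touches each node and edge once instead of doing a whole-set difference update at every level:

```python
from collections import deque

def connected_component(start, adj_list):
    # Visited-set BFS: each node is enqueued once and each edge examined once,
    # so there is no repeated whole-set difference_update as the component grows.
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj_list[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# toy adjacency list for a bundle of UMIs one substitution apart (illustrative only)
adj = {"ATTG": ["ATTA"], "ATTA": ["ATTG", "ATTC"], "ATTC": ["ATTA"], "GGGG": []}
print(connected_component("ATTG", adj))   # {'ATTG', 'ATTA', 'ATTC'}
```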

If you want to try and reproduce yourself, I pulled and explored bundles from this chrom

infile = pysam.Samfile("https://s3-us-west-1.amazonaws.com/sauron-yeo/NSUN2.bam")
inreads = infile.fetch(reference="tRNA-Pro-TGG-2-1_withgenomeflank")

There are a few bundles in that chromosome that are > 10,000.

From a different issue it looks like you're actively working on a C or Cython version of the part that's slow, so it might be best to wait till you get that working properly.

TomSmithCGAT commented 7 years ago

Hi @gpratt. I'm really glad to hear breadth_first_search_recursive improves the run time - this was the first thing I'd planned to check. We had planned to implement the recursive search from version 0.3.5 onwards but the output wasn't consistent with the non-recursive search and I haven't worked out why yet. Do you get the same output for the recursive and non-recursive search functions with that single chromosome?

The C library will indeed hopefully speed things up. This will also make the network methods available independently of UMI-tools which we hope will prove useful where others have similar issues with sequencing errors etc. This is a relatively low priority issue though so this could take a while.

gpratt commented 7 years ago

I wasn't able to check that chromosome; from my profiling it looks like the non-recursive version of the breadth first search would take over 24 hours to finish on a single bundle, and there are 3-4 bundles in that chromosome like that.

Checking that is low priority for me too, although if I get some downtime, I'm in this problem enough that I'll come back to it. FYI I'm pretty sure (although I haven't totally thought out the entire algorithm) that the slowdown in the non-recursive version comes from the difference update line; the Wikipedia version does two loops instead of using the fancy set features. If the recursive version has problems with overflows, that might be a better solution anyway.

TomSmithCGAT commented 7 years ago

I'll have a play with the breadth_first_search and see if we can reduce the run time by removing some of the set operations.

I downloaded the sample you linked to but there don't appear to be any UMIs in this sample?

AAACCCCACC:HWI-D00611:179:C7MR1ANXX:5:1213:10274:14684  163 chr1    26547764    255 43M =   26547775    49  GGCCTGGTGTGGTGGCTCATGCCGGTAATCCCAGCACTTTGGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1  HI:i:1  AS:i:80 nM:i:0  NM:i:0  MD:Z:43 jM:B:c,-1   jI:B:i,-1   RG:Z:foo
gpratt commented 7 years ago

Whoops, I forgot to adjust the barcode from our in-house location to yours. For some reason, years ago I decided that the UMI should go in front of the read name; AAACCCCACC is the UMI in this case.
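
For anyone else hitting this, a minimal sketch of the conversion, assuming the UMI is everything before the first colon and that dedup expects it appended to the read name after an underscore (the umi_tools extract default); file names are hypothetical:

```python
import pysam

# move a prefixed UMI ("AAACCCCACC:HWI-D00611:...") to a suffixed one
# ("HWI-D00611:..._AAACCCCACC")
with pysam.AlignmentFile("NSUN2.bam", "rb") as infile, \
     pysam.AlignmentFile("NSUN2.fixed.bam", "wb", template=infile) as outfile:
    for read in infile:
        umi, _, rest = read.query_name.partition(":")
        read.query_name = rest + "_" + umi
        outfile.write(read)
```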

Either way, I've uploaded a fixed version of the bam file to the same location, should work out of the box now.

I also did what I said would happen, got bored and wrote the fix :). It worked well on my test case that previously finished running, and now I'm testing it on the files that were issues in the past. I'll let you know if those worked / finished soon.

jblachly commented 7 years ago

Hi @TomSmithCGAT and @IanSudbery

Congratulations on your paper! I have been following since the blog post. I've just loaded up UMI-tools for the first time this weekend as we finally have our data back, and have also been struck by the incredibly slow speed. I found this issue and thought I would contribute some data. Perhaps it is a misconfiguration on my end? I installed with sudo pip and have also run it once using sudo umi_tools dedup as instructed. There is no CPU or memory pressure on this system. Is it possible Cython is not being engaged properly?

Anyway, using ca. 1.5M reads of paired-end targeted resequencing data (where an individual locus could have high [>1000X] depth) I am seeing 90 minute runtimes. Unfortunately the worsening is exponential -- the latest run took 9 hrs for 2.5M reads. I wrote a script to downsample a ~1.5M read BAM file to 10%, 20%, 40%, and 80% and check times both with and without stats. FWIW, the proportional contribution of stats generation to overall runtime seems to be less the more reads one has, so I don't think the suggestion earlier in the thread that this could be stats-related is entirely accurate in my case.

Anyway, here are the data. Values are in seconds. Yes, the last row is downsampled to 99% (not 100%) as the script was easier that way =) Please let me know if you have suggestions to speed this up as it is kind of a blocker ...

reads      no stats   with stats
149749     69         98
298075     237        307
593073     837        1031
1174608    3232       3664
1446610    4766       5312

gpratt commented 7 years ago

Try installing from the master branch rather than from pip. My version of this issue was fixed a week or two ago in master, but I'm guessing the change hasn't been pushed to conda / pip yet.

jblachly commented 7 years ago

@gpratt I have the latest commit from the master branch, version 0.4.3, installed through pip. They certainly are on the ball with pip.

TomSmithCGAT commented 7 years ago

Hi @jblachly. Thanks for the detailed break down of the run time. From what you say, it sounds like the issue is the network building stage, which is O(n^2) since we perform an all vs. all comparison between the UMIs. Would you mind sharing your BAM so we can profile the script and see what the bottleneck is?
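
For illustration (not the UMI-tools implementation, which for the directional method also weights edges by UMI counts), the quadratic step is essentially this all-vs-all comparison:

```python
def hamming(a, b):
    # substitution-only edit distance, fine for fixed-length UMIs
    return sum(x != y for x, y in zip(a, b))

def build_adjacency(umis, threshold=1):
    # naive all-vs-all comparison: O(n^2) in the number of unique UMIs at a
    # position, so a handful of very deep loci can dominate the run time
    adj = {u: [] for u in umis}
    for i, u in enumerate(umis):
        for v in umis[i + 1:]:
            if hamming(u, v) <= threshold:
                adj[u].append(v)
                adj[v].append(u)
    return adj

print(build_adjacency(["ATTG", "ATTA", "GGGG"]))
# {'ATTG': ['ATTA'], 'ATTA': ['ATTG'], 'GGGG': []}
```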

IanSudbery commented 7 years ago

Hi @TomSmithCGAT, except that I don't think the stats/no stats difference tallies with that: when building stats, we do the all vs all comparisons with the randomly sampled reads, as well as the actual reads at the locus. Should stats therefore not, at the very least, double the amount of time spent building edit-distance matrices? This suggests that once again it is the connected component finding, rather than the edit distance calculation, that is to blame, as we don't actually build the components for the stats.

Interestingly, the increase we are seeing in those numbers is sub-quadratic (every doubling in reads leads to a three-fold increase in time, whereas quadratic would predict a four-fold increase). To me, the problem here is not the increase in time with increasing depth, but the large amount of time taken to process even the smallest sample. However, the jump from 1.5 hrs to 9 hrs for 1.5M -> 2.5M reads doesn't seem to fit that pattern.

@jblachly Most of our users are doing RNA-seq and often see their run times dominated by the single highest depth locus. They are seeing runtimes of days caused by deduplication problems at single locations with 10s or even 100s of thousands of reads - but they only have a single locus like this. By having 1000s of reads per locus, but many loci like this, you may have hit on a particularly pathological case. Nonetheless, 90 seconds for 100,000 reads feels wrong.

Your signature states you are an MD, and you are doing resequencing, so I realise that this might be patient data that you are unable to share. However, one solution I can see is to strip the sequence out of the BAM files - BAM files are still valid if the sequence is replaced with '*'. There are two ways to do this: the tool bam2bam from the cgat package has a --strip-sequence option, or you could use samtools view to convert to SAM and then use awk to remove the sequence column. We could then rebuild the BAM at our end. If you send SAM, make sure you zip it first!

If you are worried about us knowing the identity of the genes you are resequencing, then I suggest adding a secret number to each read position (the same number to every read, or even a different number for every gene - as long as it was the same number within a gene).

jblachly commented 7 years ago

Dear @IanSudbery , thanks for your willingness to help with this. I have sent you a note under separate cover with a link to the BAM files.

IanSudbery commented 7 years ago

Okay, I have located the problem here...

You have many reads where read1 is on a different chromosome to its mate. This causes some issues: at the end of each chromosome we output all the mates from that chromosome. If the mates aren't found, they accumulate in a buffer and we go searching for them at the end. This exhaustive search at the end is slow, and is only there as a last resort.

It is particularly slow in your case because, to do the search, all reads that overlap the supposed location are retrieved and searched one by one for the mate. As you have many reads at one location, this is slow. I'm working on something better.

jblachly commented 7 years ago

@IanSudbery -- thank you so much for tracking this down. We are working with cancer samples, so the mate mapping to a different contig is going to be a common theme for us (and others in this field). Let me know if I can help in any way with the improvement (I can test other branches etc.).

IanSudbery commented 7 years ago

Dear @jblachly,

I've pushed a branch "getting_mates" (see commit c96dd45). It definitely makes things faster for you. On my machine your samples finish in 2 minutes and 3 minutes respectively without stats. The only question is what it does to other people's data.

@TomSmithCGAT Instead of trying to fetch all the reads at the mate location and then sorting through them looking for the read in question, this commit just reopens the input BAM file, and scans through the whole thing looking for mates. For the files at hand it takes seconds (1.5M and 2.5M reads respectively). I don't know what it would do to someone with 100M reads where the mates were not all concentrated in one location.
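
Roughly, that final-pass mate rescue amounts to the following sketch (illustrative names, not the actual getting_mates code):

```python
import pysam

def rescue_mates(bam_path, missing, outfile):
    # `missing` is a set of read names whose mates were never seen while
    # streaming their contig; one linear scan of the file picks them up
    with pysam.AlignmentFile(bam_path, "rb") as infile:
        for read in infile:
            if read.is_read2 and read.query_name in missing:
                outfile.write(read)
                missing.discard(read.query_name)
                if not missing:      # stop early once every orphan is resolved
                    break
```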

TomSmithCGAT commented 7 years ago

@IanSudbery. At worst this will only ever mean passing through the whole BAM file once more which is tolerable compared to the pathologically poor performance in this case when using the index to try and specifically retrieve the missing pairs. This will also obviously only ever be triggered when the user has chimeric reads since otherwise the read1s set will be emptied with each call to write_mates. I've made a couple of comments on the commit. I'm very happy to merge this into master when it's ready.

It looks like this method of outputting the orphan read pairs was implemented in the first instance of TwoPassPairWriter so it could well explain some of the other slow run times?

IanSudbery commented 7 years ago

Could be.

TomSmithCGAT commented 7 years ago

P.S. Amazing improvement in run time by the way @IanSudbery. That's >1 hour down to <5 mins, right?! Thanks @jblachly for bringing this to our attention and sharing your data with us. Hopefully we've covered most of the unusual BAMs now and the run time issues have been largely resolved :crossed_fingers: