The memory usage depends not just on the total % of UMI space used, but on how densely populated the used part of the space is.
You currently have 904,608 UMIs.
With a 20nt UMI and an edit threshold of 2, each UMI has a 60 × 60 = 3,600 UMI edit ball around it. If each of your 904,608 UMIs were fully connected to their edit ball, the network would have 904,608 × 3,600 connections in it. That's around 3.3 billion connections. (Of course that can't actually be true, because only when every UMI in the space is used can every UMI be fully connected, but let's assume the majority, in the centre of the network, are.) If each connection takes ~40 bytes, that gives a total memory of around 121GB.
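In code, the same back-of-envelope arithmetic (the ~40 bytes per connection and the 60 × 60 ball size are rough assumptions):

```python
# Rough worst-case estimate of network memory, matching the numbers above.
n_umis = 904_608          # observed UMIs
ball_size = 60 * 60       # ~3,600 UMIs within edit distance 2 of each UMI (rough bound)
bytes_per_edge = 40       # assumed memory cost per connection

edges = n_umis * ball_size
print(f"{edges:,} connections")                       # ~3.3 billion
print(f"~{edges * bytes_per_edge / 2**30:.0f} GiB")   # ~121 GiB
```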
The easiest way to reduce this would be to set the edit distance threshold to 1 rather than 2. One might argue that if nanopore is sufficiently error prone, you will get reads with 2 errors in them. This is undoubtedly true. But my guess is that you would also get the 1-edit intermediates in higher numbers, and so the networks would still build.
The edit ball around a UMI with an edit distance threshold of 1 is 60 times smaller than the edit ball around a UMI with a threshold of 2.
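For concreteness, the substitution-only ball sizes can be counted exactly (the 60 × 60 = 3,600 figure above is the looser worst-case bound; indels would add more):

```python
from math import comb

def substitution_ball(length, max_dist):
    """Sequences within max_dist substitutions of a UMI (excluding the UMI itself)."""
    return sum(comb(length, k) * 3**k for k in range(1, max_dist + 1))

print(substitution_ball(20, 1))   # 60
print(substitution_ball(20, 2))   # 1770 exact; 60 * 60 = 3600 is the looser bound used above
```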
Thanks for the fast and thorough explanation!
It's currently running with an edit distance of 2 and available memory of ~2x your estimated usage. Hopefully that will complete and I will then be able to compare to an edit distance of 1 and decide if it's worth doing 2 in the future.
One thing I'm a bit unclear on, though, is the assumption that the majority of UMIs in the center of the network are connected. Given how sparsely I should be sampling the UMI space (there should only be ~100,000 true UMIs, but we'll see how that estimate holds up), I would expect only about 1/10 of the UMIs to be fully connected. I suppose that, with some error in my library prep or in the memory-usage assumptions, that could still reasonably use up 60GB?
Haven't looked into this much, but do you know of an easy way to track memory usage on a cluster?
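(One scheduler-agnostic fallback would be to wrap the command in a small script that reports the child process's peak resident set size afterwards; a minimal sketch, assuming a Linux system where ru_maxrss is reported in kilobytes:)

```python
# Sketch: run a command as a child process and report its peak memory afterwards.
# On Linux, ru_maxrss is in kilobytes (on macOS it is in bytes).
import resource
import subprocess
import sys

subprocess.run(sys.argv[1:], check=True)   # e.g. umi_tools group ...
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak RSS of child processes: {peak_kb / 1024:.1f} MiB", file=sys.stderr)
```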
Once my job(s) finish, I'll post results from running with an edit distance of 1, as well as 2 if that works, and close the issue.
Sorry, that was a worst case calculation.
Closing due to inactivity.
Hello! I've been developing a pipeline that generates consensus sequences of PCR amplicons: a custom script extracts barcodes and appends them to the read name, umi-tools group then groups the UMIs appropriately, and another custom script generates consensus sequences from the group output.
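For illustration, a schematic sketch of the barcode step (not my actual script; extract_umi below is just a placeholder for the real barcode-finding logic, and the underscore separator is what umi-tools group expects by default with --extract-umi-method=read_id):

```python
# Schematic only: append an extracted UMI to each FASTQ read name so that
# umi-tools group can recover it from the read ID (default separator "_").

def extract_umi(seq):
    # Placeholder for the real extraction logic (e.g. locating flanking sequences).
    return seq[:20]

def append_umi_to_read_names(fastq_in, fastq_out):
    with open(fastq_in) as fin, open(fastq_out, "w") as fout:
        while True:
            header = fin.readline().rstrip()
            if not header:
                break
            seq = fin.readline().rstrip()
            plus = fin.readline().rstrip()
            qual = fin.readline().rstrip()
            name, _, rest = header[1:].partition(" ")
            umi = extract_umi(seq)
            suffix = f" {rest}" if rest else ""
            fout.write(f"@{name}_{umi}{suffix}\n{seq}\n{plus}\n{qual}\n")
```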
Because the nanopore platform produces lots of errors, reads of the same amplicon map to slightly different positions, and I need to use one of the network-based methods from umi-tools to account for the inevitably large number of errors. I am using the --per-gene option so that all reads are treated as coming from the same position and only UMIs are used for grouping; this should be fine because my UMI extraction step ensures that only true reads of the target amplicon make it into the umi-tools group step.
This has worked well for sequencing runs with on the order of 100,000 reads as input, but when I ran it on a dataset with 1.6 million reads of the 1.6 kb target gene (a 2.5 GB BAM file), it consumed all available memory (~60 GB), generating only the following output before the job was killed.
Based on some previous issue threads, I tried running it again with --method=unique, and that was successful despite being run on my local machine with only 10 GB of RAM available:
My UMI is 20 nucleotides long, so even if all the UMIs were unique, that would be < 0.0002% of the possible UMIs, so I should be in good shape there. This seems like a fairly unusual use case, though, so perhaps there is some option you know of that might help reduce memory usage? I'm currently trying to run it again on a higher-memory node, but long term I'd like to be able to run this step locally, so keeping it under ~60 GB of memory usage would be very helpful.
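(A quick sanity check of that fraction:)

```python
# Even if every one of the 1.6 million reads carried a distinct UMI,
# only a tiny fraction of the 4^20 possible 20-nt UMIs would be used.
n_reads = 1_600_000
umi_space = 4 ** 20                          # 1,099,511,627,776
print(f"{100 * n_reads / umi_space:.5f}%")   # ~0.00015%, i.e. < 0.0002%
```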