HudsonAlpha / fmlrc2


Memory consumption #17

Closed: Akazhiel closed this issue 2 years ago

Akazhiel commented 3 years ago

Hello,

I'd like to ask if there are any tips for reducing the memory consumption of the fmlrc2 process. I'm running it on human data; my FASTA is 111 GB and the index generated from the short reads is 69 GB, but while running, the tool is using 700 GB of RAM. Why does it take that much RAM? Would it be advisable to change some default settings, or perhaps reduce the coverage of the short reads when creating the index?

Best regards,

Jonatan

holtjma commented 3 years ago

Hello,

The short answer is that fmlrc2 (and its predecessor fmlrc) both made the intentional choice to sacrifice memory consumption in favor of speed by default. It's basically using an uncompressed index that typically consumes ~12 bits per symbol. However, since it's uncompressed, random accesses (and therefore queries) can be performed very quickly.
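For a rough sense of scale (my own back-of-the-envelope arithmetic, not anything fmlrc2 reports), the in-memory arrays grow with the total number of short-read symbols at ~12 bits each. With hypothetical numbers, say 60x short-read coverage of a ~3.1 Gbp genome:

```bash
# Hypothetical illustration only: ~3.1 Gbp genome at 60x short-read
# coverage, ~12 bits per symbol for the uncompressed arrays.
echo "3100000000 * 60 * 12 / 8 / 1000000000" | bc
# => 279, i.e. roughly 279 GB for the index arrays alone,
#    before any per-thread or bookkeeping overhead
```

Higher coverage scales that number up linearly, which is why the footprint balloons on deeply sequenced human data.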

At this time, we don't really have a workaround in fmlrc2. However, if this is a major blocker, you can revert to fmlrc v1 with the -i option, which will "build a sampled FM-index instead of bit arrays". You can further reduce the memory consumption with that option by increasing the -F parameter. However, this approach will be quite a bit slower, and it will only get slower as you increase that -F parameter.

At the end of the day, it's a trade-off either way. The community as a whole has tended towards sacrificing memory for speed, so that's what we did in fmlrc2 by default. I have it on the backlog to add some options for those who want to make a different choice, but I'm not sure when I'll get around to implementing those in fmlrc2.

Akazhiel commented 3 years ago

Hello,

How much of a time increase would there be in switching from fmlrc2 to fmlrc v1? As of now it has been running for 2 days using 20 CPUs, and it looks like it will take 2-3 more days to finish. With it taking 700 GB of RAM, we don't really see it as feasible to include in our pipeline. It will all depend on how much improvement we get from the correction.

I'm still dumbfounded at how much RAM it uses when the inputs don't even add up to 200 GB. Perhaps we should've gone with less coverage for the short-reads seeing as you mentioned in an issue on the fmlrc v1 repo that this is what increases the memory.

holtjma commented 3 years ago

How much of a time increase would there be in switching from fmlrc2 to fmlrc v1?

I ran a quick test using the E. coli benchmark data, and it ran in 48m40.583s wall clock with 344m30.815s CPU time on the same test laptop (seems like roughly 20x slower). I used the default of -F 8, but you might want to increase that to at least 10 since your data seems to be very high coverage. Additionally, if the memory footprint becomes more manageable, you could split the long reads into sub-files to parallelize further.
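For reference, here is a sketch of the kind of fmlrc v1 invocation I'm describing (placeholder file names; I'm assuming the usual `<comp_msbwt.npy> <long_reads.fa> <corrected_reads.fa>` positional arguments and that -p is the process count on your build, so double-check `fmlrc -h`):

```bash
# -i switches to the sampled FM-index instead of the bit arrays;
# -F raises the sampling parameter (default 8) to trade speed for memory.
# Paths and -p (process count) are placeholders; verify flags with fmlrc -h.
fmlrc -p 20 -i -F 10 comp_msbwt.npy long_reads.fa corrected_reads.fa
```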

Perhaps we should've gone with less coverage for the short-reads seeing...

I'm not sure how much data you have on either the short- or long-read side of things, and I'm also not entirely sure what "normal" is for human data (the majority of our tests for fmlrc were non-human, so even at high (100x) coverage, the data was comparatively small). If you look at Ratatosk, they use:

"The short reads used are Illumina paired-end reads of length 151 bases with a mean coverage of 42x in the Icelandic trios and 61x in the Ashkenazim trio."

However, even with those datasets, the memory usage for fmlrc (this is v1) is still quite high as reported in their supplement, and it is even higher for the Ratatosk runs. It seems like this is probably the norm with human data.

Additionally, if you look at some of the other recent comparators, specifically this image, most of them are comparable or worse in memory consumption. However, none of these were tested with human data, so be careful not to over-interpret it.

With it taking 700 GB of RAM, we don't really see it as feasible to include in our pipeline.

Honestly, if you're memory-limited, I think you're going to have a hard time running any of the higher-accuracy tools. Ratatosk (which was built specifically for human data) seems to use more memory than fmlrc v1. Some of the tools from the survey paper are more memory-efficient, but they come at a significant accuracy cost, and I'm unsure of their performance on human data (fmlrc included; we never explicitly tested human). Ultimately, you're probably going to have to make a tough decision about how much you care about accuracy vs. compute cost, and then use that to guide whether you look for alternate tools/methods or alternate compute approaches (e.g. cloud or increasing your on-prem footprint).

Sorry, these are probably not the answers you were hoping for, but I think that's just the state of the art currently. Long-term (and really not that far away), I think most of these tools will disappear in favor of HiFi or similar-style datasets.

Akazhiel commented 3 years ago

Additionally, if the memory footprint becomes more manageable, you could split the long reads into sub-files to parallelize further.

Do you mean using the same short-read index but on subsets of the long-read data? Is there any risk of this process causing over-correction of the long reads?

We are not exactly limited by memory, but we just have a 755 GB RAM cluster with 128 CPUs, so it's not ideal for me to use all of the resources. Unfortunately, our sysadmin caused the process to stop, so I'm now trying v1 with the -i option as you mentioned, using 90 cores, and it's only taking 184 GB of RAM. It will probably take a week at this rate.

Would you perhaps recommend using Ratatosk instead, since I'm handling human data and it was tested with that type of data in mind? We really need the most accuracy we can get since we plan on calling structural variants.

holtjma commented 3 years ago

Do you mean using the same short-read index but on subsets of the long-read data? Is there any risk of this process causing over-correction of the long reads?

Yes, exactly. In fmlrc (v1 and v2), all long reads are corrected independently, so you can split your long-read file into as many subfiles as you want and recombine them later. You should get identical answers to handing it the original file, simply because the reads are all corrected independently of each other.
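As a concrete sketch of that pattern (the splitting tool and file names here are just examples; the same idea works for fmlrc v1, and I'm assuming fmlrc2's usual `<comp_msbwt.npy> <long_reads.fa> <corrected_reads.fa>` positional arguments with -t for threads):

```bash
# Split the long reads into 4 chunks (any FASTA splitter works; seqkit shown).
seqkit split2 -p 4 long_reads.fa -O long_read_chunks

# Correct each chunk independently against the same short-read index;
# the chunks can run sequentially or on separate nodes.
for chunk in long_read_chunks/*.fa; do
    fmlrc2 -t 20 comp_msbwt.npy "$chunk" "corrected.$(basename "$chunk")"
done

# Recombine; order doesn't matter since each read is corrected independently.
cat corrected.*.fa > corrected_long_reads.fa
```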

I would definitely recommend at least reading through the Ratatosk paper to determine if it's right for you (FWIW, I trust the authors of the paper/tool and they would likely be willing to help debug any issues you encounter as well). There are trade-offs (i.e. Ratatosk is more costly from a compute perspective, but according to the paper, also more accurate), so again it's dependent on what's right for your specific experiment/pipeline.

holtjma commented 2 years ago

Closing due to inactivity, feel free to re-open if you have more questions!