dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Running out of Hard Drive Space: How to Predict Temp File Sizes? #457

Closed · alexkrohn closed this issue 2 years ago

alexkrohn commented 2 years ago

Hi there,

I was wondering whether there is a way to predict how much space an analysis will take. I'm trying to branch my analysis at step 3 to try different clustering thresholds on my data. The raw data are ~60 GB, but during step 3 the temporary alignment files balloon to over 500 GB, fill up my hard drive, and stop the analysis. I can run the various branches to output into different hard drives, each with >500 GB free, but that is somewhat annoying. I've noticed that lower clustering thresholds take up more space (probably due to over-splitting loci), but I don't have a good idea of how much space a dataset will need given a starting amount of raw data. Generally I'm working with pop-gen data from a single species.
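For context, here's roughly how the branching looks with the ipyrad Python API (just a sketch; the assembly name, JSON path, and threshold values are placeholders for my actual setup):

```python
import ipyrad as ip

# Load an existing assembly that has finished steps 1-2
# ("mydata" and the project path are placeholders).
data = ip.load_json("analysis-ipyrad/mydata.json")

# Branch once per clustering threshold and run step 3 on each branch.
for clust in [0.85, 0.90, 0.95]:
    branch = data.branch(f"mydata_c{int(clust * 100)}")
    branch.set_params("clust_threshold", clust)
    branch.run("3")
```

Each branch writes its own step 3 temp files, which is why the disk usage stacks up so quickly.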

Is there an easy way to guesstimate how much disk space an ipyrad run will need, so I can be a bit more proactive about freeing up space?

Thanks for your help,

Alex

isaacovercast commented 2 years ago

Hi Alex,

TL;DR: There isn't really a great way to predict how much disk space an assembly will take. Two assemblies with the same amount of raw data can end up consuming vastly different amounts of total disk space. Keep reading for more details.

You have identified a couple of the key factors in predicting temporary file usage (raw data size and clustering threshold). A few of the other variables at play are the number of samples, genome size, cutter frequency, and sequencing error rate (and maybe a couple more I'm forgetting at the moment). All of these factors interact in complex ways, which makes it really difficult to come up with a good formula for converting them into guesstimates of temp file usage. I think 10x the raw data size is the right order of magnitude for how much disk space an assembly will consume, but beyond that it gets hard to predict with precision.
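If you want to script a quick sanity check before launching a run, here's a minimal sketch based on that ~10x rule of thumb (the raw-data path and the multiplier are assumptions on my part, not anything ipyrad computes):

```python
import shutil
from pathlib import Path

RAW_DIR = Path("raws/")  # placeholder: directory holding your raw fastq files
MULTIPLIER = 10          # rough order-of-magnitude factor, not a guarantee

# Total size of the raw data on disk.
raw_bytes = sum(f.stat().st_size for f in RAW_DIR.rglob("*") if f.is_file())
needed = raw_bytes * MULTIPLIER
free = shutil.disk_usage(RAW_DIR).free

print(f"raw data:  {raw_bytes / 1e9:.1f} GB")
print(f"estimated: {needed / 1e9:.1f} GB needed (at {MULTIPLIER}x)")
print(f"free:      {free / 1e9:.1f} GB on this filesystem")
if free < needed:
    print("warning: likely not enough free space for this assembly")
```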

That all being said, given the same raw data, if you run a couple of different assemblies with different clust_threshold values and the step 3 directory is around 500 GB each time, then you can probably expect it not to change too much with further tweaking (i.e., it's not going to get 10x bigger or smaller).
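One way to check that empirically between branches is to just measure the step 3 output directory. A minimal sketch, assuming the usual `<name>_clust_<threshold>` directory layout inside the project dir (adjust the path to match your own run):

```python
from pathlib import Path

# Placeholder path: point this at the step 3 clustering directory
# that ipyrad created for your branch.
clust_dir = Path("analysis-ipyrad/mydata_clust_0.85")

total = sum(f.stat().st_size for f in clust_dir.rglob("*") if f.is_file())
print(f"{clust_dir}: {total / 1e9:.1f} GB")
```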

This isn't exactly an ipyrad issue, so I'm going to close it for now, but feel free to email me directly if you have more questions, or jump on the gitter channel if you want to ask others for their opinions: https://gitter.im/dereneaton/ipyrad

Hope that helps! -isaac

alexkrohn commented 2 years ago

That does help. Thank you very much!
