matthdsm opened this issue 3 years ago
Hi Matthias,
For elPrep 4, we built a predictor for peak RAM use based on a set of benchmark runs. More specifically, we made such a predictor for WGS data in the elPrep filter mode. This gave us the following equation for predicting RAM use from the input BAM size: Y = 15X + 32.
This means elPrep 4 requires about 32 GB of base memory plus 15 times the input BAM size (in GB) for the filter mode, in the case of WGS data. To estimate memory use for the sfm mode, you would need to look at the BAM size of the largest split file, which can vary between data sets.
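As a quick sketch, the predictor above can be turned into a small helper; the function name and the 8 GB example input are hypothetical, not part of elPrep:

```shell
#!/bin/sh
# Predicted peak RAM for elPrep 4 filter mode on WGS data:
# Y = 15X + 32, with X the input BAM size in GB.
# estimate_ram_gb is a hypothetical helper for illustration only.
estimate_ram_gb() {
    bam_gb=$1
    echo $(( 15 * bam_gb + 32 ))
}

estimate_ram_gb 8   # hypothetical 8 GB BAM -> prints 152
```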
The numbers would look a bit different for WES data. We would also need to update the predictor for elPrep 5.
Does this help? Would it be useful to update a specific predictor for your use case?
Thanks! Charlotte
Hi Charlotte,
Thanks! I ran the numbers and we're getting quite different results. For an exome of about 8 GB we see a RAM usage of about 300 GB on average (3 tests, with 20, 40, and 80 threads). Anecdotally, the more threads we used, the lower the RAM usage was (about 30 GB difference between 20 and 80 threads).
An updated predictor would be most welcome!
cheers Matthias
NB, the command used was:

```shell
elprep filter \
    $1 \
    ${1%-sort.bam}.bam \
    --nr-of-threads 20 \
    --mark-duplicates \
    --mark-optical-duplicates ${1%-sort.bam}_duplicate_metrics.txt \
    --optical-duplicates-pixel-distance 2500 \
    --sorting-order coordinate \
    --haplotypecaller ${1%-sort.bam}.vcf.gz \
    --reference /references/Hsapiens/hg38/seq/hg38.elfasta \
    --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed \
    --log-path $PWD --timed
```
Hi Matthias,
I have made a preliminary predictor for elPrep 5 based on benchmarks for data sets we have at our lab: Y = 24X + 3. This, however, is quite far from the numbers you saw in your runs.
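For reference, plugging the ~8 GB exome reported earlier in the thread into this preliminary predictor shows how large the gap is; the helper name is hypothetical:

```shell
#!/bin/sh
# Preliminary elPrep 5 predictor: Y = 24X + 3 (X = input BAM size in GB).
# elprep5_ram_gb is a hypothetical helper for illustration only.
elprep5_ram_gb() {
    echo $(( 24 * $1 + 3 ))
}

elprep5_ram_gb 8   # prints 195, versus the ~300 GB observed on the 8 GB exome
```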
I have a couple of questions:
Thanks a lot!
Best, Charlotte
Hi Charlotte,
I'll keep you posted!
Matthias
On an unrelated note, while converting a dbSNP VCF to elsites I noticed the process took over the whole node (80 threads) and consumed about 320 GB of RAM. Perhaps some kind of warning should be in place so unsuspecting users don't crash their servers trying to prepare data. Alternatively, an argument to limit resource usage could be added.
Matthias
A quick test with BQSR on 80 threads reduces RAM usage by about 20 GB (270 GB total), so you were right about it having an effect on the requirements! Matthias
edit: removed off topic remark
@matthdsm I opened new issues for your two side notes. I hope you have been notified of my answers there.
Thanks, Pascal
Duly noted!
Hi Charlotte,
Are there any updates with regard to the RAM usage estimate?
Thanks M
Hi Matthias,
Our last e-mail exchange, about getting access to a data file, was on March 17th. As far as I know, there was never a reply?
Thanks! Charlotte
Right, I lost track of what had already been done. Let me get back to you!
M
Hi Charlotte,
To get back to this: which compression level do you use for your test input data? That might be the reason your formula doesn't work for our data. Since the input BAM is intermediate data, we only use fast compression (e.g. `samtools view -1`) to save time, which results in a bigger BAM file.
On a related note, which compression level do you use for the output BAMs? I noticed the output BAM is larger than the input, which usually isn't the case when the data is sorted. M
Hi,
We use the default gzip compression level, which corresponds to level 6. If I understand it correctly, samtools defaults to uncompressed. So that may indeed have an impact.
I am uncertain why your output bam is larger.
Charlotte
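To get a feel for how much the compression level alone can move file sizes (and hence any size-based predictor), here is a tiny sketch; gzip is used as a stand-in for BAM/BGZF compression, and the payload is synthetic:

```shell
#!/bin/sh
# Compress the same synthetic payload at level 1 (fast, as with samtools view -1)
# and at level 6 (the default level mentioned above), then compare sizes.
head -c 1000000 /dev/zero | tr '\0' 'A' > payload.txt
size_fast=$(gzip -1 -c payload.txt | wc -c)
size_default=$(gzip -6 -c payload.txt | wc -c)
echo "level 1: $size_fast bytes, level 6: $size_default bytes"
rm -f payload.txt
```

The fast-compressed file comes out larger for the same content, so two BAMs holding identical reads can have quite different sizes, which skews any formula keyed on file size.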
I ran some tests on our infrastructure and came up with Y = 34X + 20.
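Evaluating this local fit on the ~8 GB exome from earlier in the thread lines up with the observed numbers; the helper name is hypothetical:

```shell
#!/bin/sh
# Local fit from this thread: Y = 34X + 20 (X = input BAM size in GB).
# predict_ram_gb is a hypothetical helper for illustration only.
predict_ram_gb() {
    echo $(( 34 * $1 + 20 ))
}

predict_ram_gb 8   # prints 292, in line with the ~270-300 GB observed above
```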
Hi,
I'm trying to find a way to get a rough estimate of how much RAM I'll need to run elprep filter, based on the size of the input BAM.
Do you have any way of calculating this, e.g. when submitting a job to a cloud provider?
Thanks Matthias