FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

higher than expected memory use #605

Closed · BenGSt closed this issue 1 year ago

BenGSt commented 1 year ago

Dear Felix,

I'm trying to run human WGBS samples (70 bp paired-end sequencing on an Illumina HiSeq 2500, 200M-300M reads) against the hg38 analysis set reference genome (downloaded from UCSC) using Bismark with Bowtie 2 (I built the Bowtie 2 indices locally). The samples are first trimmed with Trim Galore using its default settings. The Bismark jobs (a single instance on 1 core) are using 20-40 GB of memory, and in a few cases more than 40 GB.
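For reference, the commands I'm running are roughly the following (a sketch; file and directory names are placeholders):

```bash
# Roughly the pipeline described above (placeholder file names).
trim_galore --paired sample_R1.fastq.gz sample_R2.fastq.gz

# Trim Galore's default paired-end output names are *_val_1.fq.gz / *_val_2.fq.gz
bismark --genome /path/to/hg38_bismark/ \
    -1 sample_R1_val_1.fq.gz \
    -2 sample_R2_val_2.fq.gz
```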

Is this normal? I was expecting ~16 GB, as mentioned in the user guide.

Thanks! Ben

FelixKrueger commented 1 year ago

The <16 GB quoted in the manual was probably still based on the GRCh37 genome (hg19), using the chromosomal sequences only. I have seen before that memory usage can be substantially higher than that if you include haplotype and patch regions. For the Ensembl genome (download link here), you should not be using the toplevel genome:

---------
TOPLEVEL
---------
These files contain all sequence regions flagged as toplevel in an Ensembl
schema. This includes chromosomes, regions not assembled into chromosomes and
N padded haplotype/patch regions.

but rather the primary assembly:

-----------------
PRIMARY ASSEMBLY
-----------------
Primary assembly contains all toplevel sequence regions excluding haplotypes
and patches. This file is best used for performing sequence similarity searches
where patch and haplotype sequences would confuse analysis. If the primary
assembly file is not present, that indicates that there are no haplotype/patch
regions, and the 'toplevel' file is equivalent.

I am not sure whether UCSC offers something similar. Otherwise, the number of reads per sample should not affect memory consumption, but I believe the read length and the paired-end nature of your samples will have some influence.
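If UCSC does not offer a ready-made equivalent, one option would be to restrict the analysis set to the main chromosomes yourself and rebuild the indices. A rough, untested sketch (file names are placeholders, and it assumes samtools is available):

```bash
# Keep only chr1-chr22, chrX and chrY from the UCSC analysis set FASTA,
# then rebuild the Bismark/Bowtie 2 indices from the new folder.
mkdir hg38_main_chroms
samtools faidx hg38.analysisSet.fa                       # create the .fai index
samtools faidx hg38.analysisSet.fa chr{1..22} chrX chrY \
    > hg38_main_chroms/hg38_main_chroms.fa
bismark_genome_preparation hg38_main_chroms/
```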

BenGSt commented 1 year ago

Much appreciated!

BenGSt commented 1 year ago

Dear Felix, I think this is worth a follow-up.

I deleted all reference sequences except for chr1-chr22, chrX and chrY. I'm not sure if this is equivalent to the Ensembl primary assembly you suggested, but it had little effect. Splitting the input files and running trimming and alignment on each chunk separately brought memory usage down very significantly (~12 GB per chunk of 25M reads, 70 bp PE), so the number of reads per sample definitely affects memory usage. The chunking step is sketched below.
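In case it helps anyone else, the chunking was essentially the following (a sketch; file names are placeholders, and 25M reads x 4 FASTQ lines = 100M lines per chunk):

```bash
# Split each mate file into chunks of 25M reads (4 FASTQ lines per read).
zcat sample_R1.fastq.gz | split -l 100000000 - sample_R1.chunk_
zcat sample_R2.fastq.gz | split -l 100000000 - sample_R2.chunk_
# The R1/R2 chunks stay paired because both files list reads in the same order.
# Each chunk pair then goes through trim_galore and bismark as usual.
```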

Currently, I'm having difficulty with deduplication. For one sample with 500M reads, running "deduplicate_bismark --multiple" ends up using more than 40 GB of memory. Is the cluster I'm using, with its 40 GB per-job memory limit, simply not fit to run Bismark? Is there some way of performing deduplication with less memory?

Respectfully and with appreciation for your hard work and kind support, Ben

FelixKrueger commented 1 year ago

Oh, I seem to have missed this issue, apologies. Regarding the deduplication: I am afraid all the read IDs are held in memory, so if you have a ton of reads it will indeed be quite memory-intensive. You could probably start rewriting deduplicate_bismark in a more memory-efficient programming language (like C?), or potentially look into different tools such as Picard (even though I am not entirely sure how well it handles bisulfite data). The easiest solution would probably be to find a bigger machine somewhere, or spin up an EC2 instance for a few hours and run it there. A sketch of the Picard route is below.
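If you do want to try Picard, a minimal sketch might look like this (untested on bisulfite data, so please sanity-check the results against deduplicate_bismark on a smaller sample first; file names are placeholders):

```bash
# Picard needs a coordinate-sorted BAM; Bismark writes reads in alignment order.
samtools sort -@ 4 -o sample_pe.sorted.bam sample_pe.bam

picard MarkDuplicates \
    I=sample_pe.sorted.bam \
    O=sample_pe.dedup.bam \
    M=dedup_metrics.txt \
    REMOVE_DUPLICATES=true

# Note: downstream Bismark tools (e.g. the methylation extractor) expect
# paired reads to be adjacent, so re-sort by name afterwards if needed:
samtools sort -n -@ 4 -o sample_pe.dedup.nsorted.bam sample_pe.dedup.bam
```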

BenGSt commented 1 year ago

Ok. Thanks!