vinferrer closed this 1 year ago
Basically I had to enlarge the memory limit from 16 GB to 20 GB per job. This may be unexpected behavior that you want to review.
I managed to record the whole cluster memory usage:

[cluster memory usage graph]

As you can see, the problem lies in the batch jobs.
I suppose this is a bit out of the scope of this PR, but I just wanted to let you know.
Hi @vinferrer, can I just confirm with you quickly what cluster memory here refers to? Could this be your storage in your file system and not the memory used during computation?
Hey @TomMaullin. I am pretty sure this is the total cluster RAM. Let me elaborate: when I first executed your pipeline, I realised `MAXMEM: 2**32` was insufficient, because I was using that same parameter as the memory limit per worker, and the pipeline started crashing. For that reason I decided to increase the worker memory limit independently of the `MAXMEM` parameter. That's how I got the `21474836480`, which is a 20 GB limit per worker. We can agree this is a lot of memory per worker, and it is clearly behavior that wasn't expected. That's why I wanted to monitor the actual amount of RAM used and check it against the graph.
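To make the setup concrete, this is roughly what the change looks like (an illustrative sketch assuming a `dask_jobqueue`-style cluster; the names and values here are mine, not the exact BLMM code):

```python
from dask_jobqueue import SGECluster

# Analysis-level memory budget (bytes) used to size the batches.
MAXMEM = 2**32  # 4 GB

# Per-worker limit raised to 20 GB (21474836480 bytes), set
# independently of MAXMEM so workers stop crashing mid-batch.
cluster = SGECluster(cores=1, memory="20GB")
cluster.scale(jobs=10)  # one worker process per batch job
```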
If you prefer, here I have a graph using only one worker in the whole pipeline:

[single-worker memory usage graph]

As you can see, there are systematic peaks of 16 GB of RAM at the end of each job. I suppose this is probably related to the `nib.save` process.
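A quick back-of-the-envelope check makes a peak of that size plausible (the dimensions below are hypothetical, just to illustrate the order of magnitude):

```python
import numpy as np

# Hypothetical 4D image handed to nib.save; shape and dtype are
# illustrative, not the actual BLMM output.
shape = (91, 109, 91, 2000)
nbytes = np.prod(shape) * np.dtype(np.float64).itemsize
print(f"in-memory array: {nbytes / 2**30:.1f} GiB")  # ~13.4 GiB
```

If saving additionally casts or scales the data to the on-disk dtype, a transient second copy could push the peak well past the raw array size, which would match the end-of-job spikes.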
This PR has now been adapted and merged into the updated codebase.
Hello Tom,
I managed to launch BLMM with your version on the cluster, and it seems to be working. However, I did notice a huge amount of memory usage, more than expected given the `MAXMEM` parameter.
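For what it's worth, a lightweight way to confirm this would be logging peak resident memory at the end of each batch job and comparing it against `MAXMEM` (a sketch, not the actual BLMM code):

```python
import resource

def log_peak_rss(label, maxmem=2**32):
    """Print this process's peak resident set size versus the MAXMEM budget."""
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
    peak_bytes = peak_kb * 1024
    print(f"{label}: peak RSS {peak_bytes / 2**30:.2f} GiB "
          f"(MAXMEM budget: {maxmem / 2**30:.0f} GiB)")

# e.g. call at the end of a batch job:
log_peak_rss("batch job")
```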