TomMaullin / BLMM

This repository contains all code for the BLMM toolbox.

Dask update #64

Closed vinferrer closed 1 year ago

vinferrer commented 2 years ago

Hello Tom,

I managed to launch BLMM with your version on the cluster, and it seems to be working. However, I did notice a huge amount of memory usage, much more than expected from the `MAXMEM` parameter.

vinferrer commented 2 years ago

Basically I had to enlarge the memory request from 16 GB to 20 GB per job. This may be an unexpected behavior you want to review.
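For context, a minimal sketch of what bumping the per-job memory request might look like if the Dask workers are launched through `dask_jobqueue` on an SGE-style cluster. The cluster class, core count, and job count here are illustrative assumptions, not taken from the BLMM codebase:

```python
# Hypothetical sketch: requesting more memory per Dask batch job,
# independently of BLMM's MAXMEM setting. Cluster type, cores and
# job count are illustrative assumptions, not BLMM's actual defaults.
from dask_jobqueue import SGECluster
from dask.distributed import Client

cluster = SGECluster(
    cores=1,        # one worker process per batch job
    memory="20GB",  # per-job memory request (raised from 16GB)
)
cluster.scale(jobs=10)  # number of batch jobs; illustrative only
client = Client(cluster)
```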

vinferrer commented 2 years ago

I managed to record the whole cluster memory usage: [image: cluster memory usage graph]

As you can see the problem lies in the batch jobs

vinferrer commented 2 years ago

I suppose this is a bit out of the scope of this PR. But I just wanted to let you know

TomMaullin commented 2 years ago

> I managed to record the whole cluster memory usage: [image: cluster memory usage graph]
>
> As you can see the problem lies in the batch jobs

Hi @vinferrer, can I just confirm with you quickly what cluster memory here refers to? Could this be your storage in your file system and not the memory used during computation?

vinferrer commented 2 years ago

> [image: cluster memory usage graph] As you can see the problem lies in the batch jobs
>
> Hi @vinferrer, can I just confirm with you quickly what cluster memory here refers to? Could this be your storage in your file system and not the memory used during computation?

Hey @TomMaullin. I am pretty sure this is total cluster RAM memory. Let me elaborate: when I first executed your pipeline I realised `MAXMEM: 2**32` was insufficient, because I was using that same parameter as the memory limit per worker and the pipeline started crashing. For that reason I decided to increase the worker memory limit independently of the `MAXMEM` parameter. That's how I got 21474836480, which is a 20 GB limit per worker. We can agree this is a lot of memory per worker and clearly is a behavior that wasn't expected. That's why I wanted to monitor the actual amount of RAM used and check it against the graph.
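For concreteness, a quick check of the numbers quoted above (pure arithmetic, not BLMM code):

```python
# The MAXMEM value and the per-worker limit mentioned above, in bytes.
maxmem = 2 ** 32            # 4294967296 bytes = 4 GiB
worker_limit = 21474836480  # 20 * 2**30 bytes = 20 GiB

print(maxmem / 2 ** 30)        # 4.0  -> the cap implied by MAXMEM
print(worker_limit / 2 ** 30)  # 20.0 -> what each worker actually needed
```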

vinferrer commented 2 years ago

If you prefer, here I have a graph using only one worker for the whole pipeline: [image: single-worker memory usage graph]

As you can see, there are systematic peaks of 16 GB of RAM at the end of each job. I suppose this is probably related to the `nib.save` process.
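One way to check whether the end-of-job peak really coincides with the save step is to sample the process's resident memory around the call. A minimal sketch using `psutil` and `nibabel`; the array shape and output filename are made up for illustration and are not from the BLMM pipeline:

```python
# Minimal sketch: measure resident memory before and after writing a NIfTI
# image, to see whether the end-of-job RAM peak coincides with nib.save.
# The data shape and filename below are illustrative only.
import numpy as np
import nibabel as nib
import psutil

proc = psutil.Process()

def rss_gib():
    """Current resident set size of this process, in GiB."""
    return proc.memory_info().rss / 2 ** 30

data = np.zeros((91, 109, 91, 200), dtype=np.float64)  # example 4D volume
img = nib.Nifti1Image(data, affine=np.eye(4))

print(f"RSS before save: {rss_gib():.2f} GiB")
nib.save(img, "example_output.nii.gz")
print(f"RSS after save:  {rss_gib():.2f} GiB")
```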

TomMaullin commented 1 year ago

This PR has now been adapted and merged into the updated codebase.