Open gdevenyi opened 2 years ago
Is VOLUME_CACHE_THRESHOLD set?
The minc-toolkit default environment sets it as:
VOLUME_CACHE_THRESHOLD=-1
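One way to answer this from the shell the job runs in, as a minimal sketch. The helper name `describe_cache_threshold` is made up for illustration; the snippet only reports whether the variable is set and what it holds, and says nothing about how the MINC libraries interpret the value.

```shell
# Report whether VOLUME_CACHE_THRESHOLD is set in the current environment.
# ${VAR+x} expands to "x" only when VAR is set (even if set to an empty string).
describe_cache_threshold() {
    if [ -n "${VOLUME_CACHE_THRESHOLD+x}" ]; then
        echo "VOLUME_CACHE_THRESHOLD=${VOLUME_CACHE_THRESHOLD}"
    else
        echo "VOLUME_CACHE_THRESHOLD is not set"
    fi
}

describe_cache_threshold
```

Running it after `module load cobralab` inside the job script shows what the R process would inherit, which may differ from an interactive login shell.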
Hmm, quite possible it is a memory leak in the C bindings or similar.
Is this memory measurement the whole machine or your R process? If the latter, do you know if the dip at 3.30pm corresponds to the beginning of writing files?
Do I have access to your modules/files on Niagara? If so, I could try to debug by running under Valgrind, but I'm pretty unfamiliar with RMINC internals, so it might be challenging.
Is this memory measurement the whole machine or your R process?
This is the Niagara readout from the Slurm whole-machine statistics.
If the latter, do you know if the dip at 3.30pm corresponds to the beginning of writing files?
That's a really good question, that dip is pretty big and all the allocations should be done at that point. I'm not sure.
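Since the readout is whole-machine, it may help to log the R process's own resident memory alongside it. A rough sketch, assuming a Linux node where `/proc/<pid>/status` is readable; `rss_kib` and `log_rss` are names invented here, not existing tools.

```shell
# Print the resident set size (VmRSS, in kB) of one process, read from /proc.
# Prints nothing if the process does not exist or reports no VmRSS.
rss_kib() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status" 2>/dev/null
}

# Sample a process's RSS once per interval until it can no longer be read.
# Arguments: PID, then the sampling interval in seconds (default 10).
log_rss() {
    pid=$1
    interval=${2:-10}
    while rss=$(rss_kib "$pid") && [ -n "$rss" ]; do
        echo "$(date +%H:%M:%S) $rss"
        sleep "$interval"
    done
}

# Example: report this shell's own RSS once.
rss_kib "$$"
```

Started in the background against the PID of the R session, `log_rss` gives timestamps that can be lined up with the file-writing loop to see whether the growth (and the dip) belongs to R itself.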
Do I have access to your modules/files on Niagara?
Yes
export QUARANTINE_PATH=/project/m/mchakrav/quarantine
module use ${QUARANTINE_PATH}/modules
module load cobralab
For now, we're addressing this by randomizing the list of files to write out and repeating the job so we'll eventually get them all.
Just as a quick note -- as a better workaround than randomizing, you could probably run mincWriteVolume from short-lived subprocesses, e.g. using batchMap from batchtools with a local multiprocessing backend.
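The same idea can be sketched from the shell instead of batchtools (a swapped-in technique, not the suggestion itself): hand the file list to a fresh process a few files at a time, so whatever leaks while writing one chunk is returned to the OS when that process exits. `run_in_chunks`, `write_chunk.R` and `files_to_write.txt` are hypothetical names; the R script would have to be written to accept output filenames as arguments.

```shell
# Run a command once per chunk of items read from stdin, each time in a
# fresh process, so memory leaked while handling one chunk is released
# when that process exits.
# Arguments: chunk size, then the command (and its fixed arguments).
run_in_chunks() {
    chunk_size=$1
    shift
    # xargs -n starts a new instance of the command for every chunk_size items.
    xargs -n "$chunk_size" "$@"
}

# Hypothetical usage: write_chunk.R is NOT part of RMINC; it stands for a
# script that writes the volumes named on its command line.
# run_in_chunks 5 Rscript write_chunk.R < files_to_write.txt

# Demonstration with a stand-in command that reports what each process received.
printf '%s\n' a.mnc b.mnc c.mnc d.mnc e.mnc |
    run_in_chunks 2 sh -c 'echo "$# files: $*"' sh
```

Note that `xargs` splits its input on whitespace, so this sketch assumes filenames without spaces.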
Thanks. Is there any chance you could give me read access (via extended ACLs, say) to the data directory as well?
I have a special share for that, /scratch/m/mchakrav/share. You have read-write access there. Data is still copying; I suggest waiting ~30 minutes.
Are the Jacobians being copied as well?
Yes. Warning: the file paths are all absolute paths. The student will be talked to.
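Because the stored paths are absolute, anyone reading the copy on the share will probably need to remap them. A small sketch of prefix remapping in POSIX shell; `remap_prefix` is a made-up helper, and both `/old/data/root` and the file name are placeholders, not the real original location.

```shell
# Replace a leading directory prefix on a path; paths that do not start
# with the old prefix are printed unchanged.
# Arguments: path, old prefix, new prefix.
remap_prefix() {
    case $1 in
        "$2"/*) printf '%s\n' "$3${1#"$2"}" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

# /old/data/root and sub01/jacobian.mnc are placeholders for illustration.
remap_prefix /old/data/root/sub01/jacobian.mnc /old/data/root /scratch/m/mchakrav/share
```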
We're running the following code on Niagara:
And seeing the following memory performance:
During this time, we loop through and get ~29 files saved, before the system runs out of memory and kills R.
We're not creating any new memory-holding objects as far as I understand, but R's memory consumption rises and eventually fills the node.