shnizzedy opened 3 years ago
I don't know enough to know how much memory is reasonable, but I hear:

> The run only has one HBN subject (1 anat, 1 func), the anat template is 1 mm, the anat image is 1 mm iso. The func has 750 volumes. But it still makes no sense that it can use 45 GB of memory!
There are some clear memory spikes for ANTs when it runs on Brainlife (this is one subject with `fmriprep-options`):
I'm not sure I quite understand `resource_monitor.json` (particularly the `time` and `cpus` fields), but if the indices of each key correspond to one another, here are the maximum entries for `rss_GiB`, `vms_GiB`, and `cpus` for the run above on Brainlife:
```json
[{
  "name": "resting_preproc_sub-A00013809_ses-DS2.nuisance_regressor_0_0.aCompCor_cosine_filter",
  "time": 1605723308.305644,
  "rss_GiB": 36.45994949316406,
  "cpus": 103.0,
  "vms_GiB": 40.09559631347656,
  "interface": "Function",
  "params": "_scan_rest_acq-645__selector_WM-2mm-M_CSF-2mm-M_tC-5PCT2-PC5_aC-CSF+WM-2mm-PC5_G-M_M-SDB_P-2_BP-B0.01-T0.1",
  "mapnode": 0
}, {
  "name": "resting_preproc_sub-A00013809_ses-DS2.nuisance_regressor_0_0.aCompCor_cosine_filter",
  "time": 1605723285.300852,
  "rss_GiB": 28.016590118164064,
  "cpus": 832.8,
  "vms_GiB": 40.09559631347656,
  "interface": "Function",
  "params": "_scan_rest_acq-645__selector_WM-2mm-M_CSF-2mm-M_tC-5PCT2-PC5_aC-CSF+WM-2mm-PC5_G-M_M-SDB_P-2_BP-B0.01-T0.1",
  "mapnode": 0
}, {
  "name": "resting_preproc_sub-A00013809_ses-DS2.nuisance_regressor_0_0.aCompCor_cosine_filter",
  "time": 1605723284.15847,
  "rss_GiB": 27.693824767578125,
  "cpus": 2742.9,
  "vms_GiB": 31.022430419921875,
  "interface": "Function",
  "params": "_scan_rest_acq-645__selector_WM-2mm-M_CSF-2mm-M_tC-5PCT2-PC5_aC-CSF+WM-2mm-PC5_G-M_M-SDB_P-2_BP-B0.01-T0.1",
  "mapnode": 0
}]
```
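For reference, here's a quick sketch of how one might pull those maxima out, assuming (as described above) that `resource_monitor.json` maps each field name to a list, with index *i* of every list describing the same sample; that layout is my reading of the file, not a documented schema:

```python
import json

# Assumed layout: {"name": [...], "time": [...], "rss_GiB": [...], ...},
# where index i of every list describes the same sample.
with open('resource_monitor.json') as monitor_file:
    columns = json.load(monitor_file)

# Rebuild row-wise records, then report the sample where each field peaks.
fields = list(columns)
rows = [dict(zip(fields, values))
        for values in zip(*(columns[field] for field in fields))]
for field in ('rss_GiB', 'vms_GiB', 'cpus'):
    peak = max(rows, key=lambda row: row[field])
    print(f"max {field}: {peak[field]} in {peak['name']}")
```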
Another clue: interacting with the graphs on Brainlife, I see it's actually `run.py` that's eating all the memory (that is, it's C-PAC and not a child process)!
with `run.py` | ![with `run.py`](https://user-images.githubusercontent.com/5974438/101075226-b9d02500-356f-11eb-8be4-222cfc836bcf.png) |
---|---|
without `run.py` | ![without `run.py`](https://user-images.githubusercontent.com/5974438/101075250-c48aba00-356f-11eb-8d52-47153daa88dd.png) |
Those brief spikes are huge!!! Do you also mean that it's probably `run.py`, not ANTs?
Not sure. I think it's probably how C-PAC is allocating memory for ANTs.
After @ccraddock explained to me that

- the `mem_gb` parameter of a Nipype Node operates on a newspaper-dispenser-style honor system, trusting processes to take no more memory than estimated by the argument given to that parameter,
- `log_nodes_cb` includes both the estimated memory usage (given to that parameter) and the observed runtime memory usage for each node, and
- there's a `draw_gantt_chart` utility that creates interactive visualizations like this:

I started to set `mem_gb` where the observed `runtime_memory_gb` was more than double the estimate (in all cases the estimate had been left at the default 0.2 GB) and saw a marked improvement.
There are still some spikes, but they're much less dramatic. I'm iterating now: setting estimates, doing fresh runs, and then setting more estimates based on the memory usage logged in each run.
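For anyone trying this outside of C-PAC, here's a minimal sketch of the callback-log / gantt-chart setup described above, using a toy workflow in place of a C-PAC pipeline (the filename, node, and resource numbers are all illustrative):

```python
import logging

from nipype import config
from nipype.interfaces.utility import Function
from nipype.pipeline.engine import Node, Workflow
from nipype.utils.draw_gantt_chart import generate_gantt_chart
from nipype.utils.profiler import log_nodes_cb

# The resource monitor is what provides the *observed* runtime numbers.
config.enable_resource_monitor()

# Send the callback's per-node JSON records to a file.
callback_log = 'run_stats.log'
logger = logging.getLogger('callback')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.FileHandler(callback_log))


def busy(n=10 ** 7):
    """Stand-in task so the toy workflow has something to measure."""
    return sum(range(n))


# Toy workflow; mem_gb / n_procs are the honor-system estimates discussed above.
wf = Workflow(name='toy_wf', base_dir='.')
wf.add_nodes([Node(Function(function=busy), name='busy_node',
                   mem_gb=0.2, n_procs=1)])

# log_nodes_cb records estimated vs. observed memory and threads per node...
wf.run(plugin='MultiProc',
       plugin_args={'n_procs': 2, 'memory_gb': 4,
                    'status_callback': log_nodes_cb})

# ...and draw_gantt_chart turns that log into an interactive HTML report.
generate_gantt_chart(callback_log, cores=2)
```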
This is looking good. I think that there is an issue with the number of threads being estimated by the callback, or the gantt chart creation script is pulling in the wrong numbers. Some of the nodes are reporting using 210 threads!
As for your earlier comment on `run.py`, I think that since this is the parent process, it 'owns' all of the memory used by the child threads. So the amount of memory attributed to it should be the cumulative amount of memory used by all of the nodes that are currently executing.
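A rough way to see that roll-up with `psutil` (a sketch; the helper name is mine, and it sums over child *processes*, since memory used by threads already shows up in the parent's RSS):

```python
import psutil


def tree_rss_gib(pid: int) -> float:
    """Resident memory of a process plus all of its descendants, in GiB.

    This is roughly what gets attributed to the parent (e.g. run.py) when
    a monitor rolls child usage up into the top-level process.
    """
    parent = psutil.Process(pid)
    procs = [parent] + parent.children(recursive=True)
    return sum(proc.memory_info().rss for proc in procs) / 1024 ** 3
```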
I thought maybe `runtime_threads` was counting something different than I expected.
I see the profile uses `cpu_percent` for `runtime_threads`, which returns a percentage of a CPU, so I think something like `math.ceil(cpu_percent / 100)` would be an estimate of the number of threads, but there's some disconnected code that looks like it collects the actual number of threads used (as opposed to a percentage of 1 CPU).
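A minimal sketch of that conversion, assuming the percentage comes from `psutil`'s per-process `cpu_percent()` (the helper name is mine):

```python
import math

import psutil


def busy_thread_estimate(pid: int) -> int:
    """Estimate concurrently busy threads from CPU utilization.

    psutil's Process.cpu_percent() can exceed 100 when a process keeps more
    than one core busy, so dividing by 100 and rounding up approximates the
    number of busy threads. Process.num_threads() would instead report every
    thread the process has created, busy or not.
    """
    proc = psutil.Process(pid)
    cpu_percent = proc.cpu_percent(interval=1.0)  # e.g. 832.8 -> ~9 busy threads
    return max(1, math.ceil(cpu_percent / 100))
```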
I'll try updating the callback to get the actual number of threads, do a few C-PAC runs, and see how it looks.
Awesome @shnizzedy !!
This looks like it's working for memory. I'm going through now and adding `n_procs` similarly to control the number of threads, and adding to the developer docs as I go.
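For context, here's roughly what those per-node estimates look like on a Nipype `Node` wrapping ANTs registration; the node name and the numbers are illustrative, not C-PAC's actual values:

```python
from nipype.interfaces.ants import Registration
from nipype.pipeline.engine import Node

# mem_gb and n_procs are only scheduler hints (the honor system described
# above), not hard limits enforced on the process.
anat_reg = Node(Registration(), name='anat_mni_ants_register',
                mem_gb=3.0, n_procs=4)
anat_reg.inputs.num_threads = 4  # threads ANTs itself is told to use
```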
Describe the bug

ANTs uses more memory than the `mem_gb` limit given to C-PAC. This is particularly a problem on clusters that have hard memory limits.

Here's an example command.txt that uses too much memory (same thing, wrapped for ease of reading):

To Reproduce

Steps to reproduce the behavior:

1. Run C-PAC with a pipeline and/or preconfig that uses ANTs (e.g., `--preconfig fmriprep-options`), with `n_cpus` and `mem_gb` set
2. Observe `mem_gb` exceeded by ANTs

Expected behavior

ANTs uses no more than `mem_gb` memory at a time.

Versions
Additional context
These issues are almost certainly a result of this issue:
Possibly related: https://github.com/FCP-INDI/C-PAC/issues/1054, https://github.com/nipy/nipype/issues/2776
*My registration fails with an error: Memory errors* — ANTs wiki
There seems to be no specific memory limitations in the `antsRegistration` command apart from `--float`:

`antsJointLabelFusion.sh` relies on `sbatch`, `rev` and `cut` to limit memory:

Other leads: maybe use `LegacyMultiProc`, https://github.com/nipreps/fmriprep/issues/836, https://github.com/nipreps/fmriprep/pull/839, https://github.com/nipreps/fmriprep/pull/854, https://github.com/nipy/nipype/pull/2284 (`maxtasksperchild 1`? https://nipype.readthedocs.io/en/latest/api/generated/nipype.pipeline.plugins.legacymultiproc.html), https://github.com/nipy/nipype/issues/2548, https://github.com/nipy/nipype/pull/2773
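If the `LegacyMultiProc` lead pans out, the switch would look roughly like this; a sketch assuming the toy workflow `wf` and callback logger from the sketch earlier in this thread, and whether `maxtasksperchild` actually curbs the memory growth is exactly what the links above are investigating:

```python
from nipype.utils.profiler import log_nodes_cb

# LegacyMultiProc uses a multiprocessing.Pool under the hood, so workers can
# be recycled instead of accumulating memory across tasks.
wf.run(plugin='LegacyMultiProc',
       plugin_args={'n_procs': 2,
                    'memory_gb': 4,
                    'maxtasksperchild': 1,  # recycle each worker after one task
                    'status_callback': log_nodes_cb})
```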