FCP-INDI / C-PAC

Configurable Pipeline for the Analysis of Connectomes
https://fcp-indi.github.io/
GNU Lesser General Public License v3.0

⚡️ Make memory estimates data-dependent #1480

Open · shnizzedy opened this issue 3 years ago

shnizzedy commented 3 years ago

Related problem

The 'insufficient resources' error at the beginning of a run is an attempt to reduce out-of-memory crashes during runs by letting users adjust resources up front, but we need to do some tuning: right now the memory estimates are hard-coded and data-independent, based on common data sizes, so the estimates will be too big for small runs and too small for big runs until we make them data-dependent.

These nodes have estimates > 4 GB:

| node | estimate (GB) |
| --- | ---: |
| ALFF bandpass filter | 13 |
| compcor DetrendPC | 12.5 |
| apply ANTs warp | 10 |
| cosine filter | 8 |
| FLIRT resample | 8 |
| functional to standard composite transform | 8 |
| mask to EPI | 8 |
| reho_map | 6 |
| mean filter | 5 |
| VMHC apply_transform | 5 |
| spatial regression | 4.5 |

Originally posted by @shnizzedy in https://github.com/FCP-INDI/C-PAC/issues/1479#issuecomment-812185689

Related #1166, #1301, #1404, #1453

Proposed feature

Starting with the memory-hungriest nodes (those in the table above),

  1. determine how much memory per data unit (e.g. memory per TR at resolution)
  2. make that data unit available to the functions calling those nodes
  3. apply that memory estimate when calling the node, in place of the defaults above (see the sketch after this list)
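A minimal sketch of those three steps (my illustration, not C-PAC code): read the data unit from the input image header, turn it into an estimate, and pass it via nipype's `mem_gb` argument. The function name `bold_mem_gb`, the coefficients `base_gb` and `gb_per_tr`, and the file path are hypothetical placeholders; the real coefficients would come from step 1.

```python
import nibabel as nib
from nipype.pipeline import engine as pe
from nipype.interfaces import fsl

def bold_mem_gb(bold_path, base_gb=1.0, gb_per_tr=0.01):
    """Estimate a node's memory from the BOLD series it will process.

    `base_gb` and `gb_per_tr` are hypothetical per-node coefficients;
    the data unit here is the number of timepoints (TRs).
    """
    n_trs = nib.load(bold_path).shape[3]  # time dimension of the 4-D image
    return base_gb + gb_per_tr * n_trs

# Step 3: pass the data-dependent estimate instead of a hard-coded default.
bandpass = pe.Node(
    fsl.ImageMaths(),  # stand-in interface for illustration
    name='alff_bandpass',
    mem_gb=bold_mem_gb('sub-01_task-rest_bold.nii.gz'),  # hypothetical path
)
```

One caveat with this form: the input path has to be known when the workflow is built; the `mem_gb` property override discussed further down defers that lookup until the node's inputs are set.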

Additional context

Some of these estimates could potentially get complicated as the pipeline progresses through transforms and resamplings.
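One way the transform/resampling cases might be handled (a sketch under my own assumption that memory scales with the number of values in the output image, i.e. reference-grid voxels × input timepoints; the coefficients are placeholders, not measured values):

```python
import numpy as np
import nibabel as nib

def warped_bold_mem_gb(bold_path, reference_path,
                       base_gb=1.0, gb_per_billion_values=4.0):
    """Hypothetical estimate for a warp/resample node.

    Scales with the size of the *output* image: the number of voxels in the
    reference grid times the number of input timepoints.
    """
    n_trs = nib.load(bold_path).shape[3]
    out_voxels = int(np.prod(nib.load(reference_path).shape[:3]))
    return base_gb + gb_per_billion_values * (out_voxels * n_trs / 1e9)
```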

shnizzedy commented 3 years ago

MM suggested using [1200, 600, 300, 150, 75] TRs × a subject from [HCP, HBN, HNU, NYU test/retest] to generate the initial measurements from which to fit the estimation formula.
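A sketch of how such a fit could look, assuming one callback.log per run and the JSON-lines fields ('name', 'runtime_memory_gb') that nipype's `log_nodes_cb` writes; the function names here are my own, not C-PAC's:

```python
import json
import numpy as np

def peak_gb_per_node(callback_log_path):
    """Peak observed memory per node name from a nipype callback.log.

    Assumes each line is a JSON record with 'name' and 'runtime_memory_gb'
    fields; 'runtime_memory_gb' may be 'N/A' when profiling is disabled.
    """
    peaks = {}
    with open(callback_log_path) as log:
        for line in log:
            record = json.loads(line)
            gb = record.get('runtime_memory_gb')
            if isinstance(gb, (int, float)):
                name = record['name']
                peaks[name] = max(peaks.get(name, 0.0), gb)
    return peaks

def fit_tr_model(tr_counts, peak_gbs):
    """Fit peak GB ≈ intercept + slope * n_TRs across runs of one node."""
    slope, intercept = np.polyfit(tr_counts, peak_gbs, 1)
    return intercept, slope
```

Repeating the fit per node (and per dataset, for the spatial variation) would give the coefficients for the estimation formula.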

shnizzedy commented 3 years ago

The threading issue I'm currently working on comes down to making memory estimates data-dependent.

Before 1.8, we were mostly relying on the default per-node memory estimate of 0.2 GB; in 1.8 we raised the default to 2.0 GB and set some higher estimates based on observed memory usage. These estimates are still just flat estimates that don't allocate well for varying data shapes.

For example, a BOLD image with dimensions 90 × 104 × 72 × 300 uses about 12.5 GB for apply_warp_warp_ts_to_T1template (to a 256 × 320 × 320 T1 image). One with dimensions 90 × 104 × 72 × 1200 uses about 50 GB for the same node. Our current estimate for that node is 10 GB. An overrun of 2.5 GB isn't great, and one of 40 GB is rough, but with a powerful enough system and soft limits the run will still complete.

The issue is really a problem when multiple nodes that don't allocate enough memory try to run at the same time.

The particular pipeline configuration that raised this issue has a bunch of forks, and the data have several functional images per subject, e.g.:

    ├── anat
    │   └── sub-105923_ses-retest_T1w.nii.gz
    └── func
        ├── sub-105923_ses-retest_task-restLRbeltrvt_run-1_bold.json
        ├── sub-105923_ses-retest_task-restLRbeltrvt_run-1_bold.nii.gz
        ├── sub-105923_ses-retest_task-restLRbeltrvt_run-2_bold.json
        ├── sub-105923_ses-retest_task-restLRbeltrvt_run-2_bold.nii.gz
        ├── sub-105923_ses-retest_task-restLRpredrvt_run-1_bold.json
        ├── sub-105923_ses-retest_task-restLRpredrvt_run-1_bold.nii.gz
        ├── sub-105923_ses-retest_task-restLRpredrvt_run-2_bold.json
        ├── sub-105923_ses-retest_task-restLRpredrvt_run-2_bold.nii.gz
        ├── sub-105923_ses-retest_task-restLR_run-1_bold.json
        ├── sub-105923_ses-retest_task-restLR_run-1_bold.nii.gz
        ├── sub-105923_ses-retest_task-restLR_run-2_bold.json
        └── sub-105923_ses-retest_task-restLR_run-2_bold.nii.gz


I ran the subject above single-threaded with a single functional image 3 times (in full (1200 timepoints), truncated to 600 timepoints, and truncated to 300 timepoints) to collect the callback.logs and adjust the hungriest nodes' estimates based on the number of timepoints, just to get this particular set of runs going. As a follow-up I plan to

  • run the same configuration with images with other spatial resolutions
  • run other configurations

to better tune the memory estimation formulas.

In the back of my mind, I have a lingering concern that there's something screwy with the way nipype is allocating/monitoring threads beyond how the log is reporting the number of threads. But I know the memory overrun issue is real, so I'm starting there.

― me, in an email
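As a sanity check on the linear-in-timepoints assumption, the two apply_warp_warp_ts_to_T1template observations quoted above (about 12.5 GB at 300 TRs and about 50 GB at 1200 TRs, at those spatial resolutions) fall on a line through roughly the origin; the arithmetic below is mine, not from the issue:

```python
# Back-of-envelope fit from the two observations quoted in the email above.
slope = (50 - 12.5) / (1200 - 300)          # ≈ 0.042 GB per timepoint
intercept = 12.5 - slope * 300              # ≈ 0 GB
estimate_900_trs = intercept + slope * 900  # ≈ 37.5 GB predicted for a 900-TR run
```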

shnizzedy commented 3 years ago

Per @anibalsolon's suggestion, I'm trying to override Node's mem_gb @property to dynamically consider input files: https://github.com/FCP-INDI/C-PAC/commit/0626a669c2afb9a116e4777860e22d794c1610cd
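A minimal sketch of that property-override idea (not the implementation in the linked commit; the `_mem_x` attribute and its `(gb_per_tr, input_name)` layout are hypothetical stand-ins):

```python
import nibabel as nib
from nipype.pipeline import engine as pe

class Node(pe.Node):
    """Node whose memory estimate can scale with an input's time dimension."""

    @property
    def mem_gb(self):
        mem_gb = super().mem_gb  # the flat estimate given at construction
        mem_x = getattr(self, '_mem_x', None)  # hypothetical (gb_per_tr, input_name)
        if mem_x:
            gb_per_tr, input_name = mem_x
            in_file = getattr(self.inputs, input_name, None)
            if isinstance(in_file, str) and in_file.endswith(('.nii', '.nii.gz')):
                img = nib.load(in_file)
                if img.ndim == 4:  # only 4-D images have a time dimension
                    mem_gb += gb_per_tr * img.shape[3]
        return mem_gb
```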

For the first iteration, I'm only treating the time dimension as variable. Once that works, I'll refactor to include the x, y, and z dimensions and see if I can figure out a way to simplify the parameterization. Currently,

https://github.com/FCP-INDI/C-PAC/blob/0626a669c2afb9a116e4777860e22d794c1610cd/CPAC/pipeline/nipype_pipeline_engine/engine.py#L68-L79

like https://github.com/FCP-INDI/C-PAC/blob/0626a669c2afb9a116e4777860e22d794c1610cd/CPAC/generate_motion_statistics/generate_motion_statistics.py#L196-L199

for a memory estimate of 0.4 + 0.0033 * {number of timepoints in 'in_file'}.
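For the three test lengths above, that estimate works out as follows (my arithmetic, just to show how it scales):

```python
# Evaluate the example estimate for the 300-, 600-, and 1200-timepoint runs.
for n_trs in (300, 600, 1200):
    print(f'{n_trs:>4} TRs -> {0.4 + 0.0033 * n_trs:.2f} GB')
# 300 TRs -> 1.39 GB, 600 TRs -> 2.38 GB, 1200 TRs -> 4.36 GB
```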

It would of course be better to do this systematically for all of the memory-hungry nodes. For now, I just plugged in a couple of these and kicked off a couple of runs to test the proof of concept.