ANTsX / ANTs

Advanced Normalization Tools (ANTs)

antsMultivariateTemplateConstruction2 itk:MemoryAllocationError on big computer #1764

Closed: ArnimJenett closed this issue 1 month ago

ArnimJenett commented 4 months ago

Operating system and version

Ubuntu 22.04.4

CPU architecture

x86_64 (PC, Intel Mac, other Intel/AMD)

ANTs code version

2.5.1.post73-g3a788f8

ANTs installation type

Compiled from source

Summary of the problem

Hi all, I am trying to build a template by calling antsMultivariateTemplateConstruction2.sh with 12 files:

time ${SCRIPTS_PATH}/antsMultivariateTemplateConstruction2.sh -d 3 -b 1 -c 2 -j 60 -m CC -o TPS_ -q 100x100x70x20 -r 1 -z ${initTemplate} ${samples[@]} > >(tee -a stdout-build-template.txt >/dev/null) 2> >(tee -a stderr-build-template.txt >> stdout-build-template.txt)

... and after the first 5 files are read properly, I get a memory allocation error from ITK seven times:

terminate called after throwing an instance of 'itk::MemoryAllocationError'
  what():  /home/teforadmin/bin/ants/staging/include/ITK-5.4/itkImportImageContainer.hxx:179:
Failed to allocate memory for image.

... which I cannot explain.

I saw this problem popping up in other issues (#917, #934), but the explanations there do not fit my situation. The system I am using has 72 cores and 512 GB of RAM, and the files are not overly big. From PrintHeader I computed the following for one of the larger files (.nii.gz):

Checking properties of image: C1-220601ABa_2822d_5dpf_AJ-110-DR_1024_Nh_merge-cc-crp.head.lr.nii.gz
Image Dimensions: 514 348 269
Bit Depth: 16 bits
Memory needed: 96233136 bytes (93977 KB, 91 MB, .08 GB)

I experimented with ANTSPexec.sh, implementing a memory limit to be checked before submitting another job. However, neither free nor vmstat detects low memory while or before ITK throws the error.

Can somebody please help me understand why I am getting these errors - and how to overcome them?

Commands to reproduce the problem.

#!/bin/bash
<<README
test-MemoryAllocationError.sh
This script demonstrates the occurrence of an itk::MemoryAllocationError
when running antsMultivariateTemplateConstruction2.sh with 12 files.

Usage:
- Download the files from https://filesender.renater.fr/?s=download&token=6c3fc7b4-b96a-4511-aa6c-c6617d4a6f51
- Create fileList (absolute paths to the downloaded images, one per line)
- Call this script with fileList as the first and only parameter.
README

if [[ ! -f "$1" ]]; then
    echo "ERROR: you need to provide a file list (absolute paths). Exiting."
    exit 1
else
    fileList="$1"
fi

SCRIPTS_PATH="${ANTSPATH}"
templateFileNames=($(cat $fileList))

maxIndex=12

# link the first $maxIndex files into the working directory,
# naming them by sample id (the second "_"-delimited field)
for (( j=0; j<maxIndex; j++ )); do
    sid=$(echo ${templateFileNames[$j]} | cut -d "_" -f 2)
    ln ${templateFileNames[$j]} ./${sid}.nii.gz
done

samples=($(find $(pwd) -name "*.nii.gz"))
initTemplate=${samples[0]}

time "${SCRIPTS_PATH}/antsMultivariateTemplateConstruction2.sh" -d 3 -b 1 -c 2 -j 60 -m CC -o TPS_ -q 100x100x70x20 -r 1 -z ${initTemplate} ${samples[@]} > >(tee -a stdout-build-template.txt >/dev/null) 2> >(tee -a stderr-build-template.txt >> stdout-build-template.txt)

Output of the command with verbose output.

stdout-build-template.txt stderr-build-template.txt

Data to reproduce the problem

data downloadable from https://filesender.renater.fr/?s=download&token=6c3fc7b4-b96a-4511-aa6c-c6617d4a6f51

cookpa commented 4 months ago

The pexec options are confusingly written, something I need to address. Running -c 2 -j 60 means it will try to run up to 60 parallel processes. If you have 12 input images, it will run a maximum of 12 processes to do all the pairwise registrations. However, it is not wise enough to know that it should assign 60/12 = 5 cores to each registration.

The per-process number of threads is controlled by the environment variable ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS. If you want to use 60 cores total, I would do

export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=5

If ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS is undefined, each registration might attempt to use all the CPUs on the system, which could increase the memory demands.

From your logs, I suspect it's failing when calling N4BiasFieldCorrection. Could you try setting the number of threads as above, and then running with -n 0?
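To make that concrete, here is a minimal sketch of the combined suggestion, reusing the ANTSPATH, initTemplate and samples variables from the script above (this exact command does not appear elsewhere in the thread):

# Cap each registration at 5 threads so 12 parallel jobs use ~60 cores total,
# and skip the N4BiasFieldCorrection step with -n 0.
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=5

# With 12 inputs, at most 12 jobs run in parallel anyway, so -j 12 is enough.
"${ANTSPATH}/antsMultivariateTemplateConstruction2.sh" \
    -d 3 -b 1 -c 2 -j 12 -n 0 \
    -m CC -o TPS_ -q 100x100x70x20 -r 1 \
    -z "${initTemplate}" "${samples[@]}"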

cookpa commented 4 months ago

I found another potential problem:

$ PrintHeader C1-201208Fc_968k_5dpf_AJ-110-DR_1024_Nh_merge-cc-crp.head.lr.nii.gz  
 Spacing [0.00245586, 0.00245586, 0.00245088]
 Origin [0, 0, 0]
 Direction 
1 0 0
0 1 0
0 0 1

 Size : 515 275  247

  Image Dimensions   : [515, 275, 247]
  Bounding Box       : {[0 0 0], [1.26477 0.67536 0.605367]}
  Voxel Spacing      : [0.00245586, 0.00245586, 0.00245088]
  Intensity Range    : [0, 4095]
  Mean Intensity     : 315.585
  Direction Cos Mtx. : 
1 0 0
0 1 0
0 0 1

  Voxel->RAS x-form  : 
  Image Metadata: 
    ITK_original_direction of unsupported type N3itk6MatrixIdLj3ELj3EEE
    ITK_original_spacing of unsupported type NSt3__16vectorIdNS_9allocatorIdEEEE
    ITK_sform_corrected = NO
    bitpix = 16
    cal_max = 4096
    cal_min = 0
    datatype = 4
    dim[0] = 4
    dim[1] = 515
    dim[2] = 275
    dim[3] = 247
    dim[4] = 1
    dim[5] = 1
    dim[6] = 0
    dim[7] = 0
    dim_info = 0
    intent_code = 0
    intent_p1 = 0
    intent_p2 = 0
    intent_p3 = 0
    nifti_type = 1
    pixdim[0] = 0
    pixdim[1] = 2.45586
    pixdim[2] = 2.45586
    pixdim[3] = 2.45088
    pixdim[4] = 1
    pixdim[5] = 0
    pixdim[6] = 0
    pixdim[7] = 0
    qfac = 0
    qform_code = 0
    qform_code_name = NIFTI_XFORM_UNKNOWN
    qoffset_x = 0
    qoffset_y = 0
    qoffset_z = 0
    qto_xyz of unsupported type N3itk6MatrixIfLj4ELj4EEE
    quatern_b = 0
    quatern_c = 0
    quatern_d = 0
    scl_inter = 0
    scl_slope = 1
    sform_code = 0
    sform_code_name = NIFTI_XFORM_UNKNOWN
    slice_code = 0
    slice_duration = 0
    slice_end = 246
    slice_start = 0
    srow_x = 0 0 0 0
    srow_y = 0 0 0 0
    srow_z = 0 0 0 0
    toffset = 0
    vox_offset = 352
    xyzt_units = 3

I see here that your xyzt_units=3, indicating microns. The first images have xyzt_units=0, which will default to being read as mm. The default parameters for N4 are set up for a human-size brain.

I would advise setting all the headers consistently such that the pixel dimensions are read as 2.45 mm.
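One way to do that, sketched here as an illustration rather than a command from this thread, is to flip only the xyzt_units byte in each NIfTI-1 header so that the existing pixdim of ~2.45 is read as mm. It assumes python3 is available, the file names are placeholders, and the result should be checked with PrintHeader:

# Rewrite xyzt_units from microns (3) to mm (2), leaving every other header
# field untouched, and write *_mm.nii.gz copies next to the originals.
for f in *.nii.gz; do
  python3 - "$f" <<'EOF'
import gzip
import sys

path = sys.argv[1]
with gzip.open(path, 'rb') as fh:
    data = bytearray(fh.read())
# In the NIfTI-1 header the single xyzt_units byte sits at offset 123;
# 2 = NIFTI_UNITS_MM (3 would be microns).
data[123] = 2
with gzip.open(path.replace('.nii.gz', '_mm.nii.gz'), 'wb') as fh:
    fh.write(bytes(data))
EOF
done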

ArnimJenett commented 4 months ago

Thank you, @cookpa, for your quick and thorough explanation.

ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS was already set to 6, but I set it to 5 anyway and ran the pexec with -n 0 as suggested.

So far it appears to work as it should (still running), even before fixing the images, which I am in the process of doing right now for consistency's sake.

Thanks a lot!

cookpa commented 4 months ago

Great, I'm glad it's working.

Internally, ITK always works in mm coordinates, so images in other units should be converted at run time. In reality, this was not always done consistently (e.g., I made some fixes here), so registrations that appear to work with different spatial units might break in the future when the ANTs ITK version is updated.

I think the approach taken in the first images, effectively scaling up the images so that the pixels are ~1 mm, is the right thing to do. There are various numerical issues with very small spacings that are quite hard to debug, and many small-animal imagers do this scaling to improve stability with ANTs and other tools.

I also noticed the input images were stored without qform or sform transforms. Reading such images relies on backwards compatibility with Analyze, and ITK will not write NIFTI images without a valid header transform; the output of registration will have a qform and sform. ITK will try to keep this consistent, but you need to be careful that your template does not get flipped with respect to the input images, and that the input images all have the same orientation. You can preview how ANTs interprets the image space by loading images into ITK-SNAP.
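A quick way to compare these fields across all inputs before template building is a loop like the following (a small sketch, not a command from the thread; it only greps the PrintHeader output shown above):

# Print the header transform codes, units and spacing for every input so an
# inconsistent or flipped-looking image stands out.
for f in *.nii.gz; do
  echo "== $f"
  PrintHeader "$f" | grep -E 'qform_code_name|sform_code_name|xyzt_units|Voxel Spacing'
done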

ArnimJenett commented 4 months ago

As it took me a while to read up on qform and sform, I started a test run with images consistently set to microns (xyzt_units = 3). This approach failed to produce the GenericAffine outputs, so I am now running the same process on images (at original resolution) modified to yield

...
Voxel Spacing      : [1.22793, 1.22793, 1.23]
...
qform_code_name = NIFTI_XFORM_UNKNOWN
...
sform_code_name = NIFTI_XFORM_UNKNOWN
...
xyzt_units = 2

... in PrintHeader, and I will report back here in the coming days.

I highly appreciate the help, thank you very much.

cookpa commented 4 months ago

Yes, sorry if I was unclear; I think setting qform_code_name = NIFTI_XFORM_UNKNOWN is the right way. Then the spacing will be interpreted as [1.22793, 1.22793, 1.23] mm, which will help the registration.

ArnimJenett commented 4 months ago

Unfortunately, I can only report that the template building based on 12 modified (units = mm) images didn't run through either, but for a new reason, I guess. From stderr-build-template.txt:

 57 Killed
 58 Transform file does not exist: ./TPS_2819d30GenericAffine.mat
 59 Killed
 60 Transform file does not exist: ./TPS_2856f00GenericAffine.mat
 61 Killed
 62 Transform file does not exist: ./TPS_968k60GenericAffine.mat
 63 Killed
 64 Transform file does not exist: ./TPS_2790i40GenericAffine.mat
 65 Killed
 66 Transform file does not exist: ./TPS_2822l50GenericAffine.mat
 67 Killed
 68 Transform file does not exist: ./TPS_2790k70GenericAffine.mat

… which means the following jobs failed:

job_0000_metriclog.txt
job_0003_metriclog.txt
job_0004_metriclog.txt
job_0005_metriclog.txt
job_0006_metriclog.txt
job_0007_metriclog.txt

I find that remarkable, as these are not the last jobs (nor the first).

From job_0003_metriclog.txt (randomly selected from the failed jobs):

448      Total field sigma (voxel space) = 0
449      Update field time sigma = 0
450      Total field time sigma  = 0
451      Number of time indices = 0
452      Number of time point samples = 0
453 Registration using 3 total stages.
454 
455 Stage 0
456   iterations = 1000x500x250x0
457   convergence threshold = 1e-06
458   convergence window size = 10
459   number of levels = 4
460   using the Mattes MI metric (number of bins = 32, weight = 1, use gradient filter = 0)
461   preprocessing:  winsorizing the image intensities
462   preprocessing:  histogram matching the images
463   Shrink factors (level 1 out of 4): [6, 6, 6]
464   Shrink factors (level 2 out of 4): [4, 4, 4]
465   Shrink factors (level 3 out of 4): [2, 2, 2]
466   Shrink factors (level 4 out of 4): [1, 1, 1]
467   smoothing sigmas per level: [4, 2, 1, 0]
468   regular sampling (percentage = 0.25)
469 
470 *** Running Euler3DTransform registration ***
471 
472 DIAGNOSTIC,Iteration,metricValue,convergenceValue,ITERATION_TIME_INDEX,SINCE_LAST
473  2DIAGNOSTIC,     1, -3.822195529938e-01, inf, 3.3437e+01, 3.3437e+01,
…
586  2DIAGNOSTIC,    31, -5.463204383850e-01, 1.026196628118e-06, 5.8184e+02, 4.9209e+00,
587 DIAGNOSTIC,Iteration,metricValue,convergenceValue,ITERATION_TIME_INDEX,SINCE_LAST
588  2DIAGNOSTIC,     1, -5.288691520691e-01, inf, 7.2418e+02, 1.4234e+02,
589   Elapsed time (stage 0): 8.4275e+02
590 
591 
592 Stage 1
593   iterations = 1000x500x250x0
594   convergence threshold = 1.0000e-06
595   convergence window size = 10
596   number of levels = 4
597   using the Mattes MI metric (number of bins = 32, weight = 1.0000e+00, use gradient filter = 0)
598   preprocessing:  winsorizing the image intensities
599   preprocessing:  histogram matching the images
600   Shrink factors (level 1 out of 4): [6, 6, 6]
601   Shrink factors (level 2 out of 4): [4, 4, 4]
602   Shrink factors (level 3 out of 4): [2, 2, 2]
603   Shrink factors (level 4 out of 4): [1, 1, 1]
604   smoothing sigmas per level: [4, 2, 1, 0]
605   regular sampling (percentage = 2.5000e-01)
606 
607 *** Running AffineTransform registration ***
608 
609 DIAGNOSTIC,Iteration,metricValue,convergenceValue,ITERATION_TIME_INDEX,SINCE_LAST
610  2DIAGNOSTIC,     1, -6.035985946655e-01, inf, 1.7990e+01, 1.7990e+01
…
696  2DIAGNOSTIC,    32, -5.574958324432e-01, 1.005075887406e-06, 5.1666e+02, 9.2150e+00,
697 DIAGNOSTIC,Iteration,metricValue,convergenceValue,ITERATION_TIME_INDEX,SINCE_LAST
698  2DIAGNOSTIC,     1, -5.373499393463e-01, inf, 7.5350e+02, 2.3684e+02,
699   Elapsed time (stage 1): 8.1839e+02
700 
701 
702 Stage 2
703   iterations = 100x100x70x20
704   convergence threshold = 1.0000e-09
705   convergence window size = 10
706   number of levels = 4
707   using the CC metric (radius = 4, weight = 1.0000e+00, use gradient filter = 0)
708   preprocessing:  winsorizing the image intensities
709   preprocessing:  histogram matching the images
710   Shrink factors (level 1 out of 4): [6, 6, 6]
711   Shrink factors (level 2 out of 4): [4, 4, 4]
712   Shrink factors (level 3 out of 4): [2, 2, 2]
713   Shrink factors (level 4 out of 4): [1, 1, 1]
714   smoothing sigmas per level: [3, 2, 1, 0]
715   Using default NONE metricSamplingStrategy
716 
717 *** Running SyN registration (varianceForUpdateField = 3.0000e+00, varianceForTotalField = 0.0000e+00) ***
…
940  1DIAGNOSTIC,    70, -7.165402173996e-01, 3.239360012230e-05, 4.6647e+04, 5.3133e+02,
941 Using single precision for computations.
942 Input scalar image: /CACHE/tps/labdata/projects/AJ-110-DR/5dpfTemplate/240703.experiment-2/2819e-template/2819d.nii.gz
943 Reference image: TPS_template0.nii.gz
944 Can't read initial transform ./TPS_2819d30GenericAffine.mat
[end of file]

It looks to me like the affine registration has already failed. Is that correct?

The stdout-build-template.txt keeps going but ends with:

509 Transform file does not exist: ./TPS_2819d30GenericAffine.mat
511 Transform file does not exist: ./TPS_2856f00GenericAffine.mat
513 Transform file does not exist: ./TPS_968k60GenericAffine.mat
515 Transform file does not exist: ./TPS_2790i40GenericAffine.mat
517 Transform file does not exist: ./TPS_2822l50GenericAffine.mat
519 Transform file does not exist: ./TPS_2790k70GenericAffine.mat
…
4479 Using single precision for computations.
4480 Input scalar image: /CACHE/tps/labdata/projects/AJ-110-DR/5dpfTemplate/240703.experiment-2/2819e-template/2819e.nii.gz
4481 Reference image: TPS_template0.nii.gz
4482 =============================================================================
4483 The composite transform comprises the following transforms (in order):
4484   1. ./rigid12_0GenericAffine.mat (type = AffineTransform)
4485 =============================================================================
4486 Default pixel value: 0
4487 Interpolation type: LinearInterpolateImageFunction
4488 Output warped image: ./rigid12_0_2819e.nii.gz
4489 --------------------------------------------------------------------------------------
4490  shapeupdatetotemplate()
4491 --------------------------------------------------------------------------------------
4492 $1: dim: 3
4493 $2: template: TPS_template0.nii.gz
4494 $3: templatename: TPS_template
4495 $4: outputname: TPS_
4496 warped files: TPS_template0*WarpedToTemplate.nii.gz
4497 $5: gradientstep: -0.25
4498 $6: whichtemplate: 0
4499 $7: statsmethod: 1
4500 $8: sharpenmethod: 1
[end of file]

I took the liberty of attaching the cited files, and I'd highly appreciate it if you could have a look at them and help me overcome this (new) hiccup (or should I rather start a new issue for this?). job_0003_metriclog.txt stderr-build-template.txt stdout-build-template.txt

gdevenyi commented 4 months ago

Can we see the end of dmesg? Is this OOM or something else?

gdevenyi commented 4 months ago

You may be interested in https://github.com/CoBrALab/optimized_antsMultivariateTemplateConstruction, which is a re-implementation with better error handling and better control over the parallelization of the processes.

cookpa commented 4 months ago

"Killed" with no explanation is generally memory, yes

gdevenyi commented 4 months ago

"Killed" with no explanation is generally memory, yes

On the head-node of my national HPC they kill processes that use more than a certain total amount of CPU time (you're not supposed to process on the head-node). This is another possibility.

cookpa commented 4 months ago

Yeah, wall time limits could also trigger that message. The usual tell for memory is that it consistently happens at the end of a stage, as the registration tries to allocate for the higher resolution.

ArnimJenett commented 4 months ago

Can we see the end of dmesg? Is this OOM or something else?

@gdevenyi, well caught! Thanks for chiming in. Here is the end of dmesg:

[4972973.615983] [1770887]     0 1770887     1876      433    53248       67             0 ANTSpexecTPS2.s
[4972973.615986] [1770899]     0 1770899      723      194    45056       27             0 sh
[4972973.615988] [1770901]     0 1770901 16890166 14569897 122376192   387745             0 antsRegistratio
[4972973.615990] [1770983]     0 1770983      723      190    45056       27             0 sh
[4972973.615992] [1770985]     0 1770985 17896565 15858742 129691648    13780             0 antsRegistratio
[4972973.615994] [1771067]     0 1771067      723      213    45056       27             0 sh
[4972973.615996] [1771069]     0 1771069 18222528 15967095 133554176   384409             0 antsRegistratio
[4972973.615998] [1771151]     0 1771151      723      193    49152       28             0 sh
[4972973.616000] [1771153]     0 1771153 15978290 12905419 108384256   304609             0 antsRegistratio
[4972973.616002] [1771235]     0 1771235      723      213    49152       27             0 sh
[4972973.616004] [1771237]     0 1771237 18015997 16253452 134443008   214095             0 antsRegistratio
[4972973.616007] [1771487]     0 1771487      723      183    45056       28             0 sh
[4972973.616008] [1771489]     0 1771489 17949291 15797375 132964352   482795             0 antsRegistratio
[4972973.616010] [1771739]     0 1771739      723      190    45056       27             0 sh
[4972973.616012] [1771741]     0 1771741 17470818 15722193 129409024   114326             0 antsRegistratio
[4972973.616014] [1771823]     0 1771823      723      189    45056       27             0 sh
[4972973.616016] [1771825]     0 1771825 15462806 13716113 112566272    16369             0 antsRegistratio
[4972973.616018] [1771907]     0 1771907      723      187    45056       23             0 sh
[4972973.616020] [1771909]     0 1771909 11940304 10241602 85774336   150525             0 antsRegistratio
[4972973.616023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-1742.scope,task=antsRegistratio,pid=1771237,uid=0
[4972973.616138] Out of memory: Killed process 1771237 (antsRegistratio) total-vm:72063988kB, anon-rss:65013808kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:131292kB oom_score_adj:0                                                                   
[4972980.251097] oom_reaper: reaped process 1771237 (antsRegistratio), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

However, that is not the only appearance of oom in dmesg. I extracted them ( dmesg |grep oom |tee 240705.oom.dmesg.txt ) and attached the resulting file here. For context, I also attached a complete dump of dmesg. 240705.dmesg.txt 240705.oom.dmesg.txt

Is there a clever way (which does not involve downsampling) to overcome these memory issues? Maybe by limiting the number of processes run in parallel? Any thoughts would be highly appreciated.

ArnimJenett commented 4 months ago

You may be interested in https://github.com/CoBrALab/optimized_antsMultivariateTemplateConstruction, which is a re-implementation with better error handling and better control over the parallelization of the processes.

Thank you. Yes, at first glance, this appears to be a valid alternative. I will have a deeper look into it over the weekend.

ArnimJenett commented 4 months ago

"Killed" with no explanation is generally memory, yes

On the head-node of my national HPC they kill processes that use more than a certain total amount of CPU time (you're not supposed to process on the head-node). This is another possibility.

I am in the lucky situation that I am all alone on this machine, with no sharing with anybody. I therefore use the local (-c 2) option to run parallel executions (PEXEC). Since I have very limited experience working on (shared) clusters, please allow me a potentially stupid question: do I have to take care of head nodes when using PEXEC, and how would I do that?

Yeah, wall time limits could also trigger that message. The usual tell for memory is that it consistently happens at the end of a stage, as the registration tries to allocate for the higher resolution.

I was not aware that wall time has an impact on PEXEC; does it? I thought this was specific to PBS and/or SLURM.

cookpa commented 4 months ago

I'm running the first two images again, on my local Mac. With -c 2 -j 2, htop shows both running and using about 6 GB RAM each (with two threads each).

It's possible that running with more threads will increase memory use.

Also, I believe I noticed before that the input images are different sizes (meaning the grids have different numbers of voxels). Whatever is chosen as the initial template will determine the memory use, so it's worth checking that the initial template is not too large.

I would run one of the pairwise registrations, e.g., job_0.sh, and monitor its memory use to get an idea of how much it needs.
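For example, something along these lines on Linux (a sketch; /usr/bin/time here is GNU time, not the shell builtin, and job_0.sh stands for whichever generated job script you pick):

# Run one pairwise registration under GNU time and report its peak memory use.
/usr/bin/time -v bash job_0.sh 2>&1 | tee job_0_memlog.txt
grep "Maximum resident set size" job_0_memlog.txt

For reference, the dmesg excerpt above shows individual antsRegistration processes with resident sets of roughly 40 to 65 GB, so on a 512 GB machine only about seven or eight such jobs fit in RAM at once; lowering -j (or the per-process thread count) accordingly should keep the OOM killer away.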

gdevenyi commented 4 months ago

I wonder if the initial images are very misaligned and the first dumb average is making a huge image?

cookpa commented 3 months ago

I wonder if the initial images are very misaligned and the first dumb average is making a huge image?

This is possible if not using an initial template, but the commands in the OP suggest an initial template. But yes, bad misalignment during population averaging can create a very large template. Worth checking.