Closed ArnimJenett closed 1 month ago
The pexec options are confusingly written, something I need to address. Running -c 2 -j 60
means it will try to run up to 60 parallel processes. If you have 12 input images, it will run a maximum of 12 processes to do all the pairwise registrations. However, it is not wise enough to know that it should assign 60/12 = 5 cores to each registration.
The per-process number of threads is controlled by the environment variable ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS. If you want to use 60 cores total, I would do
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=5
If ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS is undefined, each registration might attempt to use all the CPUs on the system, which could increase the memory demands.
From your logs, I suspect it's failing when calling N4BiasFieldCorrection. Could you try setting the number of threads as above, and then running with -n 0?
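For example, with 12 inputs and a 60-core budget, the call would look something like this (the remaining template-construction options are elided):

# 60 cores / 12 parallel registrations = 5 threads per process
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=5
# -c 2 = local parallel execution (pexec), -j caps the number of simultaneous jobs,
# -n 0 turns off the N4 bias correction step
antsMultivariateTemplateConstruction2.sh -d 3 -c 2 -j 12 -n 0 [other options] *.nii.gz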
I found another potential problem:
$ PrintHeader C1-201208Fc_968k_5dpf_AJ-110-DR_1024_Nh_merge-cc-crp.head.lr.nii.gz
Spacing [0.00245586, 0.00245586, 0.00245088]
Origin [0, 0, 0]
Direction
1 0 0
0 1 0
0 0 1
Size : 515 275 247
Image Dimensions : [515, 275, 247]
Bounding Box : {[0 0 0], [1.26477 0.67536 0.605367]}
Voxel Spacing : [0.00245586, 0.00245586, 0.00245088]
Intensity Range : [0, 4095]
Mean Intensity : 315.585
Direction Cos Mtx. :
1 0 0
0 1 0
0 0 1
Voxel->RAS x-form :
Image Metadata:
ITK_original_direction of unsupported type N3itk6MatrixIdLj3ELj3EEE
ITK_original_spacing of unsupported type NSt3__16vectorIdNS_9allocatorIdEEEE
ITK_sform_corrected = NO
bitpix = 16
cal_max = 4096
cal_min = 0
datatype = 4
dim[0] = 4
dim[1] = 515
dim[2] = 275
dim[3] = 247
dim[4] = 1
dim[5] = 1
dim[6] = 0
dim[7] = 0
dim_info = 0
intent_code = 0
intent_p1 = 0
intent_p2 = 0
intent_p3 = 0
nifti_type = 1
pixdim[0] = 0
pixdim[1] = 2.45586
pixdim[2] = 2.45586
pixdim[3] = 2.45088
pixdim[4] = 1
pixdim[5] = 0
pixdim[6] = 0
pixdim[7] = 0
qfac = 0
qform_code = 0
qform_code_name = NIFTI_XFORM_UNKNOWN
qoffset_x = 0
qoffset_y = 0
qoffset_z = 0
qto_xyz of unsupported type N3itk6MatrixIfLj4ELj4EEE
quatern_b = 0
quatern_c = 0
quatern_d = 0
scl_inter = 0
scl_slope = 1
sform_code = 0
sform_code_name = NIFTI_XFORM_UNKNOWN
slice_code = 0
slice_duration = 0
slice_end = 246
slice_start = 0
srow_x = 0 0 0 0
srow_y = 0 0 0 0
srow_z = 0 0 0 0
toffset = 0
vox_offset = 352
xyzt_units = 3
I see here that your xyzt_units=3, indicating microns. The first images have xyzt_units=0, which will default to being read as mm. The default parameters for N4 are set up for a human-sized brain.
I would advise setting all the headers consistently such that the pixel dimensions are read as 2.45 mm.
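One way to do that, as a sketch (assuming your ANTs build includes the SetSpacing utility; nibabel or c3d would work equally well, and the filenames here are placeholders):

# assumed usage: SetSpacing imageDimension inputImage outputImage sx sy sz
SetSpacing 3 subject_microns.nii.gz subject_mm.nii.gz 2.45586 2.45586 2.45088
# confirm the result
PrintHeader subject_mm.nii.gz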
Thank you, @cookpa , for your quick and thorough explanation.
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS was already set to 6, but I set it to 5 anyway and ran the pexec with -n 0 as suggested.
So far it appears to be working as it should (still running), even before fixing the images, which I am in the process of doing right now for consistency's sake.
Thanks a lot!
Great, I'm glad it's working.
Internally, ITK always works in mm coordinates, so images in other units should be converted at run time. In reality, this was not always done consistently (e.g., I made some fixes here). So registrations that appear to work with different spatial units might break in the future when the ANTs ITK version is updated.
I think the approach taken in the first images, effectively scaling up the images so that the pixels are ~1 mm, is the right thing to do. There are various numerical issues with very small spacings that are quite hard to debug, and many small-animal imagers do this scaling to improve stability with ANTs and other tools.
I also noticed the input images were stored without qform or sform transforms. Reading such images relies on backwards compatibility with Analyze, and ITK will not write NIFTI images without a valid header transform - the output of registration will have a qform and sform. ITK will try to keep things consistent, but you need to be careful that your template does not get flipped with respect to the input images, and that the input images all have the same orientation. You can preview how ANTs interprets the image space by loading the images into ITK-SNAP.
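For example, a quick consistency check over all inputs (assuming they sit in one directory):

for img in *.nii.gz; do
  echo "== ${img}"
  # the 3x3 direction matrix ITK/ANTs will use for this image
  PrintHeader "${img}" | grep -A 3 '^Direction'
done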
As it took me a little while to read up on qform and sform, I started a test run with images consistently set to microns (xyzt_units = 3). This approach failed, producing the GenericAffine artefacts, so I am now running the same process on images (in original resolution) modified to yield
...
Voxel Spacing : [1.22793, 1.22793, 1.23]
...
qform_code_name = NIFTI_XFORM_UNKNOWN
...
sform_code_name = NIFTI_XFORM_UNKNOWN
...
xyzt_units = 2
... in PrintHeader and will report back here in the coming days.
I highly appreciate the help, thank you very much.
Yes, sorry if I was unclear, I think setting qform_code_name = NIFTI_XFORM_UNKNOWN is the right way. Then the spacing will be interpreted as [1.22793, 1.22793, 1.23] mm, which will help the registration.
Unfortunately, I can only report that the template building based on 12 modified (units = mm) images didn't run through either - but for a new reason, I guess. From stderr-build-template.txt:
57 Killed
58 Transform file does not exist: ./TPS_2819d30GenericAffine.mat
59 Killed
60 Transform file does not exist: ./TPS_2856f00GenericAffine.mat
61 Killed
62 Transform file does not exist: ./TPS_968k60GenericAffine.mat
63 Killed
64 Transform file does not exist: ./TPS_2790i40GenericAffine.mat
65 Killed
66 Transform file does not exist: ./TPS_2822l50GenericAffine.mat
67 Killed
68 Transform file does not exist: ./TPS_2790k70GenericAffine.mat
… which means the following jobs failed:
job_0000_metriclog.txt
job_0003_metriclog.txt
job_0004_metriclog.txt
job_0005_metriclog.txt
job_0006_metriclog.txt
job_0007_metriclog.txt
I find that remarkable, as these are not the last jobs (nor the first).
From job_0003_metriclog.txt (randomly selected from the failed jobs):
448 Total field sigma (voxel space) = 0
449 Update field time sigma = 0
450 Total field time sigma = 0
451 Number of time indices = 0
452 Number of time point samples = 0
453 Registration using 3 total stages.
454
455 Stage 0
456 iterations = 1000x500x250x0
457 convergence threshold = 1e-06
458 convergence window size = 10
459 number of levels = 4
460 using the Mattes MI metric (number of bins = 32, weight = 1, use gradient filter = 0)
461 preprocessing: winsorizing the image intensities
462 preprocessing: histogram matching the images
463 Shrink factors (level 1 out of 4): [6, 6, 6]
464 Shrink factors (level 2 out of 4): [4, 4, 4]
465 Shrink factors (level 3 out of 4): [2, 2, 2]
466 Shrink factors (level 4 out of 4): [1, 1, 1]
467 smoothing sigmas per level: [4, 2, 1, 0]
468 regular sampling (percentage = 0.25)
469
470 *** Running Euler3DTransform registration ***
471
472 DIAGNOSTIC,Iteration,metricValue,convergenceValue,ITERATION_TIME_INDEX,SINCE_LAST
473 2DIAGNOSTIC, 1, -3.822195529938e-01, inf, 3.3437e+01, 3.3437e+01,
…
586 2DIAGNOSTIC, 31, -5.463204383850e-01, 1.026196628118e-06, 5.8184e+02, 4.9209e+00,
587 DIAGNOSTIC,Iteration,metricValue,convergenceValue,ITERATION_TIME_INDEX,SINCE_LAST
588 2DIAGNOSTIC, 1, -5.288691520691e-01, inf, 7.2418e+02, 1.4234e+02,
589 Elapsed time (stage 0): 8.4275e+02
590
591
592 Stage 1
593 iterations = 1000x500x250x0
594 convergence threshold = 1.0000e-06
595 convergence window size = 10
596 number of levels = 4
597 using the Mattes MI metric (number of bins = 32, weight = 1.0000e+00, use gradient filter = 0)
598 preprocessing: winsorizing the image intensities
599 preprocessing: histogram matching the images
600 Shrink factors (level 1 out of 4): [6, 6, 6]
601 Shrink factors (level 2 out of 4): [4, 4, 4]
602 Shrink factors (level 3 out of 4): [2, 2, 2]
603 Shrink factors (level 4 out of 4): [1, 1, 1]
604 smoothing sigmas per level: [4, 2, 1, 0]
605 regular sampling (percentage = 2.5000e-01)
606
607 *** Running AffineTransform registration ***
608
609 DIAGNOSTIC,Iteration,metricValue,convergenceValue,ITERATION_TIME_INDEX,SINCE_LAST
610 2DIAGNOSTIC, 1, -6.035985946655e-01, inf, 1.7990e+01, 1.7990e+01
…
696 2DIAGNOSTIC, 32, -5.574958324432e-01, 1.005075887406e-06, 5.1666e+02, 9.2150e+00,
697 DIAGNOSTIC,Iteration,metricValue,convergenceValue,ITERATION_TIME_INDEX,SINCE_LAST
698 2DIAGNOSTIC, 1, -5.373499393463e-01, inf, 7.5350e+02, 2.3684e+02,
699 Elapsed time (stage 1): 8.1839e+02
700
701
702 Stage 2
703 iterations = 100x100x70x20
704 convergence threshold = 1.0000e-09
705 convergence window size = 10
706 number of levels = 4
707 using the CC metric (radius = 4, weight = 1.0000e+00, use gradient filter = 0)
708 preprocessing: winsorizing the image intensities
709 preprocessing: histogram matching the images
710 Shrink factors (level 1 out of 4): [6, 6, 6]
711 Shrink factors (level 2 out of 4): [4, 4, 4]
712 Shrink factors (level 3 out of 4): [2, 2, 2]
713 Shrink factors (level 4 out of 4): [1, 1, 1]
714 smoothing sigmas per level: [3, 2, 1, 0]
715 Using default NONE metricSamplingStrategy
716
717 *** Running SyN registration (varianceForUpdateField = 3.0000e+00, varianceForTotalField = 0.0000e+00) ***
…
940 1DIAGNOSTIC, 70, -7.165402173996e-01, 3.239360012230e-05, 4.6647e+04, 5.3133e+02,
941 Using single precision for computations.
942 Input scalar image: /CACHE/tps/labdata/projects/AJ-110-DR/5dpfTemplate/240703.experiment-2/2819e-template/2819d.nii.gz
943 Reference image: TPS_template0.nii.gz
944 Can't read initial transform ./TPS_2819d30GenericAffine.mat
[end of file]
It looks to me like the affine registration has already failed. Is that correct?
The stdout-build-template.txt keeps going but ends with:
509 Transform file does not exist: ./TPS_2819d30GenericAffine.mat
511 Transform file does not exist: ./TPS_2856f00GenericAffine.mat
513 Transform file does not exist: ./TPS_968k60GenericAffine.mat
515 Transform file does not exist: ./TPS_2790i40GenericAffine.mat
517 Transform file does not exist: ./TPS_2822l50GenericAffine.mat
519 Transform file does not exist: ./TPS_2790k70GenericAffine.mat
…
4479 Using single precision for computations.
4480 Input scalar image: /CACHE/tps/labdata/projects/AJ-110-DR/5dpfTemplate/240703.experiment-2/2819e-template/2819e.nii.gz
4481 Reference image: TPS_template0.nii.gz
4482 =============================================================================
4483 The composite transform comprises the following transforms (in order):
4484 1. ./rigid12_0GenericAffine.mat (type = AffineTransform)
4485 =============================================================================
4486 Default pixel value: 0
4487 Interpolation type: LinearInterpolateImageFunction
4488 Output warped image: ./rigid12_0_2819e.nii.gz
4489 --------------------------------------------------------------------------------------
4490 shapeupdatetotemplate()
4491 --------------------------------------------------------------------------------------
4492 $1: dim: 3
4493 $2: template: TPS_template0.nii.gz
4494 $3: templatename: TPS_template
4495 $4: outputname: TPS_
4496 warped files: TPS_template0*WarpedToTemplate.nii.gz
4497 $5: gradientstep: -0.25
4498 $6: whichtemplate: 0
4499 $7: statsmethod: 1
4500 $8: sharpenmethod: 1
[end of file]
I took the liberty of attaching the cited files, and I'd highly appreciate it if you could have a look at them and help me overcome this (new) hiccup (or should I rather start a new issue on this?). job_0003_metriclog.txt stderr-build-template.txt stdout-build-template.txt
Can we see the end of dmesg? Is this OOM or something else?
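For example, something like:

dmesg -T | tail -n 50
dmesg -T | grep -iE 'out of memory|killed process'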
You may be interested in https://github.com/CoBrALab/optimized_antsMultivariateTemplateConstruction which is a re-implementation with better error handling as well as better control of the parallelization of the processes.
"Killed" with no explanation is generally memory, yes
"Killed" with no explanation is generally memory, yes
On the head-node of my national HPC they kill processes that use more than a certain total amount of CPU time (you're not supposed to process on the head-node). This is another possibility.
Yeah, wall time limits could also trigger that message. The usual tell for memory is that it consistently happens at the end of a stage, as the registration tries to allocate for the higher resolution.
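One quick check along those lines is to look at where each failed job's metric log stops, e.g.:

tail -n 5 job_000*_metriclog.txt

If they consistently stop right as a level finishes, memory at the next, finer level is the likely culprit.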
Can we see the end of dmesg? Is this OOM or something else?
@gdevenyi, well caught! Thanks for chiming in. Here is the end of dmesg:
[4972973.615983] [1770887] 0 1770887 1876 433 53248 67 0 ANTSpexecTPS2.s
[4972973.615986] [1770899] 0 1770899 723 194 45056 27 0 sh
[4972973.615988] [1770901] 0 1770901 16890166 14569897 122376192 387745 0 antsRegistratio
[4972973.615990] [1770983] 0 1770983 723 190 45056 27 0 sh
[4972973.615992] [1770985] 0 1770985 17896565 15858742 129691648 13780 0 antsRegistratio
[4972973.615994] [1771067] 0 1771067 723 213 45056 27 0 sh
[4972973.615996] [1771069] 0 1771069 18222528 15967095 133554176 384409 0 antsRegistratio
[4972973.615998] [1771151] 0 1771151 723 193 49152 28 0 sh
[4972973.616000] [1771153] 0 1771153 15978290 12905419 108384256 304609 0 antsRegistratio
[4972973.616002] [1771235] 0 1771235 723 213 49152 27 0 sh
[4972973.616004] [1771237] 0 1771237 18015997 16253452 134443008 214095 0 antsRegistratio
[4972973.616007] [1771487] 0 1771487 723 183 45056 28 0 sh
[4972973.616008] [1771489] 0 1771489 17949291 15797375 132964352 482795 0 antsRegistratio
[4972973.616010] [1771739] 0 1771739 723 190 45056 27 0 sh
[4972973.616012] [1771741] 0 1771741 17470818 15722193 129409024 114326 0 antsRegistratio
[4972973.616014] [1771823] 0 1771823 723 189 45056 27 0 sh
[4972973.616016] [1771825] 0 1771825 15462806 13716113 112566272 16369 0 antsRegistratio
[4972973.616018] [1771907] 0 1771907 723 187 45056 23 0 sh
[4972973.616020] [1771909] 0 1771909 11940304 10241602 85774336 150525 0 antsRegistratio
[4972973.616023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-1742.scope,task=antsRegistratio,pid=1771237,uid=0
[4972973.616138] Out of memory: Killed process 1771237 (antsRegistratio) total-vm:72063988kB, anon-rss:65013808kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:131292kB oom_score_adj:0
[4972980.251097] oom_reaper: reaped process 1771237 (antsRegistratio), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
However, that is not the only appearance of oom in dmesg. I extracted them (dmesg | grep oom | tee 240705.oom.dmesg.txt) and attached the resulting file here. For context, I also attached a complete dump of dmesg.
240705.dmesg.txt
240705.oom.dmesg.txt
Is there a clever way (which does not include downsampling) to overcome these memory issues? Maybe by limiting the number of processes run in parallel? Any thoughts would be highly appreciated.
You may be interested in https://github.com/CoBrALab/optimized_antsMultivariateTemplateConstruction which is a re-implementation with better error handling as well as better control of the parallelization of the processes.
Thank you. Yes, at first glance, this appears to be a valid alternative. I will have a deeper look into it over the weekend.
"Killed" with no explanation is generally memory, yes
On the head-node of my national HPC they kill processes that use more than a certain total amount of CPU time (you're not supposed to process on the head-node). This is another possibility.
I am in the lucky situation that I am all alone on this machine - no sharing with anybody.
I therefore use the local (-c 2) option to run parallel executions (PEXEC).
Since I have very limited experience with working on (shared) clusters, please allow me a potentially stupid question: do I have to take care of head nodes when using PEXEC, and how would I do that?
Yeah, wall time limits could also trigger that message. The usual tell for memory is that it consistently happens at the end of a stage, as the registration tries to allocate for the higher resolution.
I was not aware that wall time has an impact on PEXEC - does it? I thought this was specific to PBS and/or SLURM.
I'm running the first two images again, on my local Mac. With -c 2 -j 2, htop is showing both running and using about 6 GB of RAM each (with two threads each).
It's possible that running with more threads will increase memory use.
Also, I believe I noticed before that the input images are different sizes (meaning the grids have different numbers of voxels). Whatever is chosen as the initial template will determine the memory use, so it's worth checking that the initial template is not too large.
I would run one of the pairwise registrations, eg job_0.sh, and monitor its memory use to get an idea of how much it needs.
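On Linux, one way to do that (assuming GNU time is installed at /usr/bin/time):

# run one pairwise registration and record its peak memory (max RSS)
/usr/bin/time -v ./job_0.sh > job_0_test.log 2>&1
grep 'Maximum resident set size' job_0_test.log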
I wonder if the initial images are very misaligned and the first dumb average is making a huge image?
I wonder if the initial images are very misaligned and the first dumb average is making a huge image?
This is possible if not using an initial template, but the commands in the OP suggest an initial template. But yes, bad misalignment during population averaging can create a very large template. Worth checking.
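A quick check of the initial template's grid, using the template name from the logs above:

PrintHeader TPS_template0.nii.gz | grep -E 'Size|Voxel Spacing'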
Operating system and version
Ubuntu 22.04.4
CPU architecture
x86_64 (PC, Intel Mac, other Intel/AMD)
ANTs code version
2.5.1.post73-g3a788f8
ANTs installation type
Compiled from source
Summary of the problem
Hi all, I am trying to build a template by calling antsMultivariateTemplateConstruction2.sh with 12 files... and after reading the first 5 files properly, I get a memory allocation error from ITK 7 times... which I cannot explain.
I saw this problem popping up in other issues (#917, #934) but the explanations there do not fit my situation. The system I am using has 72 cores and 512 GB of RAM, and the files are not overly big. From PrintHeader I computed this from one of the bigger files (.nii.gz)
I experimented with ANTSPexec.sh, implementing a memory limit to be checked before submitting another job (roughly along the lines of the sketch below). However, neither free nor vmstat detects low memory before or while ITK throws the error. Can somebody please help me understand why I am getting these errors - and how to overcome them?
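For reference, the memory limit check mentioned above was roughly of this kind - an illustrative sketch with a hypothetical threshold, not the exact modification:

# wait until at least ~50 GB are reported available before launching the next job
min_free_kb=$((50 * 1024 * 1024))
while [ "$(awk '/MemAvailable/ {print $2}' /proc/meminfo)" -lt "$min_free_kb" ]; do
  sleep 30
done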
Commands to reproduce the problem.
Output of the command with verbose output.
stdout-build-template.txt stderr-build-template.txt
Data to reproduce the problem
data downloadable from https://filesender.renater.fr/?s=download&token=6c3fc7b4-b96a-4511-aa6c-c6617d4a6f51