Closed muratmaga closed 3 years ago
Looks to me like you are out of RAM (or whatever RAM WSL thinks it has available). How much memory do you have on the machine? If you are using WSL (or WSL2), can you check how much RAM and swap space you have? I used to increase those, and even increase the Windows virtual memory, to make things work.
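For reference, on WSL2 the VM's memory cap and swap size can be raised in a `.wslconfig` file in the Windows user profile (then restart WSL); the sizes below are only illustrative, not a recommendation:

```ini
# %UserProfile%\.wslconfig
[wsl2]
memory=48GB   # cap on RAM the WSL2 VM may use
swap=64GB     # size of the WSL2 swap file
```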
On Fri, Oct 1, 2021 at 12:38 PM muratmaga wrote:
We are trying to process some data from the Mouse Phenotyping project. It involves transferring a segmented atlas to new samples. Unprocessed images do not have the correct voxel spacing, so we use an initial landmark transform to scale them correctly using five points, and then proceed with the rest of the registration.
For some of the datasets this pipeline works fine, and for some it fails at the syn2 step with this error. Verbose mode doesn't provide any more information than this. I know that I did not resample the syn1 transforms to the higher resolution, but this clearly doesn't cause an issue for some of the samples, and for this specific sample I did resample the warp field to high resolution and it still crashed.
At this point I don't really know how I can debug this further. I appreciate any suggestions; I am providing a link to the zip archive for the sample data that reproduces the problem. https://app.box.com/s/grtsdi0w2edbn0ptw834l8vovd2v8n9t
```
caught segfault address 0x2aac22bbc970, cause 'memory not mapped'

Traceback:
 1: antsRegistration(fixed = ref.img, moving = tmp.img, typeofTransform = "SyN", initialTransform = syn1$fwdtransforms, verbose = FALSE)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
```
```r
if (!require(SlicerMorphR)) devtools::install_github("SlicerMorph/SlicerMorphR")
library(patchMatchR)
library(ANTsR)

# path to zip archive
setwd("~/Documents/memory_map/Data")
set.seed(5)
ref.lm = read.markups.json("Atlas_lms.mrk.json")
ref.img = antsImageRead("final.nrrd")
ref.img = iMath(ref.img, "Normalize")
ref.img.low = resampleImage(ref.img, c(0.054, 0.054, 0.054))
ref.label = antsImageRead("Embryo_Atlas-labels.nrrd")

tmp.img = antsImageRead("Female_ABCM_K1403-70-e15.5_baseline.nrrd")
tmp.img = iMath(tmp.img, "Normalize")
tmp.img.low = resampleImage(tmp.img, antsGetSpacing(tmp.img) * 2)
tmp.lm = read.markups.json("Female_ABCM_K1403-70-e15.5_baseline_lms.mrk.json")

lm.tx = fitTransformToPairedPoints(fixedPoints = ref.lm, movingPoints = tmp.lm, transformType = "Similarity", domainImage = ref.img)

affine = antsRegistration(fixed = ref.img.low, moving = tmp.img.low, typeofTransform = "Rigid", initialTransform = lm.tx$transform)
syn1 = antsRegistration(fixed = ref.img.low, moving = tmp.img.low, typeofTransform = "SyN", initialTransform = affine$fwdtransforms)

# at this point resultant registration seems reasonable:
new.labels = antsApplyTransforms(fixed = tmp.img, moving = ref.label, transformlist = syn1$invtransforms, interpolator = "genericLabel")
plot(tmp.img, new.labels, nslices = 20, alpha = 0.8)

# crashes here
syn2 = antsRegistration(fixed = ref.img, moving = tmp.img, typeofTransform = "SyN", initialTransform = syn1$fwdtransforms, verbose = TRUE)
```
This is on a physical CentOS 7 machine. RAM does not come close to running out (there are 256 GB available on these nodes). When it completes, R uses about 15-18 GB tops for these registration operations.
Ehm, not sure what is going on. Let's see what others say. I have a system very similar to yours and could try your commands later if needed.
When you have the chance, that would be great. It seems to reproduce for me on different systems.
I found a system with R 4.0.5 and similar package versions, and can't seem to replicate the issue there. So I will use the newer R.
Spoke too soon. Same error on 4.0.5. Open to any other suggestions.
have you run this same thing single threaded?
I can try, but it will take very long. It is through `Sys.setenv(ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS = "1")`, right?
yes
at the very beginning of the script, or in the environment
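To spell out both options (the R line, shown as a comment here, would go at the top of the script instead):

```shell
# In the environment, before launching R:
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1

# Or at the very beginning of the R script:
#   Sys.setenv(ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS = "1")

echo "ITK threads: $ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS"
```

The environment route affects every ITK-based process started from that shell; the `Sys.setenv()` route affects only the current R session, and must run before the first ANTsR call spawns ITK threads.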
Single-threaded crashed on R 3.6.0 with the same error; now trying 4.0.5.
This is still running (it has been about 12 hours). Normally, if syn2 crashes, it does so early on. With a single thread it is hard to tell where we are. Assuming this will work, what else can be done? Single-threaded is too slow for the samples we have; it will probably take a day (or possibly longer) to process a single sample.
Confirmed that the single-threaded version finished successfully after 4 days.
In the multi-threaded context, how many threads is it trying to use? In other words, what happens if you print `Sys.getenv("ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS")`?
These are private compute nodes, so it is using the maximum number of threads available on the node. Depending on the node it is running on, this varies from 24 to 48.
OK, that is a lot of threads. It might be hitting a system memory limit, set either on the machine itself (`ulimit -v`) or by a job scheduler.
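A quick way to check whether such a limit is in effect on the node:

```shell
# Virtual address-space limit for the current shell: "unlimited" or a cap in KB
ulimit -v

# All current resource limits at a glance
ulimit -a
```

If `ulimit -v` prints a number, each process (including every thread's stacks and ITK's per-thread buffers) must fit under that cap, which can surface as crashes rather than clean out-of-memory errors.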
I would try running with 8 threads. If that works consistently you could try increasing to 16. I often don't see much improvement beyond 8, but that might be data dependent.
We are not using any scheduler right now. We manually (and directly) run the scripts on the compute node(s). So we can rule out the job scheduler.
I can try with 8 threads. But for clarification: we have been doing this (running the maximum number of threads in ANTsR) for a while with different datasets. This is the first time we have encountered this "memory not mapped" issue.
I agree that knowing the number of CPU cores and containing the threads used by ANTs is the best approach. Running the maximum available threads can be confusing; the number of threads depends on whether hyperthreading is enabled (48 CPU cores can become 96 available threads). And if you use a node outside the job scheduler, the scheduler may keep sending other people's jobs to that node, which can conflict with yours.
On my 32-core system (with hyperthreading enabled) and 250 GB of memory, I would run 30 registration jobs, each with 2 threads, to exploit the resources in full. You can also check how many threads and how much memory a single registration job uses with top or htop.
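A non-interactive alternative to top/htop on Linux, assuming the registration process's PID is known (the shell's own PID, `$$`, stands in here for illustration):

```shell
# nlwp = number of threads, rss = resident memory (KB) for the given PID
ps -o nlwp=,rss= -p $$
```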
Thanks for the feedback. We already know the number of threads and cores on the compute nodes; the variable is set for each compute node through:
```shell
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=$(cat /proc/cpuinfo | grep processor | wc -l)
```
We have three types of compute nodes with 12, 20, and 24 physical cores, and hyperthreading is enabled on all of them (so the reported thread count is twice the number of physical cores). Hence the reason for setting the variable like this.
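If the intent is instead one thread per physical core rather than per hyperthread, the count can be derived the same way; a sketch, assuming a Linux node with an English-locale `lscpu` available:

```shell
# Logical processors, including hyperthreads
logical=$(grep -c ^processor /proc/cpuinfo)

# Threads per physical core: 2 with hyperthreading on, 1 otherwise
tpc=$(lscpu | awk -F: '/^Thread\(s\) per core/ { gsub(/ /, "", $2); print $2 }')

physical=$(( logical / tpc ))
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=$physical
echo "$ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS"
```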
We don't use the scheduler (for this particular case) because it is a private system; nobody else executes anything on these nodes (apart from baseline OS processes). We have used the antsMultivariateTemplateConstruction2.sh script previously on this exact system (which runs SGE), with similar types of datasets, while utilizing all cores/threads on the system for a single registration. The reason we do that is that our microCT datasets are one or two orders of magnitude larger than typical medical volumes. Thus, if you run on 1-2 cores, it takes days to finish a single dataset (the example above took 4 days, or 92 h to be precise). Plus, due to the large memory consumption of these datasets at later stages of registration, if you go down to a few cores but run multiple simultaneous tasks per node, you are bound to run out of memory on the node, crashing all your tasks.
So far, the one-registration-task-per-node paradigm, utilizing all threads, has worked well for us, until this particular dataset.
One more reflection:
For the jobs that do complete using all threads, this registration takes about 4 hours (volume sizes are quite similar). Compared to the single-threaded version, the speedup is about 23x, which is almost the number of physical cores. I will try either disabling hyperthreading or just running as many threads as there are cores.
I wasn't able to debug any further, but the two-level registration is the cause of the problem. The "memory not mapped" error does not happen at all, regardless of the number of threads or resources, if I skip the intermediate step. So, instead of doing:
```r
affine = antsRegistration(fixed = ref.img.low, moving = tmp.img.low, typeofTransform = "Rigid", initialTransform = lm.tx$transform)
syn1 = antsRegistration(fixed = ref.img.low, moving = tmp.img.low, typeofTransform = "SyN", initialTransform = affine$fwdtransforms)

# at this point resultant registration seems reasonable:
new.labels = antsApplyTransforms(fixed = tmp.img, moving = ref.label, transformlist = syn1$invtransforms, interpolator = "genericLabel")
plot(tmp.img, new.labels, nslices = 20, alpha = 0.8)

# crashes here
syn2 = antsRegistration(fixed = ref.img, moving = tmp.img, typeofTransform = "SyN", initialTransform = syn1$fwdtransforms, verbose = TRUE)
```
If I go straight to syn2 from affine, the crash doesn't seem to happen:
```r
affine = antsRegistration(fixed = ref.img.low, moving = tmp.img.low, typeofTransform = "Rigid", initialTransform = lm.tx$transform)
syn2 = antsRegistration(fixed = ref.img, moving = tmp.img, typeofTransform = "SyN", initialTransform = affine$fwdtransforms, verbose = TRUE)
```
We can live with this, but our test data showed that we got better results, in terms of label overlap and landmark localization, with the two-step approach. So if anyone has other ideas, I would be happy to hear them.