Ability to estimate RAM usage from input parameters? (help with memory issues)

dorianps / LESYMAP

Lesion to Symptom Mapping in R

https://dorianps.github.io/LESYMAP/

Apache License 2.0

33 stars 11 forks source link

Ability to estimate RAM usage from input parameters? (help with memory issues) #15

Closed brussj closed 5 years ago

brussj commented 5 years ago

Dr. Pustina-

I'm trying to run LESYMAP using sccan but am running into memory limits and my jobs are unable to finish processing. My data is 2mm isotropic (91x109x91) using an initial 5% minimum threshold. I ran "sink()" in R to capture output before the session terminated:

12:49:06 Running LESYMAP 0.0.0.9202 12:49:06 Checking a few things... 12:49:06 Loading behavioral data...402 scores found. 12:49:11 Filenames as input, checking lesion values on 1st image... 12:49:11 SCCAN method: ignoring patch, nperm, and multiple comparison... 12:49:11 Searching voxels lesioned >= 5% subjects...31717 found 12:50:17 noPatch true - Patches will not be used... 12:50:17 Computing lesion matrix... 402x31717 12:50:25 Running analysis: sccan ... Searching for optimal sparseness: lower/upper bound: -0.9 / 0.9 cvRepetitions: 3 nFolds: 4 sparsenessPenalty: 0.03 optim tolerance: 0.03 12:50:26 Checking sparseness -0.212 ... CV correlation 0.246 (0.364) (cost=0.760) 13:01:32 Checking sparseness 0.212 ... CV correlation 0.250 (0.362) (cost=0.756) 13:12:37 Checking sparseness 0.475 ... CV correlation 0.254 (0.356) (cost=0.761) 13:24:11 Checking sparseness 0.121 ... CV correlation 0.244 (0.359) (cost=0.760) 13:35:39 Checking sparseness 0.313 ... CV correlation 0.255 (0.361) (cost=0.755) 13:48:01 Checking sparseness 0.357 ... CV correlation 0.253 (0.361) (cost=0.758) 14:00:24 Checking sparseness 0.276 ... CV correlation 0.254 (0.364) (cost=0.755) 14:13:16 Checking sparseness 0.293 ...

I'm running this on our high performance computing cluster and I'm tapping out the max memory allotment at 423 GB. I've reached out to the admins for the cluster and, unfortunately, this is the max memory I can use. They had asked if LESYMAP could or is written around support for MPI uage. I couldn't see in the documentation if it was or how I would even go about using that option.

Other options I can think to test:

1) Differing input thresholds to fit under my memory limitations. Is there a way to calculate, given the matrix (e.g. 402x31717), an estimate RAM usage? Do you know of a way in R to do this?

2) Given the sparseness values before the kill, set the lower/upperSpareseness ranges as a starting point. Given the above, do you have any recommendations?

I would prefer not to go this route as I would need to wait for a job to die before being able to hone in on a range later. It generally takes between 2-3 hours to either complete or terminate.

Does LESYMAP clear up RAM between sparseness checks or is it retained at each level? It looks to bounce up and down until it can figure a real range then looks to hone in on that result but if it's keeping the 3/4 to 1/4 results for every stage, that would eat up a lot of memory.

3) Is it possible to do sparseness checks at different resolutions? Similar to what, for example, antsRegistration does with convergence settings, could this downsample input masks into levels of less resolution to run sparseness checks, then use that result to inform the next level (now at a higher resolution) and so on up until full resolution?

Thanks in advance to any help you could offer me.

Thanks, Joel

dorianps commented 5 years ago

Hi Joel,

This problem is new and unusual, I haven't heard of memory issues before. You don't have a lot of points (one third of the default lesymap example which uses a matrix of 131 x 300,000), and should not need 400Gb to run SCCAN. Most of the time I have run it with 4-8 Gb. R usually takes care by itself of "garbage collection", but I have also placed calls to run garbage collection after each SCCAN iteration.

The only thing that comes in mind is some sort of unusual R installation, or old version. Can you report all versions of:

Infrastructure
OS
R
ANTsR (packageVersion('ANTsR'))
ANTsRCore (packageVersion('ANTsRCore'))
ITKR (packageVersion('ITKR'))

I still don't know how this might help, but let's have a look. To answer your questions:

No there is no memory estimate, calculating that goes beyond my knowledge, maybe @stnava knows. But again, your problem is unusual, most people get away with a laptop RAM.
Your analysis seem to converge between 0.25 and 0.35, you can try searching there. But this would be a temporary solution.
Hierarchical sparseness search at various resolution is an interesting idea but I don't see it's utility. It would require assumptions and tests to make sure that a sparseness found at low-res can simply be multiplicated by the voxel number increase in the next step. You may also miss small areas when starting at low resolution, which complicates the interpretation of the findings. And it would probably be a waste of time because most LESYMAP analyses already work within 2-24 hours. But if anyone want's to test that possibility, it is rather easy to try, just run lesymap twice using the second time sparsness x ratio of voxel increase in the matrix. I don't have time to try this now.

Dorian

brussj commented 5 years ago

Dorian-

Thank you for the prompt response. I'm not sure what you mean by infrastructure but can gladly report the other versions:

OS = CentOS-7.4 Linux R = 3.5.1 ANTsR = 0.4.7 ANTsRCore = 0.6.3 ITKR = 0.4.17.5

I'll try, in the meantime, to set lower/upperSparseness between 0.25 and 0.35 and see if it completes. If it does, I'll see what qacct tells me about memory usage.

-Joel

dorianps commented 5 years ago

Thanks Joel, I am trying it in an SGE environment too, with some bad_alloc failures at the first SCCAN run, but it's running now directly on the system without qlogin/qrsh.

Please paste also the memory failure you see on the screen eventually.

dorianps commented 5 years ago

You might be right on this bug. R keeps collecting RAM usage and is now at 110Gb on my system at the second round of sccan optimization. Maybe it has to do with recent changes in ITK/ANTsR. Will monitor this further and see where it goes.

brussj commented 5 years ago

Unfortunately, if I go over memory allocations I just get a "severed connection" message. I've been smart enough to write down my job IDs when I submit so that I can run "qacct -j jobid" to check maxvmem.

I think I may know what the issue is here. I think, like others have posted, I have a strange two-tailed distribution to my scores and, unfortunately, the (in this case) lower/better scores are what sccan is latching on to. While waiting for my job to complete (setting sparseness between 0.25 and 0.35) I decided to also run BMfast with two.sided. My results were as expected:

BMFast_output

I also plotted the errors (lower is better) for the subjects and it is indeed skewed towards the lower end:

WCST_PE_testedSubs

I'm currently running sccan two ways, one with lower/upperSparseness set between 0.25 and 0.35 and another forcing between -0.09 and -0.005. I'll report back tomorrow when they finish.

Related to this, in the "what's new" section, I think there are a few typos that I got tripped up on.

If you want to check this issue on your own data, run LESYMAP twice: one time with lowersparseness=-9; upperSparseness=-0.005 and one time with lowerSparseness=0.005; upperSparseness=0.9.

For the negative condition, should that be -0.9 and not -9? I'm assuming it should be -0.9 to match the (opposite sign) upperSparseness in the positive condition. Also, it would need to be lowerSparseness not lowersparseness. Without thinking much about it, I had copy/pasted and couldn't figure out why it was setting some other value until I realized that the capital "S" mattered.

Thanks for all your hard work on this package and for being so quick and helpful.

brussj commented 5 years ago

I'm crossing my fingers that it's hoarding RAM, coupled with (in my case) the fact that it looks to be grabbing the tail I don't want.

dorianps commented 5 years ago

Thanks for spotting the typos.

I don't think your data have anything to do with this, sounds like a bug introduced in recent times. I could replicate the issue in a CENTOS server with 250Gb memory.

> lesydata = file.path(find.package('LESYMAP'),'extdata')
> filenames = Sys.glob(file.path(lesydata, 'lesions', 'Subject*.nii.gz'))
> behavior = Sys.glob(file.path(lesydata, 'behavior', 'behavior.txt'))
> lsm = lesymap(filenames, behavior, method = 'sccan')
17:37:33 Running LESYMAP 0.0.0.9205
17:37:33 Checking a few things...
17:37:33 Loading behavioral data...131 scores found.
17:37:34 Filenames as input, checking lesion values on 1st image...
17:37:35 SCCAN method: ignoring patch, nperm, and multiple comparison...
17:37:35 Searching voxels lesioned >= 10% subjects...326828 found
17:38:00 noPatch true - Patches will not be used...
17:38:01 Computing lesion matrix... 131x326828
17:38:15 Running analysis: sccan ...
       Searching for optimal sparseness:
         lower/upper bound:       -0.9 / 0.9
         cvRepetitions:           3
         nFolds:                  4
         sparsenessPenalty:       0.03
         optim tolerance:         0.03
17:38:19        Checking sparseness -0.212 ... CV correlation 0.827 (0.846) (cost=0.180)
17:51:02        Checking sparseness 0.212 ... CV correlation 0.827 (0.846) (cost=0.180)
18:05:20        Checking sparseness 0.475 ...Killed

There might be something wrong with one of these things:

A change in ANTs for the underlying C function that runs sparseDecom
The switch to ITK v5 which is happening in the last few months.
An issue with R 3.5 garbage collection.
Some other issue with cmake or gcc compiler.

@stnava , any idea where is the memory overhead coming from, or what should I try first? Looks like is not in LESYMAP or ANTsR, neither of which have changed much, but in the underlying ANTs binaries.

stnava commented 5 years ago

I don’t think any of those things have changed much ... except itkv5.

Anyway I would try a complete source install from itkr onwards and see what happens.

On Mon, Mar 11, 2019 at 6:46 PM dorianps notifications@github.com wrote:

Thanks for spotting the typos.

I don't think your data have anything to do with this, sounds like a bug introduced in recent times. I could replicate the issue in a CENTOS server with 250Gb memory.

lesydata = file.path(find.package('LESYMAP'),'extdata') filenames = Sys.glob(file.path(lesydata, 'lesions', 'Subject*.nii.gz')) behavior = Sys.glob(file.path(lesydata, 'behavior', 'behavior.txt')) lsm = lesymap(filenames, behavior, method = 'sccan') 17:37:33 Running LESYMAP 0.0.0.9205 17:37:33 Checking a few things... 17:37:33 Loading behavioral data...131 scores found. 17:37:34 Filenames as input, checking lesion values on 1st image... 17:37:35 SCCAN method: ignoring patch, nperm, and multiple comparison... 17:37:35 Searching voxels lesioned >= 10% subjects...326828 found 17:38:00 noPatch true - Patches will not be used... 17:38:01 Computing lesion matrix... 131x326828 17:38:15 Running analysis: sccan ... Searching for optimal sparseness: lower/upper bound: -0.9 / 0.9 cvRepetitions: 3 nFolds: 4 sparsenessPenalty: 0.03 optim tolerance: 0.03 17:38:19 Checking sparseness -0.212 ... CV correlation 0.827 (0.846) (cost=0.180) 17:51:02 Checking sparseness 0.212 ... CV correlation 0.827 (0.846) (cost=0.180) 18:05:20 Checking sparseness 0.475 ...Killed

There might be something wrong with one of these things:

A change in ANTs for the underlying C function that runs sparseDecom

The switch to ITK v5 which is happening in the last few months.

An issue with R 3.5 garbage collection.

Some other issue with cmake or gcc compiler.

@stnava https://github.com/stnava , any idea where is the memory overhead coming from, or what should I try first? Looks like is not in LESYMAP or ANTsR, neither of which have changed much, but in the underlying ANTs binaries.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dorianps/LESYMAP/issues/15#issuecomment-471770233, or mute the thread https://github.com/notifications/unsubscribe-auth/AATyfq4d_JZ75gGyfyBNN-Gp9vr0Dzroks5vVtzHgaJpZM4bpfWr .

--

brian

brussj commented 5 years ago

I was thinking about my test scores incorrectly. The higher score = worse performance. When I re-ran sccan using 0.25-0.35 for lower/upperSparseness, I got a result similar to what I saw with BMfast, just smaller (BMFast in cool color, sccan in hot/pink color):

WCST_PE_forcedPos_output

Checking my sink() log:

17:18:32 Running LESYMAP 0.0.0.9202 
17:18:32 Checking a few things...
17:18:32 Loading behavioral data...402 scores found.
17:18:35 Filenames as input, checking lesion values on 1st image...
17:18:35 SCCAN method: ignoring patch, nperm, and multiple comparison...
17:18:35 Searching voxels lesioned >= 5% subjects...31717 found
17:19:40 noPatch true - Patches will not be used...
17:19:40 Computing lesion matrix... 402x31717
17:19:48 Running analysis: sccan ...
       Searching for optimal sparseness:
         lower/upper bound:   0.25 / 0.35
         cvRepetitions:       3
         nFolds:          4
         sparsenessPenalty:   0.03
         optim tolerance:     0.03
17:19:49        Checking sparseness 0.288 ... CV correlation 0.252 (0.366) (cost=0.756)
17:30:49        Checking sparseness 0.312 ... CV correlation 0.252 (0.365) (cost=0.757)
17:41:49        Checking sparseness 0.274 ... CV correlation 0.254 (0.367) (cost=0.754)
17:52:50        Checking sparseness 0.264 ... CV correlation 0.254 (0.365) (cost=0.754)
18:03:46        Checking sparseness 0.264 ... CV correlation 0.254 (0.365) (cost=0.754)
       Found optimal sparsenes 0.264 (CV corr=0.254 p=2.4e-07)
       Calling SCCAN with:
            Components = 1
            Use ranks = 1
            Sparseness = 0.264
            Cluster threshold = 150
            Smooth sigma = 0.4
            Iterations = 20
            maxBased = FALSE
            directionalSCCAN = TRUE
            Permutations = 0
18:15:49 Preparing images...
18:15:49 Logging call details...
18:15:49 Saving on disk...
18:15:49 Done! 57.3 mins

It took a bit less than an hour but checking my qacct report, used 272.64 GB or memory!

Forcing -0.9 to -0.005 nearly completed but died from lack of memory:

16:50:14        Checking sparseness -0.558 ... CV correlation 0.263 (0.354) (cost=0.753)
16:59:39        Checking sparseness -0.347 ... CV correlation 0.264 (0.364) (cost=0.746)
17:10:42        Checking sparseness -0.216 ... CV correlation 0.265 (0.367) (cost=0.742)
17:21:33        Checking sparseness -0.136 ... CV correlation 0.248 (0.358) (cost=0.756)
17:32:14        Checking sparseness -0.266 ... CV correlation 0.265 (0.367) (cost=0.743)
17:43:16        Checking sparseness -0.159 ... CV correlation 0.254 (0.361) (cost=0.750)
17:54:02        Checking sparseness -0.194 ... CV correlation 0.265 (0.363) (cost=0.741)
18:04:57        Checking sparseness -0.181 ... CV correlation 0.264 (0.361) (cost=0.742)
18:15:53        Checking sparseness -0.204 ...

At this point should I just sit tight? I could try a complete source install from itkr as Brian suggested but I'd have to poke my sysadmin and he gets grumpy over these things.

dorianps commented 5 years ago

Joel, you need to reinstall everything. Reinstalling fixed the issues in my system and SCCAN in LESYMAP does not take more than 2-4 Gb with the example below:

library(LESYMAP)

lesydata = file.path(find.package('LESYMAP'),'extdata')
filenames = Sys.glob(file.path(lesydata, 'lesions', 'Subject*.nii.gz'))
behavior = Sys.glob(file.path(lesydata, 'behavior', 'behavior.txt'))
lsm = lesymap(filenames, behavior, method = 'sccan')

Don't know why, but reinstalling LESYMAP, etc., this time required the V8-devel system package, which was missing in my case. You can check if it is installed in your system with yum list installed V8-devel.

Then you can reinstall packages in R with something like this:

remove.packages(c('LESYMAP', 'ANTsR', 'ANTsRCore', 'ITKR'))
install.packages('devtools') # if you don't have it already
devtools::install_github('dorianps/LESYMAP')

This worked fine in my workstation, and fixed the memory issue.

Important: set `ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS`

You need to set the environment variable ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS before running R. This is very important because all ITK based software can get confused in cluster or workstation environments, see this issue. If you don't set that variable, ITK software will try to read the number of CPU cores from the system itself, and will try to use all. This brings to all sorts of problems, sometimes you may not be able to even open a file (it happened to me last night). The problem may not be related to the bug that caused memory overhead in SCCAN, but it will most certainly be a problem if you don't set this variable. Start with:

export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1

and try if SCCAN is working properly and doesn't consume much memory. I could run three SCCAN analyses simultaneously, one of which used 10 threads, and the overall memory consumption in the system was 12 Gb.

If you are using SGE, you need to set ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS within each job. You can ask your system admin to set it to $NSLOTS automatically, or set it yourself in your scripts.

Dorian

P.s. I will also need to fix some minor bugs in LESYMAP, looks like the code in lsm_sccan is missing some parts I wrote last year.

brussj commented 5 years ago

Dorian-

I'll see if I can get this reinstalled soon and will report back. Thanks for checking on your end.

Regarding number of threads, in my case of pulling one giant node offline, qlogin will default $NSLOTS to the number of cores set in the call (in this case, 56). Are you suggesting that I override that and force number of threads to 1? In your example listed, you set to 1 thread but then mention that one job used 10 threads. How did that happen?

-Joel

dorianps commented 5 years ago

I tried several things last night and forget the details now, but surely when I forgot to set ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS in a qlogin/qrsh job, it failed quickly, I could not even load an image. This might be because it hit the job memory limit (7Gb or 12Gb, I forget) while using all 64 available threads. Running the job with 10 ITK threads directly on the system (not an SGE job) worked fine anyway, didn't try SGE with 10 slots though.

Yes, you should set ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1 to test LESYMAP, then increase to 10, 20, or 56 if it works. The increase in slots might need an increase in memory as well. You can try my above example, it may take like 4 hours to complete in 1 slot but at least you know that it shouldn't take more than 3-4 Gb at that level.

The only thing you might need from your admin is installing system packages, like cmake, V8-devel, etc. Everything else you can install by yourself in R.

brussj commented 5 years ago

Dorian-

It will be a while before I can test out the reinstall, though I fully expect things to mirror your experience (RAM issues fixed). I have to wait for admins to install the V8 dependency before I can rebuild, hopefully sooner than later. When I finally get a rebuild I'll try re-running and will report back so that this can be closed. Thanks again for taking the time to look into and fix the issues I ran up against.

-Joel

dorianps commented 5 years ago

No rush on my side. We also found out a recent bug in ANTsR that was causing SCCAN results to be null. The bug is now fixed, I am now resintalling ANTsR/ANTsRCore/ITKR to verify it's fixed. You will need to install these packages as well to make sure the bug is gone.

Beside this, I am preparing some new changes to LESYMAP to fix some minor bugs and add some functionality. You shouldn't wait for this, it does not affect much what you are trying to do, just letting you ahead.

P.s. You can try installing the R packages yourself with devtools and see if it complains about missing V8-devel.

dorianps commented 5 years ago

Can confirm that the fix works. The problem this resolves is a text users might have seen in SCCAN:

Cluster thresholding removed all voxels.

dorianps commented 5 years ago

Fyi, SGE jobs indeed need more memory when more ITK threads are used, but not because of SCCAN, it just needs memory when spawning threads to open files.

This is the example I used each time:

library(LESYMAP)
lesydata = file.path(find.package('LESYMAP'),'extdata')
filenames = Sys.glob(file.path(lesydata, 'lesions', 'Subject*.nii.gz'))
behavior = Sys.glob(file.path(lesydata, 'behavior', 'behavior.txt'))
lsm = lesymap(filenames, behavior, method = 'sccan')

Tests using interactive jobs in SGE 8.1.9, CentOS Linux release 7.6.1810:

FAIL: 10 slots, 50 Gb memory, ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=20 # since multithreading is enabled SUCCESS: 10 slots, 50 Gb memory, ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=10 SUCCESS: 10 slots, 50 Gb memory, ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=2 FAIL: 10 slots, 12 Gb memory, ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=10 FAIL: 10 slots, 12 Gb memory, ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=5 SUCCESS: 10 slots, 12 Gb memory, ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=3 SUCCESS: 2 slots, 20 Gb memory, ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4

By success I mean that it did not fail early and SCCAN started.

When it fails, it cannot even load images, looks like this:

11:16:12 Running LESYMAP 0.0.0.9210
11:16:12 Checking a few things...
11:16:12 Loading behavioral data...131 scores found.
11:16:13 Filenames as input, checking lesion values on 1st image...
terminate called without an active exception

Example of successful run with 2 ITK threads, 10 SGE slots, 50Gb memory

> lesydata = file.path(find.package('LESYMAP'),'extdata')
> filenames = Sys.glob(file.path(lesydata, 'lesions', 'Subject*.nii.gz'))
> behavior = Sys.glob(file.path(lesydata, 'behavior', 'behavior.txt'))
> lsm = lesymap(filenames, behavior, method = 'sccan')
00:38:16 Running LESYMAP 0.0.0.9210
00:38:16 Checking a few things...
00:38:16 Loading behavioral data...131 scores found.
00:38:16 Filenames as input, checking lesion values on 1st image...
00:38:17 SCCAN method: ignoring patch, nperm, and multiple comparison...
00:38:17 Searching voxels lesioned >= 10% subjects...326828 found
00:38:43 noPatch true - Patches will not be used...
00:38:44 Computing lesion matrix... 131x326828
00:38:57 Running analysis: sccan ...
       Searching for optimal sparseness:
         lower/upper bound:       -0.9 / 0.9
         cvRepetitions:           3
         nFolds:                  4
         sparsenessPenalty:       0.03
         optim tolerance:         0.03
00:39:00        Checking sparseness -0.212 ... CV correlation 0.826 (0.845) (cost=0.180)
00:54:25        Checking sparseness 0.212 ... CV correlation 0.826 (0.845) (cost=0.180)
01:10:16        Checking sparseness 0.475 ... CV correlation 0.809 (0.829) (cost=0.205)
01:24:24        Checking sparseness 0 ... CV correlation 0.805 (0.841) (cost=0.195)
02:09:52        Checking sparseness 0.313 ... CV correlation 0.821 (0.839) (cost=0.188)
02:24:10        Checking sparseness 0.131 ... CV correlation 0.833 (0.852) (cost=0.171)
02:40:55        Checking sparseness 0.081 ... CV correlation 0.837 (0.857) (cost=0.165)
03:08:28        Checking sparseness 0.05 ... CV correlation 0.839 (0.859) (cost=0.162)
03:52:00        Checking sparseness 0.031 ... CV correlation 0.841 (0.860) (cost=0.160)
04:37:44        Checking sparseness 0.019 ... CV correlation 0.841 (0.861) (cost=0.160)
05:24:56        Checking sparseness 0.019 ... CV correlation 0.841 (0.861) (cost=0.160)
       Found optimal sparsenes 0.019 (CV corr=0.841 p=3.13e-36)
       Calling SCCAN with:
            Components:          1
            Use ranks:           1
            Sparseness:          0.019
            Cluster threshold:   150
            Smooth sigma:        0.4
            Iterations:          20
            maxBased:            FALSE
            directionalSCCAN:    TRUE
            optimizeSparseness:  TRUE
            validateSparseness:  FALSE
06:14:02 Preparing images...
06:14:02 Logging call details...
06:14:02 Done! 5.6 hours

dorianps / LESYMAP

Ability to estimate RAM usage from input parameters? (help with memory issues) #15

Important: set ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS

Tests using interactive jobs in SGE 8.1.9, CentOS Linux release 7.6.1810:

When it fails, it cannot even load images, looks like this:

Example of successful run with 2 ITK threads, 10 SGE slots, 50Gb memory

Important: set `ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS`