CellProfiler / CellProfiler-plugins

Community-contributed and experimental CellProfiler modules.
http://plugins.cellprofiler.org/

RunCellPose not clearing GPU memory #135

Closed rsenft1 closed 2 years ago

rsenft1 commented 2 years ago

I'm running into memory errors when using RunCellPose on multiple large images. I will typically get an error that says the GPU is out of memory and no more can be allocated when processing the second image from a batch. Here are some things I've tried so far:

rsenft1 commented 2 years ago

Update: inserting del model at line 284 of runcellpose does NOT cause the missing-model error if only 1 worker is used. So it's something about multiple workers accessing runcellpose; maybe the model deletion in one worker disrupts the other workers?
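
For reference, a minimal sketch of the kind of per-image cleanup being tried here. The helper name release_model is hypothetical and the exact placement (line 284 of runcellpose.py) is specific to the plugin source, so treat this as an illustration rather than the actual patch:

    import gc
    from torch import cuda

    def release_model(model):
        # Drop this worker's reference to the Cellpose model, let Python
        # reclaim the object, then ask PyTorch to hand cached GPU blocks
        # back to the driver.
        del model
        gc.collect()
        if cuda.is_available():
            cuda.empty_cache()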

rsenft1 commented 2 years ago

@bethac07 How do multiple workers access plugins? Is it possible to delete a variable for only the current worker and not have that impact other workers? I'm kind of confused as to how the deletion of the model in one call to runCellPose is affecting other workers' runCellPose.

DavidStirling commented 2 years ago

@rsenft1 I may have some insight here. GPU memory was a consistent problem when developing the plugin; it's not clear to me whether multiple processes can share the same model object on the GPU. I had thought that each worker process would need its own, but I didn't have enough GPU memory to store more than one model at a time anyway. CUDA also doesn't seem to like to release memory of its own accord, which is a bit annoying. Deleting the model after execution should free memory on the GPU, but it forces the model to be reloaded each time an image is processed, which just isn't desirable.

I think at the moment each worker will try to load the model separately, but you might want to try establishing the model as a global variable within prepare_run, which will hopefully carry over into the workers without duplication.

rsenft1 commented 2 years ago

Thanks @DavidStirling, that makes sense. When you say making it a global variable within prepare_run, do you mean adding a new function to runcellpose.py analogous to prepare_run in other modules? I don't see any plugins using prepare_run.

DavidStirling commented 2 years ago

Yes, prepare_run executes before a run starts and so may carry forward to the workers. You could attach the model object to the module itself as something like self.ai_model and so potentially have that same object passed down to each worker.

It's a niche function but there's nothing different about plugins vs normal modules using it.
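
To make the idea concrete, here is a minimal sketch of the "build the model once and reuse it" pattern, written independently of CellProfiler internals; MODEL_CACHE and get_model are hypothetical names, and the CellposeModel constructor arguments assume the cellpose API of this era:

    from cellpose import models

    MODEL_CACHE = {}

    def get_model(model_type="cyto", gpu=True):
        # Build the Cellpose model the first time it is requested and cache it,
        # so repeated image sets handled by the same process reuse one GPU copy.
        key = (model_type, gpu)
        if key not in MODEL_CACHE:
            MODEL_CACHE[key] = models.CellposeModel(gpu=gpu, model_type=model_type)
        return MODEL_CACHE[key]

Note that if each analysis worker is a separate process, each one would still build its own copy; whether the object attached in prepare_run actually survives the hand-off to the workers is exactly what is in question below.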

bethac07 commented 2 years ago

@rsenft1 did you ever try setting the maximum allowable memory fraction? https://pytorch.org/docs/stable/generated/torch.cuda.set_per_process_memory_fraction.html#torch.cuda.set_per_process_memory_fraction

bethac07 commented 2 years ago

(I'm trying it now)

rsenft1 commented 2 years ago

@bethac07 I did try that out, though not in that way. I tried some of the options people suggested here, though some required updating because that post was so old.

When multiple workers are running runCellPose at the same time, it's unclear to me whether the issue is that the eval(model) step is too memory-intensive and crashes when two workers try to do it simultaneously, or whether multiple workers accessing and deleting the model variable is what causes the issue. Also, with multiple workers and deleting the model variable, clearing the cache doesn't appear to release all memory like it does with 1 worker. The memory just keeps building.

bethac07 commented 2 years ago

So, running the Translocation set with a one-module "RunCellPose" pipeline on a p2.xlarge with 4 workers, on average I'm getting a couple (1-4) of GPU allocation errors per 26-image run. Those errors are stochastic as far as I can tell, and if you hit "continue" the set as a whole can complete.

By manually setting the memory fraction to 0.22, with the lines below inserted after the initial selection of the model, in 4 runs I've gotten 0 memory errors.

        if self.use_gpu.value and model.torch:
            # Cap this worker process's share of GPU memory so that several
            # workers can coexist on the same card.
            from torch import cuda
            cuda.set_per_process_memory_fraction(0.22)

This tells me a couple of things:

1) It is definitely technically possible to run multiple workers with the module as-is; it's not a lockout.
2) When multiple workers are running, the errors you're seeing are likely due to each worker being greedy and trying to grab too much memory.

When you only run one worker using the "vanilla" code, do I recall correctly that you do NOT get the memory error? Or at least, not right away?

rsenft1 commented 2 years ago

For me, the error always occurs on the second image and the first always passes fine when using multiple workers and vanilla code (or with deleting the model). When I hit continue to get past the error, the set can complete, but I believe any image after that first one is not actually processed/segmented. For 1 worker and vanilla code, I believe the situation is the same, but I am testing that now to be sure. If I run 1 worker with the additional 'del model' line, there are never memory errors.

rsenft1 commented 2 years ago

For setting the fraction, is that based on the number of workers running? If so, is there a way to get that information within runCellPose? That's great that it seems to fix it!

bethac07 commented 2 years ago

"first always passes fine when using multiple workers and vanilla code (or with deleting the model)"

Yeah, because only one worker runs at a time on the first image, no matter what.

"For setting the fraction, is that based on the number of workers running? If so, is there a way to get that information within runCellPose? That's great that it seems to fix it!"

It fixes it in 2D; in a test I just ran, it does not seem to fix it for 3D. In other words, it solves the issue when it's a "worker collision" issue, but not a "releasing the memory between image sets in the same worker" issue.

bethac07 commented 2 years ago

Actually, it does seem to be helping in a 3D set too: with 4 workers doing 13 identical copies of the 3D monolayer tutorial, I got, I think, 2 memory errors total in 4 or 5 runs with the code I copied in above, whereas with the current code I got something like 8 in one run.

My suspicion as to how this works (NONE of which I am totally certain of, so salt accordingly, but it makes sense based on what else I know about Python memory management) is essentially this: by saying "you're only allowed X amount of memory" to each worker, each worker then handles its own caching and deletion of old allocations, so a problem only arises on the occasions when an individual worker truly doesn't have enough memory for a given set. But if all workers are pulling from the same pool, an individual worker (which can only clear its own allocations/cache) has a much lower likelihood of being able to clear enough data from its own cache, since the other workers are greedy.

With respect to "can CellProfiler set that value automatically": I'll have to take a look, because I'm not sure individual workers can access information about the total number of workers; I suspect they cannot, but I'm not sure. It would be trivial, though, to add another setting, visible only in GPU mode, of "how many workers do you plan to run this on" as a patch.
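
As a rough illustration of that patch idea (the helper limit_gpu_share and the gpu_worker_count setting name are hypothetical; only the torch call itself is real):

    from torch import cuda

    def limit_gpu_share(n_workers_sharing_gpu, device=0):
        # Cap this process's share of GPU memory so that n workers can coexist
        # on the same card without any one of them grabbing the whole pool.
        if cuda.is_available():
            cuda.set_per_process_memory_fraction(1.0 / n_workers_sharing_gpu, device)

    # With the proposed setting, something like:
    #     limit_gpu_share(self.gpu_worker_count.value)
    limit_gpu_share(4)  # four workers each get roughly a quarter of the GPU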

bethac07 commented 2 years ago

So you CAN get the number of workers from the preferences, at least in GUI mode. We still may want to set the "how many workers per GPU" (or "what fraction of the GPU should each worker be given access to") as an explicit setting though, because someone may want to headlessly run multiple workers on the same infrastructure, or use their GPU for something else.
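
For reference, reading that value would look something like the sketch below, assuming get_max_workers() in cellprofiler_core.preferences is the relevant accessor, and bearing in mind the caveat later in this thread that the preference is only reliable in GUI mode:

    # Assumes cellprofiler_core.preferences exposes get_max_workers(); as noted
    # below, the value is only trustworthy when running from the GUI.
    from cellprofiler_core.preferences import get_max_workers

    n_workers = max(get_max_workers(), 1)
    fraction = 1.0 / n_workers  # candidate per-worker GPU memory share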

rsenft1 commented 2 years ago

Yes I agree with it being an explicit setting.

With 1 worker I do not get memory errors, so the del model line is not necessary if you're already running just 1 worker.

When I tried your code, I do get a memory error on the first image, which suggests maybe the fraction allocated isn't big enough for my images. I tried 0.5 as well, and that was also not enough. It seems that multiple workers might just not be suitable for this dataset given the specifics of the images, though I'm pretty confused about that, since it's operating on resized stacks that aren't huge. Why should a 41-slice 500 x 500 z-stack require more than 8 GB of GPU memory to process?

bethac07 commented 2 years ago

It turned out the auto-setting was only working in test mode, not analysis mode, so you can try again now.

I can't figure out a nice way to do it automatically; in analysis mode, whether you're in the GUI or not, is_headless comes back True, and the settings in Preferences are ignored, so there's no reliable way to find out the number of workers running. I tried using prepare_run to work around that, but it doesn't look like any changes made there persist into analysis mode: not global variables, not adding them as attributes on self, nor adding them to the workspace. It's possible that there IS a nice way to do it automatically, but I can't find it.

In any case, setting it manually shouldn't be TOO painful, so likely not worth wasting any more time on right now.

bethac07 commented 2 years ago

(One other thing I am not certain of, but may be true: I think the memory doesn't clear out nicely at the end of a test-mode run, so you may want to close and re-open CellProfiler between running test mode and running analysis mode, or set your memory fractions low enough that each worker plus the test-mode job adds up to less than 1.)

rsenft1 commented 2 years ago

"setting it manually"

Unfortunately this doesn't work for the dataset I'm looking at. I'm closing and reopening CellProfiler each time and I'm only running analysis mode. I still get errors like this: [screenshot of a GPU out-of-memory error]. In this case, I'm running 3 workers, each with a 0.3 fraction of the GPU.

rsenft1 commented 2 years ago

What's strange is that if I print out the total and reserved memory, it seems like it's not allocating very much and there should be a bunch free. [screenshot of the printed memory values]
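
For anyone reproducing this, the kind of diagnostic being described can be printed with standard torch.cuda calls, along these lines:

    # Standard torch.cuda queries; shows how much of the device is reserved by
    # PyTorch's caching allocator versus actually allocated to tensors.
    import torch

    if torch.cuda.is_available():
        dev = torch.cuda.current_device()
        total = torch.cuda.get_device_properties(dev).total_memory
        reserved = torch.cuda.memory_reserved(dev)
        allocated = torch.cuda.memory_allocated(dev)
        print(f"total {total / 1e9:.2f} GB, reserved {reserved / 1e9:.2f} GB, "
              f"allocated {allocated / 1e9:.2f} GB")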

DavidStirling commented 2 years ago

I did have a look at this, and it does seem like there's no simple way to avoid duplication of the model memory on the GPU. It looks like Torch doesn't guarantee proper threaded GPU memory release to begin with (at least until a process exits), which would explain why we see inconsistent problems with freeing memory even after deleting the model object.

Providing the memory share setting seems like a reasonable move here, though that seems to act like a suggestion rather than an actual limit. Perhaps it's worth simply warning in the docs that you'll need lots of memory to run multiple workers.

To solve this more robustly you'd want to keep a single model in shared memory. The key issue for CellProfiler is that sharing memory on the GPU requires the use of Torch's multiprocessing API:

https://pytorch.org/docs/stable/notes/multiprocessing.html

Unfortunately CP uses its own custom multiprocessing solution which is both complicated and rather outdated. In an ideal world that functionality could probably be replaced with a much cleaner setup, but it'd require quite a bit of work. Models built in prepare_run or globals won't work here because the worker system doesn't fork the CP process: it starts a fresh process from scratch. Perhaps you could have a custom singleton class that saves a pointer to a file or passes it through the queue, but we've no idea if that'll work, and it'd probably be more troublesome than replacing the existing worker system would be to begin with.
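
For completeness, the pattern the linked docs describe looks roughly like this toy example (a bare Conv2d standing in for a Cellpose model, assuming a CUDA-capable Linux machine); it is NOT how CellProfiler's worker system currently operates:

    # Toy example of sharing one CUDA model across worker processes via
    # torch.multiprocessing; CUDA sharing requires the "spawn" start method
    # and relies on CUDA IPC, which is not available on Windows.
    import torch
    import torch.multiprocessing as mp

    def worker(rank, model):
        # Each spawned worker receives a handle to the same model parameters
        # rather than loading its own copy onto the GPU.
        x = torch.randn(1, 3, 64, 64, device="cuda")
        with torch.no_grad():
            model(x)

    if __name__ == "__main__":
        mp.set_start_method("spawn", force=True)
        model = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
        model.share_memory()  # no-op for CUDA parameters, but harmless
        mp.spawn(worker, args=(model,), nprocs=2, join=True)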