BiAPoL / stardist-envs

BSD 3-Clause "New" or "Revised" License

Fails to run training notebook #2 #4

Open natalieadye opened 1 year ago

natalieadye commented 1 year ago

Hi guys, I tried retraining our model and I'm running into memory allocation problems. Can you help?? Not sure if this is the proper place to ask, but I thought I'd try, since it worked before in a different environment.


ResourceExhaustedError                    Traceback (most recent call last)
Cell In[46], line 2
      1 median_size = calculate_extents(Y, np.median)
----> 2 fov = np.array(model._axes_tile_overlap('ZYX'))
      3 print(f"median object size: {median_size}")
      4 print(f"network field of view : {fov}")

File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/stardist/models/base.py:1084, in StarDistBase._axes_tile_overlap(self, query_axes)
   1082     self._tile_overlap
   1083 except AttributeError:
-> 1084     self._tile_overlap = self._compute_receptive_field()
   1085 overlap = dict(zip(
   1086     self.config.axes.replace('C',''),
   1087     tuple(max(rf) for rf in self._tile_overlap)
   1088 ))
   1089 return tuple(overlap.get(a,0) for a in query_axes)

File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/stardist/models/base.py:1069, in StarDistBase._compute_receptive_field(self, img_size)
   1067 z = np.zeros_like(x)
   1068 x[(0,)+mid+(slice(None),)] = 1
-> 1069 y = self.keras_model.predict(x)[0][0,...,0]
   1070 y0 = self.keras_model.predict(z)[0][0,...,0]
   1071 grid = tuple((np.array(x.shape[1:-1])/np.array(y.shape)).astype(int))

File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/keras/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67 filtered_tb = _process_traceback_frames(e.__traceback__)
     68 # To get the full stack trace, call:
     69 # tf.debugging.disable_traceback_filtering()
---> 70 raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

File ~/miniconda3/envs/stardist-linux/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     52 try:
     53     ctx.ensure_initialized()
---> 54     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     55                                         inputs, attrs, num_outputs)
     56 except core._NotOkStatusException as e:
     57     if name is not None:

ResourceExhaustedError: Graph execution error:

SameWorkerRecvDone unable to allocate output tensor.
Key: /job:localhost/replica:0/task:0/device:CPU:0;f411f7a4e10f780d;/job:localhost/replica:0/task:0/device:GPU:0;edge_33_IteratorGetNext;0:0
[[{{node IteratorGetNext/_2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_predict_function_2811]

thawn commented 1 year ago

Hi Natalie, thank you very much for posting this issue. Your computer is running out of graphics memory during training. The first thing I would try in this situation is to reduce the batch size. If that does not help, you may need to split your data into smaller tiles.

For more detailed help, it would be great to have access to the code and a small sample data set. I have created a private repository for this on the TU Gitlab, where you can upload your code:

https://gitlab.mn.tu-dresden.de/bia-pol/stardist-training

I am also tagging @lazigu and @zoccoler , since they have more experience with stardist than me.

lazigu commented 1 year ago

Hi, Natalie (@natalieadye),

You mentioned that it worked before in a different environment without out-of-memory errors; could you confirm that the same dataset and StarDist parameters were used? If so, most likely some other process is occupying part of the memory, leaving StarDist less of it. As Till mentioned, the parameters that can be changed to avoid OOM errors are the batch size (I typically used a batch size of 1 on a laptop with 32 GB RAM and an 8 GB GPU) and the patch size, which determines into how many overlapping patches the data is tiled. Also, if StarDist was previously running on the CPU and is now running on the GPU, that alone can cause OOM errors, because GPUs typically have less memory.
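For reference, a minimal sketch of where these two parameters go when building the 3D configuration (the numeric values, model name, and basedir below are placeholders for illustration, not Natalie's actual setup):

from stardist.models import Config3D, StarDist3D

# placeholder values -- adapt to your data; a smaller patch size and
# a batch size of 1 are the main levers for reducing GPU memory use
conf = Config3D(
    n_channel_in     = 1,
    train_patch_size = (32, 96, 96),
    train_batch_size = 1,
)
model = StarDist3D(conf, name='stardist_retrained', basedir='models')

Note that StarDist may complain if the patch size is not compatible with the chosen grid and network depth, so the patch dimensions may need adjusting accordingly.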

zoccoler commented 1 year ago

Hi @natalieadye ,

I agree with @thawn , you may have to use a different batch size. Sending the whole notebook would be ideal.

If you are running the StarDist example notebooks, three cells above the one where you get the error you should find the batch size and patch size, like this:


    train_patch_size = (48,96,96),
    train_batch_size = 2,

Could you send what you have in your notebook? I would try halving these numbers.
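Halving them would look roughly like this (values only for illustration; keep the patch dimensions compatible with the network's grid and pooling factors):

    train_patch_size = (24,48,48),
    train_batch_size = 1,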

natalieadye commented 1 year ago

Thanks all for your comments. @lazigu yes, that's correct: I used the exact same data and parameters in this notebook in a different environment and it worked previously, but not now. That's the troubling thing.

To all: I tried reducing the patch size to 32,96,96 and the batch size to 1 --> still no good. I uploaded my code to the nataliesdata branch of the repository Till started for me above. Note: I only uploaded some test data; there are a lot more files in the training data, but I didn't think it was necessary to upload them all. Let me know if you want more.

natalieadye commented 1 year ago

Ahh, one more thing: I did install gputools in this stardist-linux env: pip install gputools

thawn commented 1 year ago

I tried reducing the patch size to 32,96,96 and the batch size to 1 --> still no good

That indeed sounds like another process (maybe gputools) is using up the GPU memory.

Looking through the training notebook, I did not find any obvious candidates, so I suspect it is some other process or notebook.

You can check the GPU memory usage in a terminal with nvidia-smi:

  1. press ctrl+alt+t to open a terminal
  2. type nvidia-smi

the output looks something like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   30C    P8    26W / 175W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 | <== look here
+-----------------------------------------------------------------------------+

At the bottom (where I wrote <== look here), you will see a list of processes and how much GPU memory they use.

Then shut down these processes (such as other notebooks, Firefox, or even other users who are logged in to the workstation).

Or just post the output here and we may be able to help you with choosing which process to shut down.
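If it is more convenient to check from inside the notebook, a small sketch like the following (just wrapping the same nvidia-smi tool; not part of the StarDist notebooks) prints the per-GPU memory usage:

import subprocess

# query used and total memory for each GPU via nvidia-smi
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)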

If all of the above does not help (e.g. because the process that occupies the memory is an important system process), you can also change the following cell in the notebook:

if use_gpu:
    from csbdeep.utils.tf import limit_gpu_memory
    # adjust as necessary: limit GPU memory to be used by TensorFlow to leave some to OpenCL-based computations
    limit_gpu_memory(0.8,total_memory=24)
    # alternatively, try this:
    #limit_gpu_memory(None, allow_growth=True)

You could try changing 0.8 to 0.7 in the line limit_gpu_memory(0.8, total_memory=24), or comment that line out entirely and use limit_gpu_memory(None, allow_growth=True) instead.
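Written out, a sketch of the two variants of that cell (note that elsewhere in this thread total_memory is given in megabytes, e.g. 4096, so the 24 from the original cell may need to become something like 24576 for a 24 GB card):

if use_gpu:
    from csbdeep.utils.tf import limit_gpu_memory
    # option 1: reserve a smaller fraction of the card for TensorFlow
    limit_gpu_memory(0.7, total_memory=24)  # 24 as in the original cell; possibly 24576 if MB are expected
    # option 2: comment out option 1 and let TensorFlow grow its memory use on demand
    # limit_gpu_memory(None, allow_growth=True)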

natalieadye commented 1 year ago

Hi all. I had already suspected the same, but there was really nothing running on the GPU. In fact, I had restarted the computer several times just to make sure nothing was holding up the GPU, and still no go. I just removed the old environment and started afresh (WITHOUT gputools) and it works. So the problem was really with my installation of gputools. Good to know!

thawn commented 1 year ago

This is indeed good to know. I recommend writing that info about gputools into the head of the training notebook (where you already have the information about closing other programs and logging out other users).

I am closing this issue as resolved.

zoccoler commented 1 year ago

I have some updates, and I will probably create an issue in the stardist repository to check this, but first I need to understand what the problem actually is.

  1. I believe this line in the notebook (Cell 10) is wrong: use_gpu = False and gputools_available(). Because False and ... short-circuits, this always sets use_gpu to False (see the sketch after this list).

  2. Even with use_gpu being False, it looks like training still uses the GPU; here is the output of my test on Windows while training: [screenshot] @thawn, can you check if this is right? Maybe I just can't see this in Windows (check this)?

  3. I then added gputools to the .yml file, so that it installs with conda, and changed the line mentioned above to use_gpu = True and gputools_available(), which, in the environment with gputools, now evaluates to True. Then, while training, I get the same output: [screenshot] I also had to use limit_gpu_memory(None, allow_growth=True) instead of limit_gpu_memory(0.8) in Cell 11, or limit_gpu_memory(0.8, total_memory=4096) in my case (because it asks for total_memory).
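For clarity, here is a sketch of the two cells with the changes from points 1 and 3 applied (the 4096 MB figure is specific to my test GPU; adjust it for your card):

from stardist import gputools_available
from csbdeep.utils.tf import limit_gpu_memory

# Cell 10: 'False and ...' always short-circuits to False, so start from True
use_gpu = True and gputools_available()

# Cell 11: either cap TensorFlow at a fraction of the GPU memory (in MB) ...
#     limit_gpu_memory(0.8, total_memory=4096)
# ... or let it allocate memory on demand
if use_gpu:
    limit_gpu_memory(None, allow_growth=True)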

To sum up: either everything is working without gputools and with use_gpu being False (weird...), or the GPU is not being used properly or not at all, in which case I would like to test a new .yml file that includes gputools.

@natalieadye if you notice something is still off, can you please send me an email so that I can test these options in place?

natalieadye commented 1 year ago

Well, it's strange, because in my 2_training notebook https://gitlab.mn.tu-dresden.de/dyelabatpol/organoids/stardist3d/-/blob/main/2_training.ipynb, that configuration cell already says use_gpu = True and gputools_available(), so that's what I was doing and it wasn't working. (Maybe we changed this together last year when I first started using StarDist???) I didn't change the line about limit_gpu_memory, though.

When I now run the notebook without gputools, the GPU is definitely occupied according to nvidia-smi, so I presume it's being used.

haesleinhuepf commented 1 year ago

Hi all,

just a minor side note. I just had exactly the same error on my laptop. Plugging in an external GPU with more memory fixed the problem. When I run

from stardist import gputools_available
gputools_available()

The output is:

False

That is expected, as I do NOT have gputools installed. My conda environment is very similar to Natalie's, and thus I'd say it's a purely internal StarDist problem, not related to gputools.

thawn commented 1 year ago

@zoccoler the output of nvidia-smi looks suspicious, because the GPU usage is low (only 33-52%) and it does not show any GPU memory being used. On the other hand, I have only used nvidia-smi on Mac/Linux so far, so this behavior may be normal on Windows.

zoccoler commented 1 year ago

@thawn, I was running on the default example dataset, which is why the GPU utilization was low. I was curious about the N/A in the GPU memory usage column (which may be normal on Windows?).

@natalieadye yes, we probably changed that line at some point.

@haesleinhuepf thanks for these extra tests; they seem to show the GPU gets used anyway.

I created an issue at the stardist repository here; let's see what the developers can tell us about it.

haesleinhuepf commented 1 year ago

A hint from Till @thawn that might be worth a try: we can limit the memory used by TensorFlow. This might help with various kinds of memory-related errors.

import pyclesperanto_prototype as cle
from csbdeep.utils.tf import limit_gpu_memory

# configure GPU
gpu_memory_in_mb = int(cle.get_device().device.global_mem_size / 1024 / 1024)
# adjust as necessary: limit GPU memory to be used by 
# TensorFlow to leave some to OpenCL-based computations
limit_gpu_memory(0.5, total_memory=gpu_memory_in_mb)

uschmidt83 commented 1 year ago

There seems to be quite some confusion about GPU use in general and the use_gpu = True and gputools_available() expression specifically.

zoccoler commented 1 year ago

Thanks @uschmidt83 , I think that is clear now.

Then, I suspect we were getting OOM errors because we were running out of GPU memory at the data generation step, but not afterwards.