UOB-AI / UOB-AI.github.io

A repository to host our documentation website.
https://UOB-AI.github.io

kernel dying #51

Closed hasan-alj88 closed 4 months ago

hasan-alj88 commented 4 months ago

Error

Kernel Restarting
The kernel for DADv3/dad/AnomalyDectectionPipline.ipynb appears to have died. It will restart automatically.

At first, I freed up space by removing cache files, and then it started working. However, the kernel has started to die again, and this time I'm sure it's not because of the memory.

The kernel dies when I load the dataset from the TFRecords directory. Note that the same dataset was loaded and used to train the model before with no issues. Kindly check and advise.

asubah commented 4 months ago

Hi, the "kernel died" message means that Jupyter crashed for some reason and the cause of the crash could not be reported. One way to get a better error message is to run the Python code directly from the terminal on the node; then you will get more useful output.
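(Editor's hedged sketch, not part of the original reply: when a kernel dies silently on a native crash, Python's standard faulthandler module plus TensorFlow's TF_CPP_MIN_LOG_LEVEL environment variable can surface extra diagnostics once the same code is run as a plain script from the terminal. The snippet below is an illustrative assumption about how one might wrap the existing code, not the cluster's documented procedure.)

import faulthandler
import os

faulthandler.enable()                      # dump a Python traceback on hard crashes (segfault / abort)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"   # show all TensorFlow C++ log messages

import tensorflow as tf                    # import only after setting the env var

print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))
# ... then call the same loading / training code that crashes in the notebook ...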

hasan-alj88 commented 4 months ago

I tried using the terminal. After "conda activate tf-2.13.0" it ran without any issue or kernel death. Should I use the terminal from now on? Do I need to book a session?

hasan-alj88 commented 4 months ago

I noticed that the GPU was not used when I ran it in the terminal, so the kernel death has something to do with using the GPU. But how do I capture the problem?

asubah commented 4 months ago

That is probably because you are running on the master node... (screenshot attached)

When I said the terminal, I meant you should either use the Jupyter Notebook terminal or use Slurm to get an allocation on one of the nodes and then use the terminal there.

asubah commented 4 months ago

As for your issue, it is probably that your process required more GPU RAM than is available, either because someone else is using the GPU or because your dataset is too big. If it is the latter, you can work on distributing your training across multiple nodes.
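(Editor's hedged aside, not part of the original reply: one common way to reduce peak GPU memory pressure in TensorFlow is to enable memory growth, so the process allocates GPU memory on demand instead of reserving nearly the whole card at startup. A minimal sketch, which must run before any GPU is initialized:)

import tensorflow as tf

# Allocate GPU memory on demand rather than pre-allocating (almost) all of it.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)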

hasan-alj88 commented 4 months ago

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:21:00.0 Off |                    0 |
| N/A   35C    P0              55W / 250W |  39937MiB / 40960MiB |      8%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:81:00.0 Off |                    0 |
| N/A   23C    P0              32W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    523646      C   python                                    39924MiB |
+---------------------------------------------------------------------------------------+

There is 40 GB of RAM in the GPU, which is more than enough to run the model. I ran it on a different online server with only 15 GB of GPU RAM. Actually, before this issue, the model training here was considerably faster, which is why I prefer using this server over the others.

Below is the log from the terminal. I used tf.distribute.MirroredStrategy() and confirmed that the model and variables are distributed using:

print("Number of replicas:", self.distribute_strategy.num_replicas_in_sync)
Num GPUs Available:  2
Physical devices cannot be modified after being initialized
Number of replicas: 2
2024-02-26 17:54:26.968397: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-02-26 17:54:26.968454: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-02-26 17:54:26.973297: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1679] Profiler found 2 GPUs
2024-02-26 17:54:27.111899: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-02-26 17:54:27.112048: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1813] CUPTI activity buffer flushed
2024-02-26 17:54:27.224787: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [10]
         [[{{node Placeholder/_0}}]]
2024-02-26 17:54:27.225262: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [10]
         [[{{node Placeholder/_0}}]]
2024-02-26 17:54:27.277122: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [10]
         [[{{node Placeholder/_0}}]]
2024-02-26 17:54:27.277592: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [10]
         [[{{node Placeholder/_0}}]]
2024-02-26 17:54:27.418054: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [10]
         [[{{node Placeholder/_0}}]]
2024-02-26 17:54:27.418538: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [10]
         [[{{node Placeholder/_0}}]]
Epoch 1/100
2024-02-26 17:54:32.966986: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8904
2024-02-26 17:54:32.967708: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8904
2024-02-26 17:54:33.319764: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at conv_grad_ops_3d.cc:1992 : UNKNOWN: CUDNN_STATUS_BAD_PARAM
in tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc(3494): 'tensor' CUDNN_BACKEND_TENSOR_DESCRIPTOR: Check and Set the CUDNN_ATTR_TENSOR_DIMENSIONS Correctly
2024-02-26 17:54:33.319845: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:1] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): UNKNOWN: CUDNN_STATUS_BAD_PARAM
in tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc(3494): 'tensor' CUDNN_BACKEND_TENSOR_DESCRIPTOR: Check and Set the CUDNN_ATTR_TENSOR_DIMENSIONS Correctly
         [[{{node gradient_tape/replica_1/CNN_Unet_AutoEncoder/conv3d_transpose_10/conv3d_transpose/Conv3DBackpropFilterV2}}]]
2024-02-26 17:54:33.319889: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): UNKNOWN: CUDNN_STATUS_BAD_PARAM
in tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc(3494): 'tensor' CUDNN_BACKEND_TENSOR_DESCRIPTOR: Check and Set the CUDNN_ATTR_TENSOR_DIMENSIONS Correctly
         [[{{node gradient_tape/replica_1/CNN_Unet_AutoEncoder/conv3d_transpose_10/conv3d_transpose/Conv3DBackpropFilterV2}}]]
         [[cond_1/then/_50/cond_1/cond/then/_628/cond_1/cond/cond/then/_676/cond_1/cond/cond/Identity_1/ReadVariableOp/_167]]
2024-02-26 17:54:33.319983: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): UNKNOWN: CUDNN_STATUS_BAD_PARAM
in tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc(3494): 'tensor' CUDNN_BACKEND_TENSOR_DESCRIPTOR: Check and Set the CUDNN_ATTR_TENSOR_DIMENSIONS Correctly
         [[{{node gradient_tape/replica_1/CNN_Unet_AutoEncoder/conv3d_transpose_10/conv3d_transpose/Conv3DBackpropFilterV2}}]]
         [[cond_1/then/_50/cond_1/cond/then/_628/cond_1/cond/cond/then/_676/cond_1/cond/cond/Identity_1/ReadVariableOp/_167]]
         [[div_no_nan/ReadVariableOp/_76]]
2024-02-26 17:54:33.325853: F ./tensorflow/core/util/gpu_launch_config.h:129] Check failed: work_element_count > 0 (0 vs. 0)
Aborted (core dumped)

asubah commented 4 months ago

From the nvidia-smi output, I can see that GPU 0 is almost full; 39 GB is used by another process. Start by directing your code to use the other GPU, either by passing the vacant device to the MirroredStrategy constructor or by setting the following environment variable to the empty GPU.

From terminal:

export CUDA_VISIBLE_DEVICES=1

From Jupyter:

! export CUDA_VISIBLE_DEVICES=1

Here, 1 is the ID of the GPU.

Then we will see what happens.
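(Editor's hedged addition, not part of the original comment: a "! export" in a notebook cell runs in a child shell, so an alternative way to make the selection from inside Python or Jupyter is to set the variable via os.environ before TensorFlow is imported, or to pass the vacant device explicitly to MirroredStrategy. A minimal sketch under those assumptions:)

# Option 1: hide the busy GPU before TensorFlow starts.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # must be set before TensorFlow initializes the GPUs

import tensorflow as tf

# With only GPU 1 visible, TensorFlow addresses it as /gpu:0.
strategy = tf.distribute.MirroredStrategy()

# Option 2 (without the environment variable): name the vacant device explicitly.
# strategy = tf.distribute.MirroredStrategy(devices=["/gpu:1"])

print("Number of replicas:", strategy.num_replicas_in_sync)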

hasan-alj88 commented 4 months ago

It did work (only in the terminal).

asubah commented 4 months ago

Good to hear. It should work in the notebook too. Make sure you add the exclamation mark before the export in the Jupyter notebook cell for it to work.

hasan-alj88 commented 4 months ago

Issue solved.