Closed — hasan-alj88 closed this issue 4 months ago
Hi. The "kernel died" message means that Jupyter crashed for some reason and the crash reason couldn't be reported. One way to get a better error message is to run the Python code directly from the terminal on the node; then you will get more useful output.
I tried using the terminal. After "conda activate tf-2.13.0" it ran without an issue or the kernel dying. Should I use the terminal from now on? Do I need to book a session?
I noticed that the GPU was not used when I ran in the terminal, so the kernel death has something to do with using the GPU. But how do I capture the problem?
That is probably because you are running on the master node.
When I said the terminal, I meant you should either use the Jupyter Notebook terminal or use Slurm to get an allocation on one of the compute nodes and then use the terminal there.
As for your issue, your process probably required more GPU RAM than was available, either because someone else is using the GPU or because your dataset is too big. If it is the latter, you can work on distributing your training across multiple nodes.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:21:00.0 Off | 0 |
| N/A 35C P0 55W / 250W | 39937MiB / 40960MiB | 8% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:81:00.0 Off | 0 |
| N/A 23C P0 32W / 250W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 523646 C python 39924MiB |
+---------------------------------------------------------------------------------------+
There is 40 GB of RAM on the GPU, which is more than enough to run the model; I ran it on a different online server with only 15 GB of GPU RAM. Actually, before this issue, model training here was considerably faster, which is why I prefer this setup over the others.
Below is the log from the terminal. I used tf.distribute.MirroredStrategy() and confirmed the replica setup using:
print("Number of replicas:", self.distribute_strategy.num_replicas_in_sync)
Num GPUs Available: 2
Physical devices cannot be modified after being initialized
Number of replicas: 2
2024-02-26 17:54:26.968397: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-02-26 17:54:26.968454: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-02-26 17:54:26.973297: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1679] Profiler found 2 GPUs
2024-02-26 17:54:27.111899: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-02-26 17:54:27.112048: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1813] CUPTI activity buffer flushed
2024-02-26 17:54:27.224787: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [10]
[[{{node Placeholder/_0}}]]
2024-02-26 17:54:27.225262: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [10]
[[{{node Placeholder/_0}}]]
[the same INVALID_ARGUMENT message is repeated four more times]
Epoch 1/100
2024-02-26 17:54:32.966986: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8904
2024-02-26 17:54:32.967708: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8904
2024-02-26 17:54:33.319764: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at conv_grad_ops_3d.cc:1992 : UNKNOWN: CUDNN_STATUS_BAD_PARAM
in tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc(3494): 'tensor' CUDNN_BACKEND_TENSOR_DESCRIPTOR: Check and Set the CUDNN_ATTR_TENSOR_DIMENSIONS Correctly
2024-02-26 17:54:33.319845: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:1] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): UNKNOWN: CUDNN_STATUS_BAD_PARAM
in tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc(3494): 'tensor' CUDNN_BACKEND_TENSOR_DESCRIPTOR: Check and Set the CUDNN_ATTR_TENSOR_DIMENSIONS Correctly
[[{{node gradient_tape/replica_1/CNN_Unet_AutoEncoder/conv3d_transpose_10/conv3d_transpose/Conv3DBackpropFilterV2}}]]
2024-02-26 17:54:33.319889: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): UNKNOWN: CUDNN_STATUS_BAD_PARAM
in tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc(3494): 'tensor' CUDNN_BACKEND_TENSOR_DESCRIPTOR: Check and Set the CUDNN_ATTR_TENSOR_DIMENSIONS Correctly
[[{{node gradient_tape/replica_1/CNN_Unet_AutoEncoder/conv3d_transpose_10/conv3d_transpose/Conv3DBackpropFilterV2}}]]
[[cond_1/then/_50/cond_1/cond/then/_628/cond_1/cond/cond/then/_676/cond_1/cond/cond/Identity_1/ReadVariableOp/_167]]
2024-02-26 17:54:33.319983: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): UNKNOWN: CUDNN_STATUS_BAD_PARAM
in tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc(3494): 'tensor' CUDNN_BACKEND_TENSOR_DESCRIPTOR: Check and Set the CUDNN_ATTR_TENSOR_DIMENSIONS Correctly
[[{{node gradient_tape/replica_1/CNN_Unet_AutoEncoder/conv3d_transpose_10/conv3d_transpose/Conv3DBackpropFilterV2}}]]
[[cond_1/then/_50/cond_1/cond/then/_628/cond_1/cond/cond/then/_676/cond_1/cond/cond/Identity_1/ReadVariableOp/_167]]
[[div_no_nan/ReadVariableOp/_76]]
2024-02-26 17:54:33.325853: F ./tensorflow/core/util/gpu_launch_config.h:129] Check failed: work_element_count > 0 (0 vs. 0)
Aborted (core dumped)
From the nvidia-smi output, I can see that GPU 0 is almost full: about 39 GB are used by another process. Start by directing your code to use the other GPU, either by passing the vacant device to the MirroredStrategy constructor (e.g. tf.distribute.MirroredStrategy(devices=["/GPU:1"])) or by setting the following environment variable to the empty GPU.
From terminal:
export CUDA_VISIBLE_DEVICES=1
From Jupyter, set it from Python before TensorFlow is imported (note that "! export ..." runs in a throwaway subshell and does not affect the kernel):
%env CUDA_VISIBLE_DEVICES=1
Here 1 is the ID of the vacant GPU.
Then we will see what happens.
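The same effect can be had from inside Python, as long as the variable is set before TensorFlow (or any other CUDA-using library) is imported — a minimal sketch:

```python
import os

# Make only GPU 1 visible to this process. This must happen BEFORE
# TensorFlow is imported, because device visibility is fixed when the
# CUDA runtime initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# From here on, the framework sees the single visible device as GPU 0.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 1
```

Note that after remapping, code that previously addressed "/GPU:1" should address "/GPU:0", since the framework renumbers the visible devices from zero.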
It did work (only in the terminal, though).
Good to hear. To make it work in the notebook as well, set the variable before TensorFlow is imported, either with the %env magic or with os.environ["CUDA_VISIBLE_DEVICES"] = "1" in a Python cell; a plain export behind an exclamation mark runs in a subshell and never reaches the kernel process.
Issue solved.
At first, I freed up space by removing cache files, and then it started working. However, the kernel is starting to die again, and this time I'm sure it's not because of memory.
The kernel dies when I load the dataset from the tfrecords directory. Note that the same dataset was loaded and used to train the model with no issues before. Kindly check and advise.
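One cheap sanity check before involving TensorFlow at all is to confirm that the tfrecords directory still contains the files you expect — a cache cleanup can remove more than intended, and an empty input pipeline may crash hard later instead of failing with a clear message. A stdlib-only sketch (the helper name list_tfrecords and the "*.tfrecord*" glob pattern are assumptions; adjust them to your file naming):

```python
import glob
import os

def list_tfrecords(tfrecord_dir):
    """Return the sorted list of TFRecord files under tfrecord_dir.

    Failing early here is much clearer than the eventual hard crash an
    empty input pipeline can produce (e.g. 'Check failed:
    work_element_count > 0 (0 vs. 0)').
    """
    pattern = os.path.join(tfrecord_dir, "*.tfrecord*")
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no TFRecord files match {pattern!r}")
    return files
```

If this raises, the dataset directory is the problem, not the GPU; if it returns the expected file list, the next step would be iterating a few records from the dataset in the terminal to see where the crash actually occurs.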