Closed: iladrien-hub closed this issue 2 months ago
TensorFlow 2.16 and newer use Keras 3 instead of Keras 2, which introduces some breaking changes in the training API.
This issue is related to reset_metrics, which was removed in Keras 3. I fixed it in https://github.com/KichangKim/DeepDanbooru/commit/b96a1bb96dc7322bf04fc8d3ffbbcbc2f112b568, but it is not tested yet.
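For context, Keras 3 dropped the old Model.reset_metrics() helper, and the usual replacement is to reset each metric object directly. A minimal sketch of that pattern (illustrative only, not necessarily the exact code in the commit above):

```python
import keras

def reset_metrics(model: keras.Model) -> None:
    # Keras 3 removed Model.reset_metrics(); each metric still exposes
    # reset_state(), so clearing them manually between epochs works the same.
    for metric in model.metrics:
        metric.reset_state()
```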
It seems to have helped: this particular error disappeared. I wanted to do something similar myself. However, at startup the GPU load looks like spikes (short bursts of high load followed by 0% load). And after the first log line with metrics, it just prints "Killed", with no stack trace, and the process terminates.
Ah... the speed also looks very suspicious: 0.2 samples/s on the GPU vs. 0.9-1 samples/s on the CPU...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1724745862.160106 8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.313694 8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.313800 8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.317874 8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.317956 8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.318031 8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.468660 8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.468864 8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.468976 8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Using Adam optimizer ...
Loading tags ...
Creating model (resnet_custom_v2) ...
Model : (None, 299, 299, 3) -> (None, 18970)
Using loss : binary_crossentropy
Loading database ...
No checkpoint. Starting new training ... (2024-08-27 11:04:26.575391)
Shuffling samples (epoch 0) ...
Trying to change learning rate to 0.001 ...
Learning rate is changed to <KerasVariable shape=(), dtype=float32, path=adam/learning_rate> ...
WARNING:tensorflow:From /home/iladrien/.local/lib/python3.10/site-packages/deepdanbooru/data/dataset_wrapper.py:31: ignore_errors (from tensorflow.python.data.experimental.ops.error_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.ignore_errors` instead.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1724745945.999515 8918 service.cc:146] XLA service 0x7fa04c004080 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1724745945.999646 8918 service.cc:154] StreamExecutor device (0): NVIDIA GeForce RTX 3070 Laptop GPU, Compute Capability 8.6
I0000 00:00:1724746013.980505 8918 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
WARNING:tensorflow:5 out of the last 5 calls to <function TensorFlowTrainer.make_train_function.<locals>.one_step_on_iterator at 0x7fa0a4364040> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
WARNING:tensorflow:6 out of the last 6 calls to <function TensorFlowTrainer.make_train_function.<locals>.one_step_on_iterator at 0x7fa0a4364040> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
Epoch[0] Loss=0.390290, P=0.001383, R=0.084507, F1=0.002721, Speed = 0.2 samples/s, 0.29 %, ETA = 2024-08-28 12:00:14
Killed
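As an aside on the "0.2 samples/s on GPU vs. ~1 on CPU" observation above: one quick way to check whether ops are actually running on the GPU (rather than silently falling back to the CPU with host-device copies) is to enable device-placement logging. A minimal, illustrative check:

```python
import tensorflow as tf

# Log which device every op executes on; CPU fallbacks and host<->device
# copies become visible in the output.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((1024, 1024))
b = tf.random.uniform((1024, 1024))
c = tf.matmul(a, b)
print(c.device)  # expected to end in GPU:0 when CUDA is working
```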
To fix your trouble, try downloading this fix; I saw it in another issue.
Bro, this is very much a scam... I'm not going to install this without at least a link to an issue that explains what problem this thing solves, and preferably how it solves it.
I reported and deleted the spam comments.
Thank you)
Regarding the issue... I had an idea to try running this whole thing in VirtualBox... I don't know if it's a good idea, but in theory I could install any version of CUDA there...
There are many compatibility issues in DeepDanbooru related to the latest TensorFlow. I will fix them in the near future, but there is no ETA.
Fixed by 98d9315ab702e177de706825f9cee981c9afb928. If the app crashes silently, it may be a GPU-related issue in TensorFlow.
In case anyone else runs into the silent "Killed" problem... In my case the cause was OOM; I added a bit more RAM and everything started working.
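For anyone hitting the same thing: a bare "Killed" with no traceback is usually the Linux OOM killer, which dmesg (look for "Out of memory: Killed process ...") can confirm. If adding RAM is not an option, keeping the input pipeline's host-memory footprint small sometimes helps. A rough, illustrative sketch (not DeepDanbooru's actual pipeline; names like load_and_decode are placeholders):

```python
import tensorflow as tf

def make_dataset(image_paths, load_and_decode, batch_size=16):
    # Shuffle file paths, not decoded images, so the shuffle buffer stays small.
    ds = tf.data.Dataset.from_tensor_slices(image_paths)
    ds = ds.shuffle(buffer_size=4096)
    # Decode lazily instead of cache()-ing the whole dataset in RAM.
    ds = ds.map(load_and_decode, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size)
    # A fixed, small prefetch bound limits how many batches sit in host memory.
    return ds.prefetch(1)
```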
I get this error when trying to train a model.
Unfortunately, I can't use TensorFlow below version 2.17.0, because it flatly refuses to detect the GPU.
I suspect I will have to downgrade CUDA to be compatible with tensorflow-2.7.0 and python-3.7... However, I would still rather not do that, because that means CUDA 11.2, which is already four years old... and it's a bit of a pain in the ass to install...
So... I am writing this in the last hope of solving the problem differently.
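As a side note on the GPU-detection point above: a quick way to see whether a given TensorFlow build can see the GPU at all, and which CUDA/cuDNN versions it was built against (so they can be matched to the installed driver), is something like this (illustrative, nothing DeepDanbooru-specific):

```python
import tensorflow as tf

# Lists GPUs TensorFlow can actually see; an empty list means no CUDA device
# was detected by this build.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# Reports the CUDA/cuDNN versions this TensorFlow wheel was built against,
# which must be compatible with the installed driver and toolkit.
print("Build info:", dict(tf.sysconfig.get_build_info()))
```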