KichangKim / DeepDanbooru

AI based multi-label girl image classification system, implemented by using TensorFlow.
MIT License
2.63k stars 260 forks

TensorFlowTrainer.train_on_batch() got an unexpected keyword argument 'reset_metrics' #107

Closed iladrien-hub closed 2 months ago

iladrien-hub commented 2 months ago

I get this error when trying to train a model.

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/iladrien/.local/lib/python3.10/site-packages/deepdanbooru/__main__.py", line 247, in <module>
    main()
  File "/home/iladrien/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/iladrien/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/iladrien/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/iladrien/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/iladrien/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/iladrien/.local/lib/python3.10/site-packages/deepdanbooru/__main__.py", line 106, in train_project
    dd.commands.train_project(project_path, source_model)
  File "/home/iladrien/.local/lib/python3.10/site-packages/deepdanbooru/commands/train_project.py", line 233, in train_project
    step_result = model.train_on_batch(
TypeError: TensorFlowTrainer.train_on_batch() got an unexpected keyword argument 'reset_metrics'

Unfortunately, I can't use TensorFlow below version 2.17.0, because it flatly refuses to detect the GPU.

I suspect that I will have to downgrade CUDA to a version compatible with tensorflow-2.7.0 and python-3.7... However, I would rather not do this, because that means CUDA 11.2, which is already four years old... it's a bit of a pain in the ass to install it...

So... I am writing this in the last hope to solve the problem differently.

KichangKim commented 2 months ago

TensorFlow 2.16 and newer use Keras 3 instead of Keras 2, which brings some breaking changes to the training API.

This issue is caused by the reset_metrics argument of train_on_batch, which was removed in Keras 3. I fixed it in b96a1bb96, but have not tested it yet.
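For code that has to run against both Keras 2 and Keras 3, one possible workaround is to pass only the keyword arguments that the installed train_on_batch accepts. The sketch below is a minimal illustration of that idea and is not necessarily what the commit above does; train_on_batch_compat, OldStyleModel and NewStyleModel are hypothetical names, and the two classes are plain stand-ins rather than real Keras models.

```python
import inspect


def train_on_batch_compat(model, x, y, **kwargs):
    """Call model.train_on_batch, dropping keyword arguments it does not accept."""
    params = inspect.signature(model.train_on_batch).parameters
    # If the method itself takes **kwargs, pass everything through unchanged.
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_any:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return model.train_on_batch(x, y, **kwargs)


class OldStyleModel:
    """Hypothetical stand-in for a Keras 2 model (accepts reset_metrics)."""

    def train_on_batch(self, x, y=None, reset_metrics=True):
        return ("old", reset_metrics)


class NewStyleModel:
    """Hypothetical stand-in for a Keras 3 model (no reset_metrics)."""

    def train_on_batch(self, x, y=None):
        return ("new", None)


if __name__ == "__main__":
    # reset_metrics is forwarded to the old-style model and dropped for the new one.
    print(train_on_batch_compat(OldStyleModel(), [1], [0], reset_metrics=False))
    print(train_on_batch_compat(NewStyleModel(), [1], [0], reset_metrics=False))
```

Silently dropping an argument can change behaviour (here, whether metrics are reset between batches), so a real fix would also need to decide how metrics should be reset on Keras 3.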

iladrien-hub commented 2 months ago

> I fixed it by https://github.com/KichangKim/DeepDanbooru/commit/b96a1bb96dc7322bf04fc8d3ffbbcbc2f112b568, but not tested yet.

It seems to have helped: this particular error disappeared. I wanted to do something similar myself. However, at startup, the GPU load comes in spikes (short bursts of high load followed by 0% load). And after the first log line with metrics, it prints “Killed”, without a stack trace, and the process is terminated.

ah... also the speed looks very suspicious, 0.2 samples/s on the GPU vs. 0.9-1 samples/s on the CPU....

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1724745862.160106    8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.313694    8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.313800    8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.317874    8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.317956    8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.318031    8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.468660    8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.468864    8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1724745862.468976    8833 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Using Adam optimizer ... 
Loading tags ...
Creating model (resnet_custom_v2) ...
Model : (None, 299, 299, 3) -> (None, 18970)
Using loss : binary_crossentropy
Loading database ... 
No checkpoint. Starting new training ... (2024-08-27 11:04:26.575391)
Shuffling samples (epoch 0) ...
Trying to change learning rate to 0.001 ...
Learning rate is changed to <KerasVariable shape=(), dtype=float32, path=adam/learning_rate> ...
WARNING:tensorflow:From /home/iladrien/.local/lib/python3.10/site-packages/deepdanbooru/data/dataset_wrapper.py:31: ignore_errors (from tensorflow.python.data.experimental.ops.error_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.ignore_errors` instead.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1724745945.999515    8918 service.cc:146] XLA service 0x7fa04c004080 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1724745945.999646    8918 service.cc:154]   StreamExecutor device (0): NVIDIA GeForce RTX 3070 Laptop GPU, Compute Capability 8.6
I0000 00:00:1724746013.980505    8918 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
WARNING:tensorflow:5 out of the last 5 calls to <function TensorFlowTrainer.make_train_function.<locals>.one_step_on_iterator at 0x7fa0a4364040> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
WARNING:tensorflow:6 out of the last 6 calls to <function TensorFlowTrainer.make_train_function.<locals>.one_step_on_iterator at 0x7fa0a4364040> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
Epoch[0] Loss=0.390290, P=0.001383, R=0.084507, F1=0.002721, Speed = 0.2 samples/s, 0.29 %, ETA = 2024-08-28 12:00:14
Killed

iladrien-hub commented 2 months ago

> to fix your trouble try download this fix, i see it in another issue

Bro, this is very much a scam... I'm not going to install this without at least a link to an issue that explains the problem this thing solves and preferably the way it does it.

KichangKim commented 2 months ago

I reported and deleted spam comments.

iladrien-hub commented 2 months ago

Thank you)

regarding the issue... i had an idea to try to run this whole thing in VirtualBox... i don't know if it's a good idea, but in theory i can try to install any version of cuda there...

KichangKim commented 2 months ago

There are many compatibility issues between DeepDanbooru and the latest TensorFlow. I will fix them in the near future, but there is no ETA.

KichangKim commented 2 months ago

Fixed by 98d9315ab702e177de706825f9cee981c9afb928. If the app crashes silently, it may be a GPU-related TensorFlow issue.

iladrien-hub commented 2 months ago

In case anyone else encounters the silent “Killed” problem... In my case, the cause was OOM (out of memory): I added a little more RAM and everything started working.
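For anyone who wants to check this before starting a long run: on Linux, a bare “Killed” with no Python traceback is typical of the kernel OOM killer, and the kernel log usually records such kills. The sketch below is a minimal, Linux-only illustration that reads /proc/meminfo and compares available memory against a threshold; parse_meminfo, has_headroom, read_meminfo and the required_kb value are hypothetical helpers for illustration, not part of DeepDanbooru or TensorFlow.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def has_headroom(meminfo, required_kb):
    """Return True if MemAvailable plus free swap covers required_kb."""
    available = meminfo.get("MemAvailable", 0) + meminfo.get("SwapFree", 0)
    return available >= required_kb


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live memory statistics (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


if __name__ == "__main__":
    # required_kb is an assumed figure; pick one that matches your dataset and model.
    required_kb = 8 * 1024 * 1024
    info = read_meminfo()
    if not has_headroom(info, required_kb):
        print("Warning: low memory, the process may be killed by the OOM killer.")
```

This only gives a rough hint, since memory use grows during training; watching the process with a system monitor while it runs is more reliable.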