KichangKim / DeepDanbooru

AI-based multi-label girl image classification system, implemented using TensorFlow.
MIT License
2.65k stars 260 forks

Help reading output #42

Closed: da3dsoul closed this issue 3 years ago

da3dsoul commented 3 years ago
2021-09-05 11:47:16.558264: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-05 11:47:16.558398: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libamdhip64.so
2021-09-05 11:47:16.764310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMDGPU Arch: gfx803
coreClock: 1.286GHz coreCount: 32 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 104.31GiB/s
2021-09-05 11:47:16.816826: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
2021-09-05 11:47:16.845041: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libMIOpen.so
2021-09-05 11:47:17.046085: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libhipfft.so
2021-09-05 11:47:17.049131: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocrand.so
2021-09-05 11:47:17.049283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-09-05 11:47:17.049576: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-05 11:47:17.050033: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-05 11:47:17.050148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMDGPU Arch: gfx803
coreClock: 1.286GHz coreCount: 32 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 104.31GiB/s
2021-09-05 11:47:17.050181: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
2021-09-05 11:47:17.050206: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libMIOpen.so
2021-09-05 11:47:17.050230: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libhipfft.so
2021-09-05 11:47:17.050253: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocrand.so
2021-09-05 11:47:17.050379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-09-05 11:47:17.050739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-05 11:47:17.050750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-09-05 11:47:17.050755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-09-05 11:47:17.050913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7700 MB memory) -> physical GPU (device: 0, name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], pci bus id: 0000:09:00.0)
WARNING:tensorflow:No training configuration found in the save file, so the model was *not* compiled. Compile it manually.
Tags of /media/da3dsoul/Golias/Media/Pictures/Public/92400803_p4.jpg:
2021-09-05 11:47:20.534415: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-09-05 11:47:20.554107: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3792885000 Hz
2021-09-05 11:47:22.564449: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libMIOpen.so
2021-09-05 11:47:24.485907: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so

Can you help me decipher the meaning of some of this? I can't tell if it actually worked and just didn't have enough training to give tags for the image, or if something got in the way. The things I would expect to be a problem, if anything, are the model not being compiled or the MLIR optimizations not being enabled.

I used

deepdanbooru evaluate "/media/da3dsoul/Golias/Media/Pictures/Public/92400803_p4.jpg" --project-path /media/da3dsoul/Golias/DeepDanbooru/unbooru_model/ --allow-gpu --compile
KichangKim commented 3 years ago

You can safely ignore TensorFlow log messages that start with "I" (info). You can also disable these logs by setting the environment variable "TF_CPP_MIN_LOG_LEVEL" to "2".

So without the TensorFlow log, your output prints only "Tags of /media/da3dsoul/Golias/Media/Pictures/Public/92400803_p4.jpg:" and an empty line. That means DeepDanbooru found no estimated tags with a score larger than 0.5. Try a lower score threshold, like --threshold 0.2.
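Conceptually, the threshold just gates which per-tag scores get printed. For illustration only (this is not DeepDanbooru's actual code, and the scores below are made up):

scores = {"1girl": 0.41, "long_hair": 0.33, "smile": 0.12}  # hypothetical per-tag scores

# Default threshold: nothing clears 0.5, so nothing is printed.
print([tag for tag, score in scores.items() if score >= 0.5])   # []

# Lower threshold: weaker but plausible tags start to show up.
print([tag for tag, score in scores.items() if score >= 0.2])   # ['1girl', 'long_hair']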

da3dsoul commented 3 years ago

Thanks much. That helps.

da3dsoul commented 3 years ago

Would it be expected to have results (even if they aren't very accurate) after only 2 epochs on the entire danbooru2020 dataset? I just finished another epoch, and a picture which I think has some pretty clear features is giving no results. Even with a threshold of 0.1, it gives nothing. Here's my test image; I've tried several others, but this one seems like a good test:
HonkaiBanner

KichangKim commented 3 years ago

What values are printed in your training log? It contains loss, precision, recall, and F1 score. Also, I didn't test the danbooru2020 dataset; I used images downloaded directly from the danbooru server.

da3dsoul commented 3 years ago

There are days of logs; I'll post a snippet. I used DanbooruDownloader as linked. I didn't realize that wasn't danbooru2020.

Epoch[1] Loss=0.010009, P=0.662313, R=0.086312, F1=0.152721, Speed = 7.8 samples/s, 99.80 %, ETA = 2021-09-12 16:55:48
Epoch[1] Loss=0.009176, P=0.673432, R=0.094707, F1=0.166060, Speed = 9.0 samples/s, 99.80 %, ETA = 2021-09-12 16:51:03
Epoch[1] Loss=0.009122, P=0.675325, R=0.087626, F1=0.155125, Speed = 9.0 samples/s, 99.80 %, ETA = 2021-09-12 16:51:06
Epoch[1] Loss=0.008856, P=0.656134, R=0.092095, F1=0.161519, Speed = 8.9 samples/s, 99.80 %, ETA = 2021-09-12 16:51:26
Epoch[1] Loss=0.009291, P=0.643657, R=0.090885, F1=0.159280, Speed = 9.0 samples/s, 99.80 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.009134, P=0.694757, R=0.095153, F1=0.167381, Speed = 9.1 samples/s, 99.81 %, ETA = 2021-09-12 16:50:57
Epoch[1] Loss=0.009488, P=0.679105, R=0.090864, F1=0.160282, Speed = 9.1 samples/s, 99.81 %, ETA = 2021-09-12 16:50:57
Epoch[1] Loss=0.008713, P=0.657303, R=0.092710, F1=0.162500, Speed = 9.0 samples/s, 99.81 %, ETA = 2021-09-12 16:51:02
Epoch[1] Loss=0.009391, P=0.652574, R=0.088928, F1=0.156526, Speed = 9.1 samples/s, 99.81 %, ETA = 2021-09-12 16:50:57
Epoch[1] Loss=0.009752, P=0.643527, R=0.087389, F1=0.153881, Speed = 9.0 samples/s, 99.81 %, ETA = 2021-09-12 16:51:05
Epoch[1] Loss=0.009485, P=0.685981, R=0.090416, F1=0.159774, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:01
Epoch[1] Loss=0.009798, P=0.614815, R=0.084136, F1=0.148016, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:06
Epoch[1] Loss=0.009160, P=0.647280, R=0.089961, F1=0.157967, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:03
Epoch[1] Loss=0.009350, P=0.661080, R=0.091993, F1=0.161510, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:08
Epoch[1] Loss=0.009661, P=0.651119, R=0.080470, F1=0.143238, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:03
Epoch[1] Loss=0.009659, P=0.661080, R=0.085687, F1=0.151709, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:02
Epoch[1] Loss=0.010209, P=0.707721, R=0.091167, F1=0.161527, Speed = 9.0 samples/s, 99.83 %, ETA = 2021-09-12 16:51:06
Epoch[1] Loss=0.008954, P=0.692022, R=0.097414, F1=0.170788, Speed = 9.0 samples/s, 99.83 %, ETA = 2021-09-12 16:51:08
Epoch[1] Loss=0.009082, P=0.695733, R=0.096575, F1=0.169607, Speed = 9.1 samples/s, 99.83 %, ETA = 2021-09-12 16:50:59
Epoch[1] Loss=0.010610, P=0.623616, R=0.077116, F1=0.137259, Speed = 9.0 samples/s, 99.83 %, ETA = 2021-09-12 16:51:05
Saving checkpoint ... (2021-09-12 16:26:49.032479)
Epoch[1] Loss=0.009796, P=0.668519, R=0.082703, F1=0.147197, Speed = 8.0 samples/s, 99.83 %, ETA = 2021-09-12 16:54:12
Epoch[1] Loss=0.009423, P=0.675373, R=0.087145, F1=0.154371, Speed = 9.0 samples/s, 99.84 %, ETA = 2021-09-12 16:51:05
Epoch[1] Loss=0.009166, P=0.672192, R=0.095275, F1=0.166895, Speed = 9.1 samples/s, 99.84 %, ETA = 2021-09-12 16:50:58
Epoch[1] Loss=0.010024, P=0.691450, R=0.093420, F1=0.164602, Speed = 9.0 samples/s, 99.84 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.009352, P=0.655493, R=0.088198, F1=0.155477, Speed = 8.9 samples/s, 99.84 %, ETA = 2021-09-12 16:51:22
Epoch[1] Loss=0.009373, P=0.652416, R=0.087597, F1=0.154455, Speed = 8.7 samples/s, 99.84 %, ETA = 2021-09-12 16:51:52
Epoch[1] Loss=0.009812, P=0.656075, R=0.081156, F1=0.144444, Speed = 9.0 samples/s, 99.84 %, ETA = 2021-09-12 16:51:13
Epoch[1] Loss=0.010217, P=0.689720, R=0.081927, F1=0.146458, Speed = 9.1 samples/s, 99.85 %, ETA = 2021-09-12 16:50:57
Epoch[1] Loss=0.009738, P=0.656716, R=0.082824, F1=0.147096, Speed = 9.0 samples/s, 99.85 %, ETA = 2021-09-12 16:51:03
Epoch[1] Loss=0.009829, P=0.649446, R=0.083929, F1=0.148649, Speed = 9.0 samples/s, 99.85 %, ETA = 2021-09-12 16:51:05
Epoch[1] Loss=0.008472, P=0.649718, R=0.089124, F1=0.156747, Speed = 9.1 samples/s, 99.85 %, ETA = 2021-09-12 16:51:00
Epoch[1] Loss=0.009870, P=0.622468, R=0.081979, F1=0.144878, Speed = 9.0 samples/s, 99.85 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.009430, P=0.687616, R=0.093233, F1=0.164202, Speed = 9.1 samples/s, 99.85 %, ETA = 2021-09-12 16:51:01
Epoch[1] Loss=0.009801, P=0.685083, R=0.091176, F1=0.160934, Speed = 9.1 samples/s, 99.86 %, ETA = 2021-09-12 16:50:59
Epoch[1] Loss=0.009719, P=0.682657, R=0.087970, F1=0.155855, Speed = 9.0 samples/s, 99.86 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.010146, P=0.667904, R=0.082759, F1=0.147269, Speed = 9.0 samples/s, 99.86 %, ETA = 2021-09-12 16:51:07
Epoch[1] Loss=0.009947, P=0.702206, R=0.087735, F1=0.155982, Speed = 9.1 samples/s, 99.86 %, ETA = 2021-09-12 16:51:01
Epoch[1] Loss=0.010229, P=0.647601, R=0.087075, F1=0.153510, Speed = 9.0 samples/s, 99.86 %, ETA = 2021-09-12 16:51:06
Epoch[1] Loss=0.009836, P=0.681985, R=0.088019, F1=0.155915, Speed = 9.0 samples/s, 99.87 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.009276, P=0.678373, R=0.093814, F1=0.164833, Speed = 9.0 samples/s, 99.87 %, ETA = 2021-09-12 16:51:05
Saving checkpoint ... (2021-09-12 16:32:01.515339)
Epoch[1] Loss=0.008934, P=0.671587, R=0.093142, F1=0.163596, Speed = 8.1 samples/s, 99.87 %, ETA = 2021-09-12 16:53:16
Epoch[1] Loss=0.009416, P=0.688192, R=0.090184, F1=0.159470, Speed = 8.9 samples/s, 99.87 %, ETA = 2021-09-12 16:51:20
Epoch[1] Loss=0.009463, P=0.677122, R=0.090416, F1=0.159531, Speed = 8.7 samples/s, 99.87 %, ETA = 2021-09-12 16:51:54
Epoch[1] Loss=0.009459, P=0.701107, R=0.093550, F1=0.165074, Speed = 8.2 samples/s, 99.87 %, ETA = 2021-09-12 16:53:01
Epoch[1] Loss=0.009260, P=0.659259, R=0.090378, F1=0.158964, Speed = 8.5 samples/s, 99.88 %, ETA = 2021-09-12 16:52:15
Epoch[1] Loss=0.008939, P=0.687732, R=0.094847, F1=0.166704, Speed = 7.0 samples/s, 99.88 %, ETA = 2021-09-12 16:56:23
Epoch[1] Loss=0.009165, P=0.692737, R=0.090247, F1=0.159691, Speed = 7.8 samples/s, 99.88 %, ETA = 2021-09-12 16:54:00
Epoch[1] Loss=0.009825, P=0.625461, R=0.082422, F1=0.145650, Speed = 7.2 samples/s, 99.88 %, ETA = 2021-09-12 16:55:43
Epoch[1] Loss=0.009284, P=0.643911, R=0.086708, F1=0.152836, Speed = 8.7 samples/s, 99.88 %, ETA = 2021-09-12 16:52:01
Epoch[1] Loss=0.009580, P=0.662983, R=0.089330, F1=0.157446, Speed = 7.8 samples/s, 99.88 %, ETA = 2021-09-12 16:54:04
Epoch[1] Loss=0.009263, P=0.664815, R=0.085415, F1=0.151381, Speed = 8.5 samples/s, 99.89 %, ETA = 2021-09-12 16:52:28
Epoch[1] Loss=0.010090, P=0.659889, R=0.086714, F1=0.153285, Speed = 9.0 samples/s, 99.89 %, ETA = 2021-09-12 16:51:30
Epoch[1] Loss=0.009337, P=0.650558, R=0.091455, F1=0.160367, Speed = 8.0 samples/s, 99.89 %, ETA = 2021-09-12 16:53:31
Epoch[1] Loss=0.009531, P=0.656827, R=0.088712, F1=0.156312, Speed = 5.8 samples/s, 99.89 %, ETA = 2021-09-12 17:00:11
Epoch[1] Loss=0.009358, P=0.661765, R=0.087400, F1=0.154407, Speed = 8.7 samples/s, 99.89 %, ETA = 2021-09-12 16:52:12
Epoch[1] Loss=0.010049, P=0.666048, R=0.086298, F1=0.152798, Speed = 8.8 samples/s, 99.90 %, ETA = 2021-09-12 16:52:02
Epoch[1] Loss=0.009608, P=0.662338, R=0.090471, F1=0.159197, Speed = 9.1 samples/s, 99.90 %, ETA = 2021-09-12 16:51:30
Epoch[1] Loss=0.009822, P=0.631481, R=0.083456, F1=0.147428, Speed = 9.0 samples/s, 99.90 %, ETA = 2021-09-12 16:51:40
Epoch[1] Loss=0.008857, P=0.662313, R=0.086207, F1=0.152557, Speed = 9.0 samples/s, 99.90 %, ETA = 2021-09-12 16:51:36
Epoch[1] Loss=0.009418, P=0.682657, R=0.090024, F1=0.159071, Speed = 9.0 samples/s, 99.90 %, ETA = 2021-09-12 16:51:38
Saving checkpoint ... (2021-09-12 16:37:43.558713)
Epoch[1] Loss=0.009293, P=0.695167, R=0.095652, F1=0.168165, Speed = 7.6 samples/s, 99.90 %, ETA = 2021-09-12 16:54:20
Epoch[1] Loss=0.009694, P=0.664815, R=0.087284, F1=0.154309, Speed = 9.1 samples/s, 99.91 %, ETA = 2021-09-12 16:51:34
Epoch[1] Loss=0.009461, P=0.669131, R=0.090095, F1=0.158807, Speed = 9.0 samples/s, 99.91 %, ETA = 2021-09-12 16:51:43
Epoch[1] Loss=0.009440, P=0.674766, R=0.087770, F1=0.155336, Speed = 9.1 samples/s, 99.91 %, ETA = 2021-09-12 16:51:33
Epoch[1] Loss=0.009403, P=0.689214, R=0.087899, F1=0.155914, Speed = 9.0 samples/s, 99.91 %, ETA = 2021-09-12 16:51:43
Epoch[1] Loss=0.010117, P=0.666667, R=0.084034, F1=0.149254, Speed = 9.0 samples/s, 99.91 %, ETA = 2021-09-12 16:51:40
Epoch[1] Loss=0.009759, P=0.688192, R=0.091109, F1=0.160915, Speed = 9.0 samples/s, 99.92 %, ETA = 2021-09-12 16:51:42
Epoch[1] Loss=0.008953, P=0.666048, R=0.091349, F1=0.160662, Speed = 9.0 samples/s, 99.92 %, ETA = 2021-09-12 16:51:40
Epoch[1] Loss=0.009864, P=0.674677, R=0.086946, F1=0.154041, Speed = 9.1 samples/s, 99.92 %, ETA = 2021-09-12 16:51:37
Epoch[1] Loss=0.009144, P=0.664815, R=0.091279, F1=0.160519, Speed = 7.7 samples/s, 99.92 %, ETA = 2021-09-12 16:53:41
Epoch[1] Loss=0.009868, P=0.647706, R=0.087032, F1=0.153445, Speed = 7.4 samples/s, 99.92 %, ETA = 2021-09-12 16:54:09
Epoch[1] Loss=0.009271, P=0.651291, R=0.089753, F1=0.157765, Speed = 9.1 samples/s, 99.92 %, ETA = 2021-09-12 16:51:42
Epoch[1] Loss=0.009352, P=0.698355, R=0.092561, F1=0.163457, Speed = 9.0 samples/s, 99.93 %, ETA = 2021-09-12 16:51:44
Epoch[1] Loss=0.009226, P=0.691450, R=0.092745, F1=0.163552, Speed = 7.7 samples/s, 99.93 %, ETA = 2021-09-12 16:53:36
Epoch[1] Loss=0.009283, P=0.637708, R=0.080758, F1=0.143362, Speed = 9.1 samples/s, 99.93 %, ETA = 2021-09-12 16:51:45
Epoch[1] Loss=0.010218, P=0.657459, R=0.081026, F1=0.144272, Speed = 9.1 samples/s, 99.93 %, ETA = 2021-09-12 16:51:45
Epoch[1] Loss=0.009675, P=0.606679, R=0.079024, F1=0.139833, Speed = 9.1 samples/s, 99.93 %, ETA = 2021-09-12 16:51:45
Epoch[1] Loss=0.009401, P=0.670956, R=0.089046, F1=0.157226, Speed = 9.0 samples/s, 99.93 %, ETA = 2021-09-12 16:51:47
Epoch[1] Loss=0.009079, P=0.667890, R=0.094349, F1=0.165342, Speed = 9.0 samples/s, 99.94 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.008731, P=0.688312, R=0.099785, F1=0.174301, Speed = 9.1 samples/s, 99.94 %, ETA = 2021-09-12 16:51:45
Saving checkpoint ... (2021-09-12 16:43:04.464378)
Epoch[1] Loss=0.009721, P=0.677064, R=0.089001, F1=0.157323, Speed = 8.1 samples/s, 99.94 %, ETA = 2021-09-12 16:52:49
Epoch[1] Loss=0.008743, P=0.701299, R=0.101340, F1=0.177091, Speed = 9.1 samples/s, 99.94 %, ETA = 2021-09-12 16:51:47
Epoch[1] Loss=0.009586, P=0.678832, R=0.091671, F1=0.161528, Speed = 9.1 samples/s, 99.94 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009641, P=0.626151, R=0.079944, F1=0.141785, Speed = 9.0 samples/s, 99.95 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.009199, P=0.696133, R=0.095575, F1=0.168075, Speed = 9.0 samples/s, 99.95 %, ETA = 2021-09-12 16:51:50
Epoch[1] Loss=0.009826, P=0.675277, R=0.089247, F1=0.157657, Speed = 9.1 samples/s, 99.95 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009301, P=0.654982, R=0.083964, F1=0.148847, Speed = 9.0 samples/s, 99.95 %, ETA = 2021-09-12 16:51:50
Epoch[1] Loss=0.009410, P=0.681481, R=0.092462, F1=0.162832, Speed = 9.1 samples/s, 99.95 %, ETA = 2021-09-12 16:51:47
Epoch[1] Loss=0.009095, P=0.716912, R=0.096368, F1=0.169898, Speed = 9.1 samples/s, 99.95 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.008949, P=0.699083, R=0.101141, F1=0.176716, Speed = 9.0 samples/s, 99.96 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.008488, P=0.637383, R=0.090764, F1=0.158900, Speed = 9.0 samples/s, 99.96 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.009671, P=0.661765, R=0.086207, F1=0.152542, Speed = 9.0 samples/s, 99.96 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.010721, P=0.645102, R=0.083393, F1=0.147694, Speed = 9.1 samples/s, 99.96 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009658, P=0.693309, R=0.087992, F1=0.156165, Speed = 9.0 samples/s, 99.96 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.009730, P=0.640221, R=0.079991, F1=0.142213, Speed = 9.1 samples/s, 99.97 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009328, P=0.670330, R=0.095040, F1=0.166477, Speed = 9.0 samples/s, 99.97 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.008679, P=0.646409, R=0.091335, F1=0.160055, Speed = 9.0 samples/s, 99.97 %, ETA = 2021-09-12 16:51:50
Epoch[1] Loss=0.009221, P=0.648799, R=0.085194, F1=0.150611, Speed = 9.1 samples/s, 99.97 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009234, P=0.657944, R=0.092050, F1=0.161505, Speed = 9.1 samples/s, 99.97 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009340, P=0.665441, R=0.088617, F1=0.156405, Speed = 9.0 samples/s, 99.97 %, ETA = 2021-09-12 16:51:49
Saving checkpoint ... (2021-09-12 16:48:15.916625)
Epoch[1] Loss=0.009778, P=0.633333, R=0.086846, F1=0.152747, Speed = 8.0 samples/s, 99.98 %, ETA = 2021-09-12 16:52:17
Epoch[1] Loss=0.011003, P=0.706100, R=0.093376, F1=0.164940, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009351, P=0.704797, R=0.095357, F1=0.167986, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.008898, P=0.677064, R=0.096094, F1=0.168301, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009003, P=0.623400, R=0.086024, F1=0.151186, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.008821, P=0.661142, R=0.089616, F1=0.157837, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009215, P=0.654917, R=0.087182, F1=0.153880, Speed = 9.1 samples/s, 99.99 %, ETA = 2021-09-12 16:51:50
Epoch[1] Loss=0.009103, P=0.690037, R=0.092028, F1=0.162397, Speed = 9.0 samples/s, 99.99 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009562, P=0.634686, R=0.081285, F1=0.144114, Speed = 9.0 samples/s, 99.99 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.008967, P=0.682657, R=0.095140, F1=0.167005, Speed = 9.0 samples/s, 99.99 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009289, P=0.677122, R=0.091498, F1=0.161212, Speed = 9.0 samples/s, 99.99 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009526, P=0.695167, R=0.089990, F1=0.159352, Speed = 9.0 samples/s, 100.00 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009428, P=0.683150, R=0.090162, F1=0.159300, Speed = 9.1 samples/s, 100.00 %, ETA = 2021-09-12 16:51:51

I don't know what any of these mean, of course. Machine learning is still a learning process badum tss

KichangKim commented 3 years ago

It seems that the R value is too small. I think that your training failed due to some problem (overfitting, GPU calculation error, and so on).

I think that your training log contains some point where the R value suddenly decreased. If you find that point, you should re-train from the nearest checkpoint. So I recommend periodically backing up checkpoints.
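If you have the console output saved to a file, a quick scan like this can help find that point (just a sketch: "training.log" is a hypothetical filename, and the 50% drop rule is arbitrary):

import re

# Pull the R (recall) column out of each "Epoch[...] Loss=..., P=..., R=..." line.
r_values = []
with open("training.log") as f:
    for line in f:
        match = re.search(r"R=([0-9.]+)", line)
        if match:
            r_values.append(float(match.group(1)))

for i in range(1, len(r_values)):
    # Flag any step where R falls to less than half of its previous value.
    if r_values[i] < 0.5 * r_values[i - 1]:
        print(f"Possible sudden drop at step {i}: {r_values[i - 1]:.4f} -> {r_values[i]:.4f}")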

da3dsoul commented 3 years ago

I did have several system crashes from OOM before. That was probably the cause, huh. Is it recoverable somehow, or do I need to just start over? The checkpoint file is 3GB and I don't see any logs (been going for weeks).

EDIT: I bought more RAM, so that won't happen again

KichangKim commented 3 years ago

I recommend training from the start and carefully monitoring the R value in the log. Also, back up the checkpoints folder every day, and if the R value suddenly decreases, cancel training, restore the checkpoints folder, and start again.
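To automate the daily backup, something like this works (a sketch: the checkpoint subfolder name and the backup destination below are assumptions; adjust both to your layout):

import shutil
from datetime import datetime
from pathlib import Path

project = Path("/media/da3dsoul/Golias/DeepDanbooru/unbooru_model")
checkpoints = project / "checkpoints"  # assumed location; point this at wherever your checkpoints actually are
backup_root = Path("/media/da3dsoul/Golias/DeepDanbooru/checkpoint_backups")  # hypothetical destination

# Copy the checkpoints folder into a timestamped backup directory.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
shutil.copytree(checkpoints, backup_root / stamp)
print(f"Backed up {checkpoints} to {backup_root / stamp}")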

da3dsoul commented 3 years ago

OK. Are there logs? I only have console output, and redirecting ($ program > output.log) isn't working.

EDIT: On restart, I've got values like so; are these good?

Epoch[0] Loss=0.739733, P=0.001940, R=0.501037, F1=0.003865, Speed = 4.6 samples/s, 0.00 %, ETA = 2021-09-23 05:53:49
KichangKim commented 3 years ago

Are there logs?

Currently, DD only has console output logging.

I've got values like so, are these good?

Yes, those starting values look fine.

da3dsoul commented 3 years ago

Ok. I'll let it run. Thanks for all your help

da3dsoul commented 3 years ago

It didn't even last 12 hours. I'm not sure how long it lasted. I'll try running it CPU only and see if it at least maintains an R value. In the worst case, I can run it over network on a machine with an RTX3070, as running CPU only takes like 8x longer, and that's on the scale of months here. Is the model portable, or does it rely on absolute paths? I can buy a new GPU for the server it's running on, but that'll take time with the current market.

KichangKim commented 3 years ago

The model is portable. It uses relative paths, and you can use different hardware for training <-> evaluating.

da3dsoul commented 3 years ago

Perfect, thanks.

da3dsoul commented 3 years ago

log.txt

You said a sudden drop is bad, but what about a gradual one?

I found this and have been reading: https://neptune.ai/blog/keras-metrics. Correct me if I'm wrong, please. R is recall; the value returned is computed something like this:

from tensorflow.keras import backend as K

def recall(y_true, y_pred):
    # Of all tags actually present, how many were predicted present (score rounded to 1).
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))

    recall = true_positives / (all_positives + K.epsilon())
    return recall
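The P and F1 columns would then be precision and F1 in the same style, if I understand it right (again just a sketch, not necessarily DeepDanbooru's own metric code):

def precision(y_true, y_pred):
    # Of everything predicted positive, how much was actually positive.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1(y_true, y_pred):
    # Harmonic mean of precision and the recall() defined above.
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * p * r / (p + r + K.epsilon())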
KichangKim commented 3 years ago

A gradual one is fine. Initially the network is initialized with random values, so the R value is quite high; then it decreases to some point and gradually increases again (along with the P value). Once the R value is increasing, it should not drop suddenly.

da3dsoul commented 3 years ago

Ok, I'll keep an eye on it

da3dsoul commented 3 years ago
Epoch[0] Loss=0.273658, P=0.084761, R=0.094115, F1=0.089193, Speed = 47.9 samples/s, 4.02 %, ETA = 2021-09-15 11:14:22
Epoch[0] Loss=0.276624, P=0.067192, R=0.101965, F1=0.081004, Speed = 46.4 samples/s, 4.02 %, ETA = 2021-09-15 11:58:56
Epoch[0] Loss=0.273647, P=0.078138, R=0.089073, F1=0.083248, Speed = 46.9 samples/s, 4.02 %, ETA = 2021-09-15 11:41:45

Should've used the 3070 from the start. That's like 5x faster than the RX570....

da3dsoul commented 3 years ago

One random thing before I wait again, after tweaking perf for the 3070: is this okay? The None in the Model line instinctively makes me worry.

Using SGD optimizer ...
Loading tags ...
Creating model (resnet_custom_v4) ...
Model : (None, 299, 299, 3) -> (None, 14176)
Loading database ...
KichangKim commented 3 years ago

None is okay :) Don't worry.
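In Keras, None in a shape printout is just the variable batch dimension, not a missing value. A tiny standalone example (not the actual resnet_custom_v4 code; the layers here are made up) produces the same (None, 299, 299, 3) -> (None, 14176) pattern:

import tensorflow as tf

inputs = tf.keras.Input(shape=(299, 299, 3))           # batch size left unspecified
x = tf.keras.layers.Conv2D(8, 3)(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(14176, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

print(model.input_shape)   # (None, 299, 299, 3)
print(model.output_shape)  # (None, 14176)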

da3dsoul commented 3 years ago

Ok thanks.

da3dsoul commented 3 years ago

First epoch yielded results! They weren't perfect results, but I wouldn't expect them to be yet. I have it queued for another 9 epochs. Thank you for all your help so far.

da3dsoul commented 3 years ago

I'm all of a sudden getting huge performance hits, and I have no idea why. CUDA (or any other GPU graph) is not even utilized, let alone bottlenecked. Do you have any ideas?

CCzGv2OGTR
KichangKim commented 3 years ago

I've never seen anything like this. But it seems like hardware trouble (or throttling?). I recommend checking the GPU temperature and cooling fan status.

da3dsoul commented 3 years ago

It's 46C, so maybe? That was my first guess. 46 is cold for a GPU at load, but it has been running for weeks, so idk. I'll keep looking at it. Thanks.

EDIT: a full restart fixed it