Closed da3dsoul closed 3 years ago
You can safely ignore TensorFlow log lines that start with "I" (info). You can also disable these messages by setting the environment variable "TF_CPP_MIN_LOG_LEVEL" to "2".
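If the variable is easier to set from Python than from the shell, a minimal sketch looks like this. It assumes the assignment runs before TensorFlow is imported anywhere in the process; the import itself is left commented out so the snippet stands alone.

```python
import os

# Must be set before TensorFlow is imported; the native logger reads it on load.
# "0" = show everything, "1" = hide INFO, "2" = hide INFO and WARNING,
# "3" = hide ERROR as well.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

# import tensorflow as tf  # import only after the variable is set
```

Setting it in the shell that launches the program has the same effect.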
So without the TensorFlow log lines, your log prints only "Tags of /media/da3dsoul/Golias/Media/Pictures/Public/92400803_p4.jpg:" and an empty line. That means DeepDanbooru found no estimated tags (tags with a score above 0.5). Try a lower score threshold, like --threshold 0.2.
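To illustrate why an empty tag list does not necessarily mean the model produced nothing, here is a rough sketch of threshold filtering. The function name, the scores, and the choice of `>=` for the comparison are all assumptions made for illustration; they are not taken from DeepDanbooru's code.

```python
def filter_tags(scores, threshold=0.5):
    """Return (tag, score) pairs with score >= threshold, highest score first."""
    kept = [(tag, score) for tag, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for one image from an under-trained model.
example_scores = {"1girl": 0.41, "solo": 0.28, "long_hair": 0.07}

print(filter_tags(example_scores, 0.5))  # nothing passes the default threshold
print(filter_tags(example_scores, 0.2))  # lowering the threshold reveals two tags
```

If nothing shows up even at a very low threshold, the scores themselves are close to zero, which points at training rather than at the threshold.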
Thanks much. That helps.
Would it be expected to have results (even if they aren't very accurate) after only 2 epochs on the entire danbooru2020 dataset? I just finished another epoch, and a picture which I think has some pretty clear features is giving no results. Even with a threshold of 0.1, it gives nothing. Here's my test image. I've tried several others, but it seems good
What values are printed in your training log? It contains loss, precision, recall and F1 score. Also, I haven't tested the danbooru2020 dataset. I used images downloaded directly from the danbooru server.
There's days of logs. I'll post a snippet. I used DanbooruDownloader as linked. I didn't realize that wasn't danbooru2020.
Epoch[1] Loss=0.010009, P=0.662313, R=0.086312, F1=0.152721, Speed = 7.8 samples/s, 99.80 %, ETA = 2021-09-12 16:55:48
Epoch[1] Loss=0.009176, P=0.673432, R=0.094707, F1=0.166060, Speed = 9.0 samples/s, 99.80 %, ETA = 2021-09-12 16:51:03
Epoch[1] Loss=0.009122, P=0.675325, R=0.087626, F1=0.155125, Speed = 9.0 samples/s, 99.80 %, ETA = 2021-09-12 16:51:06
Epoch[1] Loss=0.008856, P=0.656134, R=0.092095, F1=0.161519, Speed = 8.9 samples/s, 99.80 %, ETA = 2021-09-12 16:51:26
Epoch[1] Loss=0.009291, P=0.643657, R=0.090885, F1=0.159280, Speed = 9.0 samples/s, 99.80 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.009134, P=0.694757, R=0.095153, F1=0.167381, Speed = 9.1 samples/s, 99.81 %, ETA = 2021-09-12 16:50:57
Epoch[1] Loss=0.009488, P=0.679105, R=0.090864, F1=0.160282, Speed = 9.1 samples/s, 99.81 %, ETA = 2021-09-12 16:50:57
Epoch[1] Loss=0.008713, P=0.657303, R=0.092710, F1=0.162500, Speed = 9.0 samples/s, 99.81 %, ETA = 2021-09-12 16:51:02
Epoch[1] Loss=0.009391, P=0.652574, R=0.088928, F1=0.156526, Speed = 9.1 samples/s, 99.81 %, ETA = 2021-09-12 16:50:57
Epoch[1] Loss=0.009752, P=0.643527, R=0.087389, F1=0.153881, Speed = 9.0 samples/s, 99.81 %, ETA = 2021-09-12 16:51:05
Epoch[1] Loss=0.009485, P=0.685981, R=0.090416, F1=0.159774, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:01
Epoch[1] Loss=0.009798, P=0.614815, R=0.084136, F1=0.148016, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:06
Epoch[1] Loss=0.009160, P=0.647280, R=0.089961, F1=0.157967, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:03
Epoch[1] Loss=0.009350, P=0.661080, R=0.091993, F1=0.161510, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:08
Epoch[1] Loss=0.009661, P=0.651119, R=0.080470, F1=0.143238, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:03
Epoch[1] Loss=0.009659, P=0.661080, R=0.085687, F1=0.151709, Speed = 9.0 samples/s, 99.82 %, ETA = 2021-09-12 16:51:02
Epoch[1] Loss=0.010209, P=0.707721, R=0.091167, F1=0.161527, Speed = 9.0 samples/s, 99.83 %, ETA = 2021-09-12 16:51:06
Epoch[1] Loss=0.008954, P=0.692022, R=0.097414, F1=0.170788, Speed = 9.0 samples/s, 99.83 %, ETA = 2021-09-12 16:51:08
Epoch[1] Loss=0.009082, P=0.695733, R=0.096575, F1=0.169607, Speed = 9.1 samples/s, 99.83 %, ETA = 2021-09-12 16:50:59
Epoch[1] Loss=0.010610, P=0.623616, R=0.077116, F1=0.137259, Speed = 9.0 samples/s, 99.83 %, ETA = 2021-09-12 16:51:05
Saving checkpoint ... (2021-09-12 16:26:49.032479)
Epoch[1] Loss=0.009796, P=0.668519, R=0.082703, F1=0.147197, Speed = 8.0 samples/s, 99.83 %, ETA = 2021-09-12 16:54:12
Epoch[1] Loss=0.009423, P=0.675373, R=0.087145, F1=0.154371, Speed = 9.0 samples/s, 99.84 %, ETA = 2021-09-12 16:51:05
Epoch[1] Loss=0.009166, P=0.672192, R=0.095275, F1=0.166895, Speed = 9.1 samples/s, 99.84 %, ETA = 2021-09-12 16:50:58
Epoch[1] Loss=0.010024, P=0.691450, R=0.093420, F1=0.164602, Speed = 9.0 samples/s, 99.84 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.009352, P=0.655493, R=0.088198, F1=0.155477, Speed = 8.9 samples/s, 99.84 %, ETA = 2021-09-12 16:51:22
Epoch[1] Loss=0.009373, P=0.652416, R=0.087597, F1=0.154455, Speed = 8.7 samples/s, 99.84 %, ETA = 2021-09-12 16:51:52
Epoch[1] Loss=0.009812, P=0.656075, R=0.081156, F1=0.144444, Speed = 9.0 samples/s, 99.84 %, ETA = 2021-09-12 16:51:13
Epoch[1] Loss=0.010217, P=0.689720, R=0.081927, F1=0.146458, Speed = 9.1 samples/s, 99.85 %, ETA = 2021-09-12 16:50:57
Epoch[1] Loss=0.009738, P=0.656716, R=0.082824, F1=0.147096, Speed = 9.0 samples/s, 99.85 %, ETA = 2021-09-12 16:51:03
Epoch[1] Loss=0.009829, P=0.649446, R=0.083929, F1=0.148649, Speed = 9.0 samples/s, 99.85 %, ETA = 2021-09-12 16:51:05
Epoch[1] Loss=0.008472, P=0.649718, R=0.089124, F1=0.156747, Speed = 9.1 samples/s, 99.85 %, ETA = 2021-09-12 16:51:00
Epoch[1] Loss=0.009870, P=0.622468, R=0.081979, F1=0.144878, Speed = 9.0 samples/s, 99.85 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.009430, P=0.687616, R=0.093233, F1=0.164202, Speed = 9.1 samples/s, 99.85 %, ETA = 2021-09-12 16:51:01
Epoch[1] Loss=0.009801, P=0.685083, R=0.091176, F1=0.160934, Speed = 9.1 samples/s, 99.86 %, ETA = 2021-09-12 16:50:59
Epoch[1] Loss=0.009719, P=0.682657, R=0.087970, F1=0.155855, Speed = 9.0 samples/s, 99.86 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.010146, P=0.667904, R=0.082759, F1=0.147269, Speed = 9.0 samples/s, 99.86 %, ETA = 2021-09-12 16:51:07
Epoch[1] Loss=0.009947, P=0.702206, R=0.087735, F1=0.155982, Speed = 9.1 samples/s, 99.86 %, ETA = 2021-09-12 16:51:01
Epoch[1] Loss=0.010229, P=0.647601, R=0.087075, F1=0.153510, Speed = 9.0 samples/s, 99.86 %, ETA = 2021-09-12 16:51:06
Epoch[1] Loss=0.009836, P=0.681985, R=0.088019, F1=0.155915, Speed = 9.0 samples/s, 99.87 %, ETA = 2021-09-12 16:51:04
Epoch[1] Loss=0.009276, P=0.678373, R=0.093814, F1=0.164833, Speed = 9.0 samples/s, 99.87 %, ETA = 2021-09-12 16:51:05
Saving checkpoint ... (2021-09-12 16:32:01.515339)
Epoch[1] Loss=0.008934, P=0.671587, R=0.093142, F1=0.163596, Speed = 8.1 samples/s, 99.87 %, ETA = 2021-09-12 16:53:16
Epoch[1] Loss=0.009416, P=0.688192, R=0.090184, F1=0.159470, Speed = 8.9 samples/s, 99.87 %, ETA = 2021-09-12 16:51:20
Epoch[1] Loss=0.009463, P=0.677122, R=0.090416, F1=0.159531, Speed = 8.7 samples/s, 99.87 %, ETA = 2021-09-12 16:51:54
Epoch[1] Loss=0.009459, P=0.701107, R=0.093550, F1=0.165074, Speed = 8.2 samples/s, 99.87 %, ETA = 2021-09-12 16:53:01
Epoch[1] Loss=0.009260, P=0.659259, R=0.090378, F1=0.158964, Speed = 8.5 samples/s, 99.88 %, ETA = 2021-09-12 16:52:15
Epoch[1] Loss=0.008939, P=0.687732, R=0.094847, F1=0.166704, Speed = 7.0 samples/s, 99.88 %, ETA = 2021-09-12 16:56:23
Epoch[1] Loss=0.009165, P=0.692737, R=0.090247, F1=0.159691, Speed = 7.8 samples/s, 99.88 %, ETA = 2021-09-12 16:54:00
Epoch[1] Loss=0.009825, P=0.625461, R=0.082422, F1=0.145650, Speed = 7.2 samples/s, 99.88 %, ETA = 2021-09-12 16:55:43
Epoch[1] Loss=0.009284, P=0.643911, R=0.086708, F1=0.152836, Speed = 8.7 samples/s, 99.88 %, ETA = 2021-09-12 16:52:01
Epoch[1] Loss=0.009580, P=0.662983, R=0.089330, F1=0.157446, Speed = 7.8 samples/s, 99.88 %, ETA = 2021-09-12 16:54:04
Epoch[1] Loss=0.009263, P=0.664815, R=0.085415, F1=0.151381, Speed = 8.5 samples/s, 99.89 %, ETA = 2021-09-12 16:52:28
Epoch[1] Loss=0.010090, P=0.659889, R=0.086714, F1=0.153285, Speed = 9.0 samples/s, 99.89 %, ETA = 2021-09-12 16:51:30
Epoch[1] Loss=0.009337, P=0.650558, R=0.091455, F1=0.160367, Speed = 8.0 samples/s, 99.89 %, ETA = 2021-09-12 16:53:31
Epoch[1] Loss=0.009531, P=0.656827, R=0.088712, F1=0.156312, Speed = 5.8 samples/s, 99.89 %, ETA = 2021-09-12 17:00:11
Epoch[1] Loss=0.009358, P=0.661765, R=0.087400, F1=0.154407, Speed = 8.7 samples/s, 99.89 %, ETA = 2021-09-12 16:52:12
Epoch[1] Loss=0.010049, P=0.666048, R=0.086298, F1=0.152798, Speed = 8.8 samples/s, 99.90 %, ETA = 2021-09-12 16:52:02
Epoch[1] Loss=0.009608, P=0.662338, R=0.090471, F1=0.159197, Speed = 9.1 samples/s, 99.90 %, ETA = 2021-09-12 16:51:30
Epoch[1] Loss=0.009822, P=0.631481, R=0.083456, F1=0.147428, Speed = 9.0 samples/s, 99.90 %, ETA = 2021-09-12 16:51:40
Epoch[1] Loss=0.008857, P=0.662313, R=0.086207, F1=0.152557, Speed = 9.0 samples/s, 99.90 %, ETA = 2021-09-12 16:51:36
Epoch[1] Loss=0.009418, P=0.682657, R=0.090024, F1=0.159071, Speed = 9.0 samples/s, 99.90 %, ETA = 2021-09-12 16:51:38
Saving checkpoint ... (2021-09-12 16:37:43.558713)
Epoch[1] Loss=0.009293, P=0.695167, R=0.095652, F1=0.168165, Speed = 7.6 samples/s, 99.90 %, ETA = 2021-09-12 16:54:20
Epoch[1] Loss=0.009694, P=0.664815, R=0.087284, F1=0.154309, Speed = 9.1 samples/s, 99.91 %, ETA = 2021-09-12 16:51:34
Epoch[1] Loss=0.009461, P=0.669131, R=0.090095, F1=0.158807, Speed = 9.0 samples/s, 99.91 %, ETA = 2021-09-12 16:51:43
Epoch[1] Loss=0.009440, P=0.674766, R=0.087770, F1=0.155336, Speed = 9.1 samples/s, 99.91 %, ETA = 2021-09-12 16:51:33
Epoch[1] Loss=0.009403, P=0.689214, R=0.087899, F1=0.155914, Speed = 9.0 samples/s, 99.91 %, ETA = 2021-09-12 16:51:43
Epoch[1] Loss=0.010117, P=0.666667, R=0.084034, F1=0.149254, Speed = 9.0 samples/s, 99.91 %, ETA = 2021-09-12 16:51:40
Epoch[1] Loss=0.009759, P=0.688192, R=0.091109, F1=0.160915, Speed = 9.0 samples/s, 99.92 %, ETA = 2021-09-12 16:51:42
Epoch[1] Loss=0.008953, P=0.666048, R=0.091349, F1=0.160662, Speed = 9.0 samples/s, 99.92 %, ETA = 2021-09-12 16:51:40
Epoch[1] Loss=0.009864, P=0.674677, R=0.086946, F1=0.154041, Speed = 9.1 samples/s, 99.92 %, ETA = 2021-09-12 16:51:37
Epoch[1] Loss=0.009144, P=0.664815, R=0.091279, F1=0.160519, Speed = 7.7 samples/s, 99.92 %, ETA = 2021-09-12 16:53:41
Epoch[1] Loss=0.009868, P=0.647706, R=0.087032, F1=0.153445, Speed = 7.4 samples/s, 99.92 %, ETA = 2021-09-12 16:54:09
Epoch[1] Loss=0.009271, P=0.651291, R=0.089753, F1=0.157765, Speed = 9.1 samples/s, 99.92 %, ETA = 2021-09-12 16:51:42
Epoch[1] Loss=0.009352, P=0.698355, R=0.092561, F1=0.163457, Speed = 9.0 samples/s, 99.93 %, ETA = 2021-09-12 16:51:44
Epoch[1] Loss=0.009226, P=0.691450, R=0.092745, F1=0.163552, Speed = 7.7 samples/s, 99.93 %, ETA = 2021-09-12 16:53:36
Epoch[1] Loss=0.009283, P=0.637708, R=0.080758, F1=0.143362, Speed = 9.1 samples/s, 99.93 %, ETA = 2021-09-12 16:51:45
Epoch[1] Loss=0.010218, P=0.657459, R=0.081026, F1=0.144272, Speed = 9.1 samples/s, 99.93 %, ETA = 2021-09-12 16:51:45
Epoch[1] Loss=0.009675, P=0.606679, R=0.079024, F1=0.139833, Speed = 9.1 samples/s, 99.93 %, ETA = 2021-09-12 16:51:45
Epoch[1] Loss=0.009401, P=0.670956, R=0.089046, F1=0.157226, Speed = 9.0 samples/s, 99.93 %, ETA = 2021-09-12 16:51:47
Epoch[1] Loss=0.009079, P=0.667890, R=0.094349, F1=0.165342, Speed = 9.0 samples/s, 99.94 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.008731, P=0.688312, R=0.099785, F1=0.174301, Speed = 9.1 samples/s, 99.94 %, ETA = 2021-09-12 16:51:45
Saving checkpoint ... (2021-09-12 16:43:04.464378)
Epoch[1] Loss=0.009721, P=0.677064, R=0.089001, F1=0.157323, Speed = 8.1 samples/s, 99.94 %, ETA = 2021-09-12 16:52:49
Epoch[1] Loss=0.008743, P=0.701299, R=0.101340, F1=0.177091, Speed = 9.1 samples/s, 99.94 %, ETA = 2021-09-12 16:51:47
Epoch[1] Loss=0.009586, P=0.678832, R=0.091671, F1=0.161528, Speed = 9.1 samples/s, 99.94 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009641, P=0.626151, R=0.079944, F1=0.141785, Speed = 9.0 samples/s, 99.95 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.009199, P=0.696133, R=0.095575, F1=0.168075, Speed = 9.0 samples/s, 99.95 %, ETA = 2021-09-12 16:51:50
Epoch[1] Loss=0.009826, P=0.675277, R=0.089247, F1=0.157657, Speed = 9.1 samples/s, 99.95 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009301, P=0.654982, R=0.083964, F1=0.148847, Speed = 9.0 samples/s, 99.95 %, ETA = 2021-09-12 16:51:50
Epoch[1] Loss=0.009410, P=0.681481, R=0.092462, F1=0.162832, Speed = 9.1 samples/s, 99.95 %, ETA = 2021-09-12 16:51:47
Epoch[1] Loss=0.009095, P=0.716912, R=0.096368, F1=0.169898, Speed = 9.1 samples/s, 99.95 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.008949, P=0.699083, R=0.101141, F1=0.176716, Speed = 9.0 samples/s, 99.96 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.008488, P=0.637383, R=0.090764, F1=0.158900, Speed = 9.0 samples/s, 99.96 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.009671, P=0.661765, R=0.086207, F1=0.152542, Speed = 9.0 samples/s, 99.96 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.010721, P=0.645102, R=0.083393, F1=0.147694, Speed = 9.1 samples/s, 99.96 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009658, P=0.693309, R=0.087992, F1=0.156165, Speed = 9.0 samples/s, 99.96 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.009730, P=0.640221, R=0.079991, F1=0.142213, Speed = 9.1 samples/s, 99.97 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009328, P=0.670330, R=0.095040, F1=0.166477, Speed = 9.0 samples/s, 99.97 %, ETA = 2021-09-12 16:51:49
Epoch[1] Loss=0.008679, P=0.646409, R=0.091335, F1=0.160055, Speed = 9.0 samples/s, 99.97 %, ETA = 2021-09-12 16:51:50
Epoch[1] Loss=0.009221, P=0.648799, R=0.085194, F1=0.150611, Speed = 9.1 samples/s, 99.97 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009234, P=0.657944, R=0.092050, F1=0.161505, Speed = 9.1 samples/s, 99.97 %, ETA = 2021-09-12 16:51:48
Epoch[1] Loss=0.009340, P=0.665441, R=0.088617, F1=0.156405, Speed = 9.0 samples/s, 99.97 %, ETA = 2021-09-12 16:51:49
Saving checkpoint ... (2021-09-12 16:48:15.916625)
Epoch[1] Loss=0.009778, P=0.633333, R=0.086846, F1=0.152747, Speed = 8.0 samples/s, 99.98 %, ETA = 2021-09-12 16:52:17
Epoch[1] Loss=0.011003, P=0.706100, R=0.093376, F1=0.164940, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009351, P=0.704797, R=0.095357, F1=0.167986, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.008898, P=0.677064, R=0.096094, F1=0.168301, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009003, P=0.623400, R=0.086024, F1=0.151186, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.008821, P=0.661142, R=0.089616, F1=0.157837, Speed = 9.0 samples/s, 99.98 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009215, P=0.654917, R=0.087182, F1=0.153880, Speed = 9.1 samples/s, 99.99 %, ETA = 2021-09-12 16:51:50
Epoch[1] Loss=0.009103, P=0.690037, R=0.092028, F1=0.162397, Speed = 9.0 samples/s, 99.99 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009562, P=0.634686, R=0.081285, F1=0.144114, Speed = 9.0 samples/s, 99.99 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.008967, P=0.682657, R=0.095140, F1=0.167005, Speed = 9.0 samples/s, 99.99 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009289, P=0.677122, R=0.091498, F1=0.161212, Speed = 9.0 samples/s, 99.99 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009526, P=0.695167, R=0.089990, F1=0.159352, Speed = 9.0 samples/s, 100.00 %, ETA = 2021-09-12 16:51:51
Epoch[1] Loss=0.009428, P=0.683150, R=0.090162, F1=0.159300, Speed = 9.1 samples/s, 100.00 %, ETA = 2021-09-12 16:51:51
I don't know what any of these mean, of course. Machine learning is still a learning process badum tss
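For readers in the same position: P is precision (the share of predicted tags that are correct), R is recall (the share of true tags that were predicted), and F1 is their harmonic mean. The sketch below parses one line in the format shown above and recomputes F1 from P and R. The regular expression is written against the pasted lines only; other log formats are not covered.

```python
import re

# Matches the metric fields of lines such as
# "Epoch[1] Loss=0.010009, P=0.662313, R=0.086312, F1=0.152721, Speed = ..."
LINE_RE = re.compile(
    r"Epoch\[(\d+)\] Loss=([\d.]+), P=([\d.]+), R=([\d.]+), F1=([\d.]+)"
)


def parse_line(line):
    """Return the metrics of one log line as a dict, or None for other lines."""
    match = LINE_RE.search(line)
    if match is None:
        return None  # e.g. "Saving checkpoint ..." lines
    epoch = int(match.group(1))
    loss, p, r, f1 = (float(x) for x in match.groups()[1:])
    return {"epoch": epoch, "loss": loss, "P": p, "R": r, "F1": f1}


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


sample = (
    "Epoch[1] Loss=0.010009, P=0.662313, R=0.086312, F1=0.152721, "
    "Speed = 7.8 samples/s, 99.80 %, ETA = 2021-09-12 16:55:48"
)
row = parse_line(sample)
print(row)
print(round(f1_score(row["P"], row["R"]), 6))
```

Read this way, the lines above say that roughly two thirds of the tags the model predicts are right, but it finds fewer than one in ten of the tags that are actually present.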
It seems the R value is too small. I think your training failed for some reason (overfitting, a GPU calculation error, and so on).
I think your training log contains a point where the R value suddenly drops. If you find that point, you should re-train from the nearest checkpoint. So I recommend backing up checkpoints periodically.
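One way to look for such a point, assuming the R values have already been extracted from the console output into a list, is to compare each value with the average of the values just before it. The window size and ratio below are arbitrary illustrative choices, not values recommended anywhere in this thread.

```python
def find_sudden_drops(recalls, window=20, ratio=0.5):
    """Return indices where recall falls below `ratio` times the mean of the
    previous `window` values."""
    drops = []
    for i in range(window, len(recalls)):
        baseline = sum(recalls[i - window:i]) / window
        if baseline > 0 and recalls[i] < ratio * baseline:
            drops.append(i)
    return drops


# Synthetic data: steady recall, then a collapse at index 20.
collapsed = [0.30] * 20 + [0.05] * 5
# Synthetic data: a slow, steady decline with no abrupt change.
gradual = [0.5 - 0.001 * i for i in range(100)]

print(find_sudden_drops(collapsed))
print(find_sudden_drops(gradual))
```

The first flagged index is the place to start looking for the nearest earlier checkpoint.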
I did have several system crashes from OOM before. That was probably the cause, huh. Is it recoverable somehow, or do I need to just start over? The checkpoint file is 3GB and I don't see any logs (been going for weeks).
EDIT: I bought more RAM, so that won't happen again
I recommend training from the start and carefully monitoring the R value in the log. Back up the checkpoints folder every day, and if the R value suddenly drops, cancel training, restore the checkpoints folder, then start again.
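The daily backup can be scripted. This is a minimal sketch using only the standard library; the function name and both paths are placeholders, and it assumes the copy is not taken while a checkpoint is being written (for example, run it shortly after a "Saving checkpoint ..." line appears).

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_checkpoints(src, backup_root):
    """Copy the folder `src` into `backup_root` under a timestamped name
    and return the path of the new copy."""
    src = Path(src)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    shutil.copytree(src, dest)  # creates missing parent folders as needed
    return dest
```

Restoring is the reverse: stop training, replace the live checkpoints folder with the chosen copy, and start again. With checkpoints of several GB each, old copies will need pruning by hand.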
ok. Are there logs? I only have console output, and redirecting ($ program > output.log) isn't working
EDIT: on restart, I've got values like so, are these good?
Epoch[0] Loss=0.739733, P=0.001940, R=0.501037, F1=0.003865, Speed = 4.6 samples/s, 0.00 %, ETA = 2021-09-23 05:53:49
Are there logs?
The current DD only logs to the console.
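Since the only log is the console, one workaround is a small wrapper that runs the program and copies everything it prints into a file. The thread does not establish why the plain redirect failed; the sketch below assumes the usual suspects (output going to stderr, or being buffered) and merges both streams. The function name is made up for this example.

```python
import subprocess
import sys


def run_and_log(cmd, log_path):
    """Run `cmd`, echo its output to the console, append it to `log_path`,
    and return the exit code."""
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            bufsize=1,  # line-buffered on our side
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
            log.flush()  # keep the file current if the machine crashes
        return proc.wait()
```

Pass the training command as a list of arguments, for example `run_and_log(["program", "arg"], "output.log")`. If the child is itself a Python program, setting `PYTHONUNBUFFERED=1` in its environment makes lines show up as they are printed rather than in large chunks.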
I've got values like so, are these good?
Yes, the starting values look fine.
Ok. I'll let it run. Thanks for all your help
It didn't even last 12 hours. I'm not sure how long it lasted. I'll try running it CPU only and see if it at least maintains an R value. In the worst case, I can run it over network on a machine with an RTX3070, as running CPU only takes like 8x longer, and that's on the scale of months here. Is the model portable, or does it rely on absolute paths? I can buy a new GPU for the server it's running on, but that'll take time with the current market.
The model is portable. It uses relative paths, and you can use different hardware for training and for evaluating.
Perfect, thanks.
log.txt You said a sudden drop is bad, but what about a gradual one?
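I found this and have been reading: https://neptune.ai/blog/keras-metrics Correct me if I'm wrong, please. R is recall. The value returned is something like this:
from tensorflow.keras import backend as K

def recall(y_true, y_pred):
    # Note: this line replaces the labels with all ones, so every tag is
    # counted as a positive; a conventional recall keeps y_true as given.
    y_true = K.ones_like(y_true)
    # Predictions are clipped to [0, 1] and rounded, i.e. thresholded near 0.5.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    # epsilon avoids division by zero when there are no positives.
    recall = true_positives / (all_positives + K.epsilon())
    return recall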
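For comparison with the Keras snippet above, here is the conventional definition of the three metrics in plain Python, for a flat list of 0/1 labels and predicted scores. This is a sketch of the textbook formulas, not DeepDanbooru's actual metric code, and the 0.5 cut-off is an assumption.

```python
def precision_recall_f1(y_true, y_pred, threshold=0.5):
    """Compute precision, recall and F1 for 0/1 labels and predicted scores."""
    tp = fp = fn = 0
    for label, score in zip(y_true, y_pred):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1  # tag predicted and actually present
        elif predicted and label == 0:
            fp += 1  # tag predicted but not present
        elif not predicted and label == 1:
            fn += 1  # tag present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Four tags are present, but only one is predicted with confidence.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.1, 0.3, 0.1, 0.05]
print(precision_recall_f1(labels, scores))
```

The example has the same shape as the training log: everything the model predicts is correct, so precision is high, but it misses three of four tags, so recall is low.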
A gradual one is fine. At the start, the network is initialized with random values, so the R value is quite high. It then decreases to some point and gradually increases again (along with the P value). Once the R value is increasing, it should not drop suddenly.
Ok, I'll keep an eye on it
Epoch[0] Loss=0.273658, P=0.084761, R=0.094115, F1=0.089193, Speed = 47.9 samples/s, 4.02 %, ETA = 2021-09-15 11:14:22
Epoch[0] Loss=0.276624, P=0.067192, R=0.101965, F1=0.081004, Speed = 46.4 samples/s, 4.02 %, ETA = 2021-09-15 11:58:56
Epoch[0] Loss=0.273647, P=0.078138, R=0.089073, F1=0.083248, Speed = 46.9 samples/s, 4.02 %, ETA = 2021-09-15 11:41:45
Should've used the 3070 from the start. That's like 5x faster than the RX570....
Random thing before I wait again after tweaking perf for the 3070. Is this okay? The None in Model instinctively makes me worry.
Using SGD optimizer ...
Loading tags ...
Creating model (resnet_custom_v4) ...
Model : (None, 299, 299, 3) -> (None, 14176)
Loading database ...
None is okay :) Don't worry.
Ok thanks.
First epoch yielded results! They weren't perfect results, but I wouldn't expect it. I have it queued for another 9 epochs. Thank you for all your help so far
I'm all of a sudden getting huge performance hits, and I have no idea why. CUDA (or any other GPU graph) is not even utilized, let alone bottlenecked. Do you have any ideas?
I've never seen anything like this, but it looks like hardware trouble (or throttling?). I recommend checking the GPU temperature and cooling fan status.
It's 46C, so maybe? That was my first guess. 46 is cold for a GPU at load, but it has been running for weeks, so idk. I'll keep looking at it. Thanks.
EDIT: a full restart fixed it
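For anyone who wants to watch temperature and utilization while training, the values can be polled from `nvidia-smi`. This sketch assumes an NVIDIA card with `nvidia-smi` on the PATH; the function names are made up for this example, and the parser assumes numeric fields (some GPUs report `[N/A]` for a field, which this does not handle).

```python
import subprocess

# Query temperature (deg C) and GPU utilization (%) as bare CSV, one row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        temp, util = (int(field.strip()) for field in line.split(","))
        rows.append({"temp_c": temp, "util_pct": util})
    return rows


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

A low temperature together with near-zero utilization, as described above, suggests the GPU is sitting idle rather than throttling.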
Can you help me decipher the meaning of some of this? I can't tell if it actually worked and just didn't have enough training to give tags on the image, or if something got in the way. The things I would expect to be a problem, if anything, are it not compiling the model or using MLIR optimizations.
I used