AaronJny / xyolo

A highly encapsulated YOLOv3 library.
MIT License

Training runs for only one epoch, then the program hangs and cannot continue #5

Open ddddddreamcastle opened 3 years ago

ddddddreamcastle commented 3 years ago
Create Tiny YOLOv3 model with 6 anchors and 80 classes.
Load weights /DATA/xyolo/xyolo/xyolo_data/keras_weights.h5.
Freeze the first 42 layers of total 44 layers.
2021-01-31 18:03:41.533 | INFO     | xyolo.xyolo.yolo3.yolo:fit:195 - Prepare to train the model...
2021-01-31 18:03:41.533 | INFO     | xyolo.xyolo.yolo3.yolo:fit:208 - Split dataset for validate...
2021-01-31 18:03:41.581 | INFO     | xyolo.xyolo.yolo3.yolo:fit:219 - The first step training begins(50 epochs).
2021-01-31 18:03:41.597 | INFO     | xyolo.xyolo.yolo3.yolo:fit:229 - Train on 32 samples, val on 11704 samples, with batch size 32.
Epoch 1/50
WARNING:tensorflow:AutoGraph could not transform <function YOLO.fit.<locals>.<lambda> at 0x7fd89037a400> and will run it as-is.
Cause: could not parse the source code:

                'yolo_loss': lambda y_true, y_pred: y_pred})

This error may be avoided by creating the lambda in a standalone statement.

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
2021-01-31 18:03:46.347332: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] layout failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2021-01-31 18:03:46.357090: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2021-01-31 18:03:46.461087: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2021-01-31 18:03:46.713765: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-01-31 18:03:48.397605: W tensorflow/stream_executor/gpu/asm_compiler.cc:81] Running ptxas --version returned 256
2021-01-31 18:03:48.523471: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: 
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2021-01-31 18:03:49.965347: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
1/1 [==============================] - ETA: 0s - loss: 37.3185
2021-01-31 18:04:08.987385: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] layout failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2021-01-31 18:04:08.993598: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2021-01-31 18:04:09.057574: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)

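As you can see, only the first epoch ran. Ignore the fact that "1/1" shows a single step: I only wanted to check whether training works at all, so I picked just 32 images from the training set to run through the pipeline. The log ends here. The program hangs without exiting and makes no further progress. GPU utilization is 0% and one CPU core is at 100%; I checked, and it is indeed the training process that is using it.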
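A hang like this (no further log output, GPU idle, one CPU core pegged) is easier to narrow down once you know which Python frame the process is stuck in. The following is a minimal sketch using the standard-library `faulthandler` module. The helper name `hang_watchdog` and the timeout values are hypothetical and are not part of xyolo.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=300, file=None):
    """Dump the stack of every thread each `timeout_s` seconds while the block runs.

    `file` must be a real file object with a file descriptor; it defaults to stderr.
    """
    out = sys.stderr if file is None else file
    # repeat=True keeps dumping at every interval until the watchdog is cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
    try:
        yield
    finally:
        # Always cancel, so no dump fires after the wrapped block has finished.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the training call in `with hang_watchdog(120):` would print a traceback every two minutes for as long as training is still running. If the same frame shows up in every dump after the first epoch, that frame is a reasonable place to start looking.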

Environment: Ubuntu 16.04, tensorflow 2.3.0, CUDA 10.1, cuDNN 7.6.5, Python 3.6.10, all inside an Anaconda environment. CUDA and cuDNN come from Anaconda's cudatoolkit 10.1 and cudnn 7.6.5 packages.

ddddddreamcastle commented 3 years ago

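One more thing to add: the code is just the sample from the project home page, switched to the tiny model. In fact, I hit the same problem when running the keras-yolo3 code.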
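Regarding the AutoGraph warning in the log: the message itself says the lambda will be run as-is, so it looks like log noise rather than an explanation for the hang. If you want to quiet it, the warning suggests avoiding the inline lambda. Below is a small sketch of that idea with a named function. The name `passthrough_loss` is hypothetical, and the docstring's description of why the loss is an identity function is an assumption based on the `'yolo_loss': lambda y_true, y_pred: y_pred` entry shown in the log.

```python
def passthrough_loss(y_true, y_pred):
    """Return the model output unchanged.

    Assumption: the model's 'yolo_loss' output already is the loss value,
    which is why the original code uses an identity lambda here.
    """
    return y_pred


# Hypothetical usage, mirroring the dict entry seen in the log:
#   loss={'yolo_loss': passthrough_loss}
# The log also mentions @tf.autograph.experimental.do_not_convert as an
# alternative way to silence the warning.
```

A named function defined in its own statement gives AutoGraph source code it can locate, which is what the warning text asks for.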

LmingXie commented 1 year ago

Did the original poster ever solve this?

LmingXie commented 1 year ago

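My problem was that I was using the wrong tensorflow and tensorflow-gpu versions. I am on CUDA 11.4, and with tensorflow 2.3.0 and tensorflow-gpu 2.3.0 installed the project would run. I then ran this check: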

# Start a Python interpreter first
import tensorflow as tf
print(tf.test.is_gpu_available())

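The check passes as well, but tensorflow-gpu 2.3.0 is built for CUDA 10.1. After I switched to version 2.7.4 for CUDA 11.4, yolov3 could use the GPU normally. I have not found out why the GPU availability test passed with the mismatched version.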


Package                      Version
---------------------------- ----------
absl-py                      0.15.0
aiohttp                      3.8.3
aiosignal                    1.2.0
altgraph                     0.17.3
astor                        0.8.1
astunparse                   1.6.3
async-timeout                4.0.2
asynctest                    0.13.0
attrs                        22.1.0
blinker                      1.4
brotlipy                     0.7.0
cached-property              1.5.2
cachetools                   4.2.4
certifi                      2022.12.7
cffi                         1.15.1
chardet                      5.1.0
charset-normalizer           2.1.0
clang                        5.0
click                        8.1.3
colorama                     0.4.6
cryptography                 39.0.1
cycler                       0.11.0
flatbuffers                  1.12
flit_core                    3.6.0
fonttools                    4.38.0
frozenlist                   1.3.3
gast                         0.4.0
google-auth                  2.18.0
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.54.2
h5py                         3.8.0
idna                         3.4
importlib-metadata           6.6.0
keras                        2.7.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.4.4
labelImg                     1.8.6
libclang                     16.0.0
loguru                       0.7.0
lxml                         4.9.1
Markdown                     3.4.3
MarkupSafe                   2.1.2
matplotlib                   3.5.3
mkl-fft                      1.3.1
mkl-random                   1.2.2
mkl-service                  2.4.0
multidict                    6.0.2
numpy                        1.19.2
oauthlib                     3.2.2
opencv-python                4.7.0.72
opt-einsum                   3.3.0
packaging                    23.1
pefile                       2023.2.7
Pillow                       9.5.0
pip                          23.1.2
protobuf                     3.19.6
pyasn1                       0.5.0
pyasn1-modules               0.3.0
pycparser                    2.21
pyinstaller                  5.10.1
pyinstaller-hooks-contrib    2023.2
PyJWT                        2.4.0
pyOpenSSL                    23.0.0
pyparsing                    3.0.9
PyQt5                        5.15.9
pyqt5-plugins                5.15.9.2.3
PyQt5-Qt5                    5.15.2
pyqt5-tools                  5.15.9.3.3
PySocks                      1.7.1
python-dateutil              2.8.2
python-dotenv                0.21.1
pywin32-ctypes               0.2.0
qt5-applications             5.15.2.2.3
qt5-tools                    5.15.2.1.3
requests                     2.30.0
requests-oauthlib            1.3.1
rsa                          4.9
scipy                        1.7.3
setuptools                   65.6.3
six                          1.15.0
spicy                        0.16.0
tensorboard                  2.11.2
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorflow                   2.7.4
tensorflow-estimator         2.7.0
tensorflow-gpu               2.7.4
tensorflow-gpu-estimator     2.3.0
tensorflow-io-gcs-filesystem 0.31.0
termcolor                    1.1.0
torch                        1.12.1
torchaudio                   0.12.1
torchvision                  0.13.1
tqdm                         4.65.0
typing-extensions            3.7.4.3
urllib3                      1.26.15
Werkzeug                     2.2.3
wheel                        0.35.1
win-inet-pton                1.1.0
win32-setctime               1.1.0
wincertstore                 0.2
wrapt                        1.12.1
xyolo                        0.1.6
yarl                         1.8.1
zipp                         3.15.0

Version reference: https://tensorflow.google.cn/install/source_windows?hl=zh-cn#gpu