PaddlePaddle / PaddleSeg

Easy-to-use image segmentation library with an awesome pre-trained model zoo, supporting a wide range of practical tasks in Semantic Segmentation, Interactive Segmentation, Panoptic Segmentation, Image Matting, 3D Segmentation, etc.
https://arxiv.org/abs/2101.06175
Apache License 2.0

After installation, testing with optic_disc_seg: training hangs right after launch with no error reported #1278

Closed liuhuaguang closed 3 years ago

liuhuaguang commented 3 years ago

1) PaddlePaddle version: 2.2
2) GPU: GTX 1080 Ti, CUDA 10.2, cuDNN 7.6.5
3) System environment: Windows 10 Pro 19042.1165, Python 3.9

Output of nvcc -V:

C:\Users\HUAOPT>nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:32:27_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.2, V10.2.89

Output of nvidia-smi:

Tue Aug 24 10:59:35 2021
NVIDIA-SMI 456.71    Driver Version: 456.71    CUDA Version: 11.1
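
A quick way to sanity-check the Paddle GPU install itself, independent of PaddleSeg, is Paddle's built-in installation check (a minimal sketch):

python -c "import paddle; paddle.utils.run_check(); print(paddle.device.get_device())"

If the install is healthy, run_check reports that PaddlePaddle is installed successfully and get_device prints something like 'gpu:0'.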

Training configuration file:

Dataset configuration

DATASET:
    DATA_DIR: "./dataset/optic_disc_seg/"
    NUM_CLASSES: 2
    TEST_FILE_LIST: "./dataset/optic_disc_seg/test_list.txt"
    TRAIN_FILE_LIST: "./dataset/optic_disc_seg/train_list.txt"
    VAL_FILE_LIST: "./dataset/optic_disc_seg/val_list.txt"
    VIS_FILE_LIST: "./dataset/optic_disc_seg/test_list.txt"
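
Each line of these list files is expected to pair an image path with its label path relative to DATA_DIR, separated by a single space (the SEPARATOR shown in the config dump below); the filenames here are hypothetical:

JPEGImages/H0001.jpg Annotations/H0001.png
JPEGImages/H0002.jpg Annotations/H0002.png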

Pretrained model configuration

MODEL:
    MODEL_NAME: "deeplabv3p"
    DEFAULT_NORM_TYPE: "bn"
    DEEPLAB:
        BACKBONE: "xception_65"

Other configuration

TRAIN_CROP_SIZE: (512, 512)
EVAL_CROP_SIZE: (512, 512)
AUG:
    AUG_METHOD: "unpadding"
    FIX_RESIZE_SIZE: (512, 512)
BATCH_SIZE: 1
TRAIN:
    PRETRAINED_MODEL_DIR: "./pretrained_model/deeplabv3p_xception65_bn_coco/"
    MODEL_SAVE_DIR: "./saved_model/deeplabv3p_xception65_bn_optic/"
    SNAPSHOT_EPOCH: 5
TEST:
    TEST_MODEL: "./saved_model/deeplabv3p_xception65_bn_optic/final"
SOLVER:
    NUM_EPOCHS: 10
    LR: 0.001
    LR_POLICY: "poly"
    OPTIMIZER: "adam"

Training command: python pdseg/train.py --cfg ./configs/deeplabv3p_xception65_optic.yaml
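
Assuming the legacy pdseg/train.py interface, GPU training is typically requested explicitly, roughly like this on Windows:

set CUDA_VISIBLE_DEVICES=0
python pdseg/train.py --use_gpu --cfg ./configs/deeplabv3p_xception65_optic.yaml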

Tutorial: https://www.paddlepaddle.org.cn/modelbasedetail/deeplabv3plus

Over multiple training attempts, the run usually reaches "Use multi-thread reader" and then hangs there (for more than 12 hours) without reporting any error. Occasionally a run does proceed, but it then fails with the following error: Cuda error(719), unspecified launch failure.

What could be causing this? In the same environment, PaddleDetection runs without any problems; only PaddleSeg behaves like this.
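
One possible way to narrow this down, assuming the hang is tied to the multi-thread reader, is to lower the reader settings in the same YAML and rerun (the keys appear in the config dump further down):

DATALOADER:
    NUM_WORKERS: 1   # the dump below shows 8; a single worker helps isolate multi-thread reader issues
    BUF_SIZE: 64     # smaller prefetch buffer than the default 256 shown in the dump

If the hang disappears with a single worker, the multi-thread reader is the likely culprit.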

liuhuaguang commented 3 years ago

The detailed run log is as follows:

NUM_TRAINERS 1
{'AUG': {'AUG_METHOD': 'unpadding', 'FIX_RESIZE_SIZE': (512, 512), 'FLIP': False, 'FLIP_RATIO': 0.5, 'INF_RESIZE_VALUE': 500, 'MAX_RESIZE_VALUE': 600, 'MAX_SCALE_FACTOR': 2.0, 'MIN_RESIZE_VALUE': 400, 'MIN_SCALE_FACTOR': 0.5, 'MIRROR': True, 'RICH_CROP': {'ASPECT_RATIO': 0.33, 'BLUR': False, 'BLUR_RATIO': 0.1, 'BRIGHTNESS_JITTER_RATIO': 0.5, 'CONTRAST_JITTER_RATIO': 0.5, 'ENABLE': False, 'MAX_ROTATION': 15, 'MIN_AREA_RATIO': 0.5, 'SATURATION_JITTER_RATIO': 0.5}, 'SCALE_STEP_SIZE': 0.25, 'TO_RGB': False},
 'BATCH_SIZE': 4,
 'DATALOADER': {'BUF_SIZE': 256, 'NUM_WORKERS': 8},
 'DATASET': {'DATA_DIM': 3, 'DATA_DIR': './dataset/optic_disc_seg/', 'IGNORE_INDEX': 255, 'IMAGE_TYPE': 'rgb', 'NUM_CLASSES': 2, 'PADDING_VALUE': [127.5, 127.5, 127.5], 'SEPARATOR': ' ', 'TEST_FILE_LIST': './dataset/optic_disc_seg/test_list.txt', 'TEST_TOTAL_IMAGES': 38, 'TRAIN_FILE_LIST': './dataset/optic_disc_seg/train_list.txt', 'TRAIN_TOTAL_IMAGES': 267, 'VAL_FILE_LIST': './dataset/optic_disc_seg/val_list.txt', 'VAL_TOTAL_IMAGES': 76, 'VIS_FILE_LIST': './dataset/optic_disc_seg/test_list.txt'},
 'EVAL_CROP_SIZE': (512, 512),
 'FREEZE': {'MODEL_FILENAME': 'model', 'PARAMS_FILENAME': 'params', 'SAVE_DIR': 'freeze_model'},
 'MEAN': [0.5, 0.5, 0.5],
 'MODEL': {'BN_MOMENTUM': 0.99, 'DEEPLAB': {'ALIGN_CORNERS': True, 'ASPP_WITH_SEP_CONV': True, 'BACKBONE': 'xception_65', 'BACKBONE_LR_MULT_LIST': None, 'BENCHMARK': False, 'BIAS': False, 'DECODER': {'ACT': True, 'CONV_FILTERS': 256, 'OUTPUT_IS_LOGITS': False, 'USE_SUM_MERGE': False}, 'DECODER_USE_SEP_CONV': True, 'DEPTH_MULTIPLIER': 1.0, 'ENABLE_DECODER': True, 'ENCODER': {'ADD_IMAGE_LEVEL_FEATURE': True, 'ASPP_CONVS_FILTERS': 256, 'ASPP_RATIOS': None, 'ASPP_WITH_CONCAT_PROJECTION': True, 'ASPP_WITH_SE': False, 'POOLING_CROP_SIZE': None, 'POOLING_STRIDE': [1, 1], 'SE_USE_QSIGMOID': False}, 'ENCODER_WITH_ASPP': True, 'OUTPUT_STRIDE': 16}, 'DEFAULT_EPSILON': 1e-05, 'DEFAULT_GROUP_NUMBER': 32, 'DEFAULT_NORM_TYPE': 'bn', 'FP16': False, 'HRNET': {'ALIGN_CORNERS': True, 'BIAS': False, 'STAGE2': {'NUM_CHANNELS': [40, 80], 'NUM_MODULES': 1}, 'STAGE3': {'NUM_CHANNELS': [40, 80, 160], 'NUM_MODULES': 4}, 'STAGE4': {'NUM_CHANNELS': [40, 80, 160, 320], 'NUM_MODULES': 3}}, 'ICNET': {'DEPTH_MULTIPLIER': 0.5, 'LAYERS': 50}, 'MODEL_NAME': 'deeplabv3p', 'MULTI_LOSS_WEIGHT': [1.0], 'OCR': {'OCR_KEY_CHANNELS': 256, 'OCR_MID_CHANNELS': 512}, 'PSPNET': {'DEPTH_MULTIPLIER': 1, 'LAYERS': 50}, 'SCALE_LOSS': 'DYNAMIC', 'UNET': {'UPSAMPLE_MODE': 'bilinear'}},
 'NUM_TRAINERS': 1,
 'SLIM': {'KNOWLEDGE_DISTILL': False, 'KNOWLEDGE_DISTILL_IS_TEACHER': False, 'KNOWLEDGE_DISTILL_TEACHER_MODEL_DIR': '', 'NAS_ADDRESS': '', 'NAS_IS_SERVER': True, 'NAS_PORT': 23333, 'NAS_SEARCH_STEPS': 100, 'NAS_SPACE_NAME': '', 'NAS_START_EVAL_EPOCH': 0, 'PREPROCESS': False, 'PRUNE_PARAMS': '', 'PRUNE_RATIOS': []},
 'SOLVER': {'BEGIN_EPOCH': 1, 'CROSS_ENTROPY_WEIGHT': None, 'DECAY_EPOCH': [10, 20], 'GAMMA': 0.1, 'LOSS': ['softmax_loss'], 'LOSS_WEIGHT': {'BCE_LOSS': 1, 'DICE_LOSS': 1, 'LOVASZ_HINGE_LOSS': 1, 'LOVASZ_SOFTMAX_LOSS': 1, 'SOFTMAX_LOSS': 1}, 'LR': 0.001, 'LR_POLICY': 'poly', 'LR_WARMUP': False, 'LR_WARMUP_STEPS': 2000, 'MOMENTUM': 0.9, 'MOMENTUM2': 0.999, 'NUM_EPOCHS': 10, 'OPTIMIZER': 'adam', 'POWER': 0.9, 'WEIGHT_DECAY': 4e-05},
 'STD': [0.5, 0.5, 0.5],
 'TEST': {'TEST_MODEL': './saved_model/deeplabv3p_xception65_bn_optic/final'},
 'TRAIN': {'MODEL_SAVE_DIR': './saved_model/deeplabv3p_xception65_bn_optic/', 'PRETRAINED_MODEL_DIR': './pretrained_model/deeplabv3p_xception65_bn_coco/', 'RESUME_MODEL_DIR': '', 'SNAPSHOT_EPOCH': 5, 'SYNC_BATCH_NORM': False},
 'TRAINER_ID': 0,
 'TRAIN_CROP_SIZE': (512, 512)}

!!! The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list. CPU_NUM indicates that how many CPUPlace are used in the current task. And if this parameter are set as N (equal to the number of physical CPU core) the program may be faster.

export CPU_NUM=6 # for example, set CPU_NUM as number of physical CPU core which is 6.

!!! The default number of CPU_NUM=1.

Device count: 1

batch_size_per_dev: 4

C:\Users\HUAOPT\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\fluid\layers\math_op_patch.py:294: UserWarning: D:\PaddleSeg\legacy\pdseg\models\backbone\xception.py:295 The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  warnings.warn(
D:\PaddleSeg\legacy\pdseg\loss.py:38: DeprecationWarning: Warning: API "paddle.nn.functional.loss.softmax_with_cross_entropy" is deprecated since 2.0.0, and will be removed in future versions. Please use "paddle.nn.functional.cross_entropy" instead. reason: Please notice that behavior of "paddle.nn.functional.softmax_with_cross_entropy" and "paddle.nn.functional.cross_entropy" is different.
  loss, probs = F.softmax_with_cross_entropy(
C:\Users\HUAOPT\AppData\Local\Programs\Python\Python39\lib\site-packages\paddle\fluid\layers\math_op_patch.py:294: UserWarning: D:\PaddleSeg\legacy\pdseg\loss.py:78 The behavior of expression A * B has been unified with elementwise_mul(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_mul(X, Y, axis=0) instead of A * B. This transitional warning will be dropped in the future.
  warnings.warn(
D:\PaddleSeg\legacy\pdseg\utils\load_model_utils.py:26: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  version = np.fromstring(f.read(4), dtype='int32')
D:\PaddleSeg\legacy\pdseg\utils\load_model_utils.py:27: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  lod_level = np.fromstring(f.read(8), dtype='int64')
D:\PaddleSeg\legacy\pdseg\utils\load_model_utils.py:31: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  version = np.fromstring(f.read(4), dtype='int32')
D:\PaddleSeg\legacy\pdseg\utils\load_model_utils.py:33: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  tensor_desc_size = np.fromstring(f.read(4), dtype='int32')
[SKIP] Shape of pretrained weight ./pretrained_model/deeplabv3p_xception65_bn_coco//logit/weights doesn't match.(Pretrained: (21, 256, 1, 1), Actual: (2, 256, 1, 1))
[SKIP] Shape of pretrained weight ./pretrained_model/deeplabv3p_xception65_bn_coco//logit/biases doesn't match.(Pretrained: (21,), Actual: (2,))
There are 730/732 varaibles in ./pretrained_model/deeplabv3p_xception65_bn_coco/ are loaded.
Use multi-thread reader

wuyefeilin commented 3 years ago

It looks like you are following the tutorial for the legacy version. Please follow the tutorial below and run with the release/2.2 branch: https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.2/docs/quick_start_cn.md
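
For reference, the quick start linked above trains the same optic disc dataset with the current PaddleSeg API; the command below is a sketch based on that guide, so the config path and flags should be checked against the linked document:

python train.py --config configs/quick_start/bisenet_optic_disc_512x512_1k.yml --do_eval --use_vdl --save_interval 500 --save_dir output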