Segmentation on the Celeb face dataset: data and yaml file prepared according to the docs, but training fails

Diuyon commented 4 years ago

1. The error report is as follows:

----------------------
Error Message Summary:
----------------------
Error: An error occurred here. There is no accurate error hint for this error yet. We are continuously in the process of increasing hint for this kind of error check. It would be helpful if you could inform us of how this conversion went by opening a github issue. And we will resolve it with high priority.
  - New issue link: https://github.com/PaddlePaddle/Paddle/issues/new
  - Recommended issue content: all error stack information: unspecified launch failure at (D:\1.7.1\paddle\paddle\fluid\operators\reader\buffered_reader.cc:115)
  [operator < read > error]

2. The detailed error log is as follows:


--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
Windows not support stack backtrace yet.
------------------------------------------
Python Call Stacks (More useful to users):
------------------------------------------
  File "C:\ProgramData\Anaconda3\envs\work\lib\site-packages\paddle\fluid\framework.py", line 2525, in append_op
    attrs=kwargs.get("attrs", None))
  File "C:\ProgramData\Anaconda3\envs\work\lib\site-packages\paddle\fluid\reader.py", line 733, in _init_non_iterable
    outputs={'Out': self._feed_list})
  File "C:\ProgramData\Anaconda3\envs\work\lib\site-packages\paddle\fluid\reader.py", line 646, in __init__
    self._init_non_iterable()
  File "C:\ProgramData\Anaconda3\envs\work\lib\site-packages\paddle\fluid\reader.py", line 280, in from_generator
    iterable, return_list)
  File "C:\ProgramData\Anaconda3\envs\work\lib\site-packages\paddle\fluid\reader.py", line 1046, in __init__
    feed_list, capacity, use_double_buffer, iterable, return_list)
  File "D:\work\脸部裁剪\paddleSeg\pdseg\models\model_builder.py", line 200, in build_model
    use_double_buffer=True)
  File "pdseg/train.py", line 256, in train
    train_prog, startup_prog, phase=ModelPhase.TRAIN)
  File "pdseg/train.py", line 505, in main
    train(cfg)
  File "pdseg/train.py", line 518, in <module>
    main(args)

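3. Description of the error:


The log keeps flooding the terminal, so I had to stop the training.

I would be very grateful if you could help me solve this problem!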

LutaoChu commented 4 years ago

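Hi, this looks like a problem that occurs while using py_reader. Could you provide what is needed to reproduce it, such as the training config, the PaddleSeg and Paddle versions, and a few sample images?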

Diuyon commented 4 years ago

OK.

The yaml file is as follows:

BATCH_SIZE : 2
TRAIN_CROP_SIZE : (512, 512)
EVAL_CROP_SIZE : (1000, 1000)

# Dataset configuration
DATASET: 
    DATA_DIR : "../../data/CelebAMask/"
    TRAIN_FILE_LIST : "../../data/CelebAMask/train.txt"
    VAL_FILE_LIST: "../../data/CelebAMask/validation.txt"
    TEST_FILE_LIST: "../../data/CelebAMask/test.txt"
    VIS_FILE_LIST: "../../data/CelebAMask/validation.txt"
    NUM_CLASSES: 4

# Model configuration
MODEL:
    MODEL_NAME: "deeplabv3p"
    DEFAULT_NORM_TYPE: "bn"
    DEEPLAB:
        BACKBONE: "xception_65"

# Data augmentation
AUG:
    AUG_METHOD: "stepscaling"
    FIX_RESIZE_SIZE: (512, 512)

TRAIN:
    PRETRAINED_MODEL_DIR: "./pretrained_model/deeplabv3p_xception65_bn_coco"
    MODEL_SAVE_DIR: "./saved_model/deeplabv3p_xception65_headseg/"
    SNAPSHOT_EPOCH: 10
TEST:
    TEST_MODEL: ""

FREEZE:
    MODEL_FILENAME: "model"
    PARAMS_FILENAME: "params"

# Optimization settings
SOLVER:
    NUM_EPOCHS: 50
    LR: 0.001
    LR_POLICY: "poly"
    OPTIMIZER: "adam"

Sample images:

images(origin)
mask(label): 4 classes in total: hair (2), face (1), earrings (3), and everything outside those three (0)
Binarized visualization

Paddle version information:

paddlehub               1.5.4
paddlepaddle-gpu        1.7.1.post107
PaddleSeg is the latest pulled version

LutaoChu commented 4 years ago

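Are you training with PaddleHub? Are the pixel values of the 3 classes in the label 0, 1, 2? And what value does the background have, i.e. the black region in the binarized image? Is it labeled 255?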

Diuyon commented 4 years ago

Oh, I see, that is probably the problem. How embarrassing ￣□￣｜｜

LutaoChu commented 4 years ago

OK. Our annotation protocol starts from 0 and increases: 0, 1, 2. The default ignored class is 255.
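
A minimal sketch of checking one mask against this protocol, assuming the mask has already been loaded as a 2D grid of integer pixel values (nested lists or a 2D array). The function name `invalid_label_values` is hypothetical and is not part of PaddleSeg:

```python
def invalid_label_values(mask_rows, num_classes, ignore_index=255):
    """Return the sorted pixel values that are neither a class id
    in [0, num_classes) nor the ignore value."""
    allowed = set(range(num_classes)) | {ignore_index}
    # Collect every distinct pixel value that appears in the mask.
    found = {int(value) for row in mask_rows for value in row}
    return sorted(found - allowed)


# Toy 2x3 mask: with NUM_CLASSES = 4, the value 4 is out of range.
mask = [[0, 1, 2],
        [3, 4, 255]]
print(invalid_label_values(mask, num_classes=4))  # [4]
```

An empty result means every pixel is either a valid class id or the ignore value.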

LutaoChu commented 4 years ago

By the way, it is best to run pdseg/check.py on the data and config before training, so that problems like this are caught early.

Diuyon commented 4 years ago

I did run the check, and it passed. I just changed the config to NUM_CLASSES: 4, but the problem still occurs.

LutaoChu commented 4 years ago

Are your label pixels marked 0, 1, 2, 3, or something else?

Diuyon commented 4 years ago

In the masks, I set the values following the 0-255 rule.

Diuyon commented 4 years ago

I just checked, and some masks contain the label 4. I am going to fix that now.

Diuyon commented 4 years ago

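I fixed the masks, but the problem still occurs. When I tried converting the masks to 2 classes (foreground/background), the problem went away. Why is that?

I have updated the image information above to the version that contains only the 4 classes 0, 1, 2, 3.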

LutaoChu commented 4 years ago

It is probably still a mask annotation problem. How did you convert the masks to 2 classes?

Diuyon commented 4 years ago

Originally: 1: face; 2: hair; 3: earrings; 0: everything outside those three.
Binary: 1: face, hair, earrings; 0: everything else.
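
The merge described here can be sketched as follows. This is only an illustration under the assumption that a mask is a 2D grid of integer class ids; `merge_to_binary` is a hypothetical helper, not the script actually used in this thread:

```python
def merge_to_binary(mask_rows, foreground_ids=(1, 2, 3), ignore_index=255):
    """Map face/hair/earrings (1, 2, 3) to class 1 and everything else to 0.
    Pixels equal to ignore_index are left unchanged."""
    foreground = set(foreground_ids)
    merged = []
    for row in mask_rows:
        merged.append([
            ignore_index if value == ignore_index
            else (1 if value in foreground else 0)
            for value in row
        ])
    return merged


four_class_mask = [[0, 1, 2],
                   [3, 0, 255]]
print(merge_to_binary(four_class_mask))  # [[0, 1, 1], [1, 0, 255]]
```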

LutaoChu commented 4 years ago

That is indeed a bit strange. Please run check.py on the original data and post the resulting detail.log.

Diuyon commented 4 years ago

detail.log contents:

PASS ../../data/CelebAMask/test.txt DATASET.SEPARATOR check

PASS ../../data/CelebAMask/test.txt DATASET.SEPARATOR check
2020-03-23 16:36:03,469-INFO:
PASS dataset reading check

PASS dataset reading check
2020-03-23 16:36:03,469-INFO: All images can be read successfully
All images can be read successfully
2020-03-23 16:36:03,471-INFO:
PASS label gray check

PASS label gray check
2020-03-23 16:36:03,471-INFO: All label images are gray
All label images are gray
2020-03-23 16:36:03,471-INFO:
PASS label format check

PASS label format check
2020-03-23 16:36:03,472-INFO: total 6000 label images are png format, 0 label images are not png format
total 6000 label images are png format, 0 label images are not png format
2020-03-23 16:36:03,472-INFO:
Doing label pixel statistics:
(label class, total pixel number, percentage) = [(0, 576041840, 0.3662), (1, 523887858, 0.3331), (2, 468858689, 0.2981), (3, 4075613, 0.0026)]

Doing label pixel statistics:
(label class, total pixel number, percentage) = [(0, 576041840, 0.3662), (1, 523887858, 0.3331), (2, 468858689, 0.2981), (3, 4075613, 0.0026)]
2020-03-23 16:36:03,473-INFO:
PASS label class check!

PASS label class check!
2020-03-23 16:36:03,488-INFO:
PASS DATASET.IMAGE_TYPE check

PASS DATASET.IMAGE_TYPE check
2020-03-23 16:36:03,488-INFO:
Doing max image size statistics:

Doing max image size statistics:
2020-03-23 16:36:03,488-INFO: max width and max height of images are (512,512)
max width and max height of images are (512,512)
2020-03-23 16:36:03,489-INFO:
PASS shape check

PASS shape check
2020-03-23 16:36:03,489-INFO: All images are the same shape as the labels
All images are the same shape as the labels
2020-03-23 16:36:03,489-INFO:
PASS EVAL_CROP_SIZE check

PASS EVAL_CROP_SIZE check
2020-03-23 16:36:03,489-INFO: satisfy current EVAL_CROP_SIZE: (1000,1000) >= max width and max height of images: (512,512)
satisfy current EVAL_CROP_SIZE: (1000,1000) >= max width and max height of images: (512,512)

Detailed error information can be viewed in detail.log file.

LutaoChu commented 4 years ago

This is for test.txt. Is there one for train.txt?

LutaoChu commented 4 years ago

Just post the entire detail.log.

Diuyon commented 4 years ago

detail.log

LutaoChu commented 4 years ago

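PASS ../../data/CelebAMask/test.txt DATASET.SEPARATOR check
The check results for the test set look fine. Why was only the test set checked? The error occurs during training, so the training set should be checked.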

Diuyon commented 4 years ago

The training set passed as well. The detail.log file I sent contains all the check records.

LutaoChu commented 4 years ago

I see it now. The check records look normal. How about sending me some images so that I can reproduce it? chulutao@baidu.com

Diuyon commented 4 years ago

Email sent.

LutaoChu commented 4 years ago

NOT PASS loss check. Dice loss and bce loss is only applicable to binary classfication
This is the check result on my side: dice loss and bce loss can only be used for 2-class segmentation. Multi-class is not supported at the moment.
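
The rule behind this check result can be sketched as below. This is not the actual check.py implementation; the names `dice_loss` and `bce_loss` are assumed by analogy with `softmax_loss`, and `binary_only_losses_in_use` is a hypothetical helper:

```python
# Assumed config names for the two binary-only losses mentioned in the message.
BINARY_ONLY_LOSSES = {"dice_loss", "bce_loss"}


def binary_only_losses_in_use(losses, num_classes):
    """Return the configured losses that are only valid for 2 classes
    when the task has a different number of classes."""
    if num_classes == 2:
        return []
    return sorted(set(losses) & BINARY_ONLY_LOSSES)


print(binary_only_losses_in_use(["softmax_loss"], num_classes=4))           # []
print(binary_only_losses_in_use(["dice_loss", "bce_loss"], num_classes=4))  # ['bce_loss', 'dice_loss']
```

A non-empty result corresponds to the NOT PASS case: the listed losses would need to be replaced before multi-class training.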

Diuyon commented 4 years ago

But my config file does not specify which loss to use, so it should default to softmax, right? To make the multi-class task run properly, do I need to specify LOSS: ["softmax_loss"]?

LutaoChu commented 4 years ago

Yes, softmax is the default, so there is no need to specify it. I used the yaml you posted above.

Diuyon commented 4 years ago

So what should I do now?

LutaoChu commented 4 years ago

Please share the yaml you used afterwards, and run git branch to check your current PaddleSeg version.

Diuyon commented 4 years ago

Version: release/v0.4.0. yaml: for multi-class it is still the one above; for binary classification it is:

BATCH_SIZE : 16
TRAIN_CROP_SIZE : (512, 512)
EVAL_CROP_SIZE : (1000, 1000)

# Dataset configuration
DATASET: 
    DATA_DIR : "../../data/data25995/"
    TRAIN_FILE_LIST : "../../data/data25995/train.txt"
    VAL_FILE_LIST: "../../data/data25995/validation.txt"
    TEST_FILE_LIST: "../../data/data25995/test.txt"
    VIS_FILE_LIST: "../../data/data25995/validation.txt"
    NUM_CLASSES: 2

# Model configuration
MODEL:
    MODEL_NAME: "deeplabv3p"
    DEFAULT_NORM_TYPE: "bn"
    DEEPLAB:
        BACKBONE: "xception_65"

# Data augmentation
AUG:
    AUG_METHOD: "stepscaling"
    FIX_RESIZE_SIZE: (512, 512)

TRAIN:
    PRETRAINED_MODEL_DIR: "./saved_model/deeplabv3p_xception65_headseg/40"
    MODEL_SAVE_DIR: "./saved_model/deeplabv3p_xception65_headseg_2/"
    SNAPSHOT_EPOCH: 10
TEST:
    TEST_MODEL: "./saved_model/deeplabv3p_xception65_headseg_2/final/"

FREEZE:
    MODEL_FILENAME: "model"
    PARAMS_FILENAME: "params"

# Optimization settings
SOLVER:
    NUM_EPOCHS: 20
    LR: 0.001
    LR_POLICY: "poly"
    OPTIMIZER: "adam"
