On Ubuntu, training fails with "label expected >= 0 and < 2, or == 255, but got 89", although the model config and the dataset are both fine.

Search before asking

[X] I have searched the open and closed issues and found no similar bug report.

Describe the Bug

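I'm a long-time Paddle user. I followed the "Quick Start" part of the tutorial directly: the model config file is the default one, the data is the public optic_disc_seg dataset used as-is, and the settings in configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml are exactly as in the tutorial.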

First attempt, on Ubuntu. The output is as follows:

(paddle_seg) xuqing@dell-PowerEdge-R740:~/projects/paddle_seg/PaddleSeg$ python tools/train.py --config configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml --save_interval 500 --do_eval --use_vdl --save_dir output
2024-08-02 10:12:37 [WARNING] Add the `num_classes` in train_dataset and val_dataset config to model config. We suggest you manually set `num_classes` in model config.
2024-08-02 10:12:38 [INFO]
------------Environment Information-------------
platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python: 3.9.19 (main, Apr 6 2024, 17:57:55) [GCC 11.4.0]
Paddle compiled with cuda: True
NVCC: Build cuda_11.8.r11.8/compiler.31833905_0
cudnn: 8.6
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: NVIDIA GeForce', 'GPU 1: NVIDIA GeForce']
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PaddleSeg: 0.0.0.dev0
PaddlePaddle: 2.6.1
OpenCV: 4.10.0

2024-08-02 10:12:38 [INFO]
---------------Config Information---------------
batch_size: 4
iters: 1000
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  num_classes: 2
  train_path: data/optic_disc_seg/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.5
    contrast_range: 0.5
    saturation_range: 0.5
    type: RandomDistort
  - type: Normalize
  type: Dataset
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/optic_disc_seg/val_list.txt
optimizer:
  momentum: 0.9
  type: SGD
  weight_decay: 4.0e-05
lr_scheduler:
  end_lr: 0
  learning_rate: 0.01
  power: 0.9
  type: PolynomialDecay
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
    type: STDC2
  num_classes: 2
  type: PPLiteSeg

2024-08-02 10:12:38 [INFO] Set device: gpu
2024-08-02 10:12:38 [INFO] Use the following config to build model
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
    type: STDC2
  num_classes: 2
  type: PPLiteSeg
W0802 10:12:38.203576 221128 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.2, Runtime API Version: 11.8
W0802 10:12:38.203653 221128 gpu_resources.cc:164] device: 0, cuDNN Version: 8.6.
2024-08-02 10:12:38 [INFO] Loading pretrained model from https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
2024-08-02 10:12:38 [INFO] There are 265/265 variables loaded into STDCNet.
2024-08-02 10:12:38 [INFO] Use the following config to build train_dataset
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  num_classes: 2
  train_path: data/optic_disc_seg/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.5
    contrast_range: 0.5
    saturation_range: 0.5
    type: RandomDistort
  - type: Normalize
  type: Dataset
2024-08-02 10:12:38 [INFO] Use the following config to build val_dataset
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/optic_disc_seg/val_list.txt
2024-08-02 10:12:38 [INFO] If the type is SGD and momentum in optimizer config, the type is changed to Momentum.
2024-08-02 10:12:38 [INFO] Use the following config to build optimizer
optimizer:
  momentum: 0.9
  type: Momentum
  weight_decay: 4.0e-05
2024-08-02 10:12:38 [INFO] Use the following config to build loss
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/paddle/nn/layer/norm.py:824: UserWarning: When training, we now always track global mean and variance.
  warnings.warn(
Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 218. Please check label value.
Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
Error: /paddle/paddle/phi/kernels/gpu/cross_entropy_kernel.cu:998 Assertion false failed. The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value.
[These four lines repeat 8 times in total in the output.]
Traceback (most recent call last):
  File "/home/xuqing/projects/paddle_seg/PaddleSeg/tools/train.py", line 219, in <module>
    main(args)
  File "/home/xuqing/projects/paddle_seg/PaddleSeg/tools/train.py", line 193, in main
    train(
  File "/home/xuqing/projects/paddle_seg/PaddleSeg/paddleseg/core/train.py", line 247, in train
    loss.backward()
  File "/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/home/xuqing/projects/paddle_seg/lib/python3.9/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 342, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
OSError: (External) CUDA error(719), unspecified launch failure.
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:265)

OK, so this gives us two pieces of information:
a. "The value of label expected >= 0 and < 2, or == 255, but got 89. Please check label value." From this I infer that either the labels setting is wrong or the labels in the dataset are wrong. Either way, the labels in the dataset do not match num_classes in the config.
b. The error comes from a .cc or .cu file, so Python-level debugging probably won't solve it. If this turns out to be a C/C++ problem, it is going to be a real hassle.

So let's look at the dataset first. I opened a random image in optic_disc_seg/Annotations and found that the background is 0 and the annotated region is shown in red, which felt wrong. As far as I remember, the data format that worked for me before looked like this: if the labels are background, vehicle and person, then in an annotation image the background pixels are 0, the vehicle pixels are 1 and the person pixels are 2. Fine, so I changed the data by hand with np.clip(image, 0, 1), since there are only two classes anyway, and ran training again. It still fails with the same error as above. Multi-GPU training with python -m paddle.distributed.launch tools/train.py fails with the same label mismatch error as well.
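For anyone who wants to check their own annotation files, the condition in the assertion message can be reproduced on the CPU. This is only a minimal sketch, not PaddleSeg code: `invalid_label_values` and `load_label` are hypothetical helper names, the ignore index 255 is taken from the error message, and `load_label` assumes Pillow is installed.

```python
import numpy as np


def invalid_label_values(label, num_classes, ignore_index=255):
    """Return the sorted unique values that are neither in
    [0, num_classes) nor equal to ignore_index."""
    values = np.unique(np.asarray(label))
    ok = ((values >= 0) & (values < num_classes)) | (values == ignore_index)
    return [int(v) for v in values[~ok]]


def load_label(path):
    """Read an annotation PNG without converting its mode.

    Hypothetical helper; for a palette ("P" mode) PNG this yields the
    palette indices rather than the displayed colors.
    """
    from PIL import Image  # imported lazily so the check works without Pillow
    return np.asarray(Image.open(path))


# Toy label map containing the two values from the log plus valid ones.
toy = np.array([[0, 1, 255], [89, 218, 1]], dtype=np.uint8)
print(invalid_label_values(toy, num_classes=2))                  # [89, 218]
print(invalid_label_values(np.clip(toy, 0, 1), num_classes=2))   # []
```

Calling `invalid_label_values(load_label(p), 2)` for each file under optic_disc_seg/Annotations would show whether the files on disk contain out-of-range values, or whether the values only appear after the training pipeline has read them.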

Fine. Besides the Ubuntu server, my local machine is usable too, so I ran the same thing directly on Windows. To my surprise, it trains without any problem. The output is as follows:

(paddleseg) D:\env\paddleseg\PaddleSeg>python tools/train.py --config configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml --save_interval 500 --do_eval --use_vdl --save_dir output
2024-08-02 10:33:12 [WARNING] Add the `num_classes` in train_dataset and val_dataset config to model config. We suggest you manually set `num_classes` in model config.
2024-08-02 10:33:12 [INFO]
------------Environment Information-------------
platform: Windows-10-10.0.19041-SP0
Python: 3.9.2 (tags/v3.9.2:1a79785, Feb 19 2021, 13:44:55) [MSC v.1928 64 bit (AMD64)]
Paddle compiled with cuda: True
NVCC: Build cuda_11.7.r11.7/compiler.31294372_0
cudnn: 8.4
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: NVIDIA GeForce']
GCC: gcc (MinGW-W64 x86_64-posix-seh, built by Brecht Sanders) 11.3.0
PaddleSeg: 2.9.0
PaddlePaddle: 2.5.2
OpenCV: 4.8.1

2024-08-02 10:33:12 [INFO]
---------------Config Information---------------
batch_size: 4
iters: 1000
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  num_classes: 2
  train_path: data/optic_disc_seg/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.5
    contrast_range: 0.5
    saturation_range: 0.5
    type: RandomDistort
  - type: Normalize
  type: Dataset
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/optic_disc_seg/val_list.txt
optimizer:
  momentum: 0.9
  type: SGD
  weight_decay: 4.0e-05
lr_scheduler:
  end_lr: 0
  learning_rate: 0.01
  power: 0.9
  type: PolynomialDecay
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
    type: STDC2
  num_classes: 2
  type: PPLiteSeg

2024-08-02 10:33:12 [INFO] Set device: gpu
2024-08-02 10:33:12 [INFO] Use the following config to build model
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
    type: STDC2
  num_classes: 2
  type: PPLiteSeg
W0802 10:33:12.746259 17620 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.7
W0802 10:33:12.746259 17620 gpu_resources.cc:149] device: 0, cuDNN Version: 8.4.
2024-08-02 10:33:13 [INFO] Loading pretrained model from https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
Connecting to https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet2.tar.gz
Downloading PP_STDCNet2.tar.gz
[==================================================] 100.00%
Uncompress PP_STDCNet2.tar.gz
[==================================================] 100.00%
2024-08-02 10:33:15 [INFO] There are 265/265 variables loaded into STDCNet.
2024-08-02 10:33:15 [INFO] Use the following config to build train_dataset
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  num_classes: 2
  train_path: data/optic_disc_seg/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.5
    contrast_range: 0.5
    saturation_range: 0.5
    type: RandomDistort
  - type: Normalize
  type: Dataset
2024-08-02 10:33:15 [INFO] Use the following config to build val_dataset
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/optic_disc_seg/val_list.txt
2024-08-02 10:33:15 [INFO] If the type is SGD and momentum in optimizer config, the type is changed to Momentum.
2024-08-02 10:33:15 [INFO] Use the following config to build optimizer
optimizer:
  momentum: 0.9
  type: Momentum
  weight_decay: 4.0e-05
2024-08-02 10:33:15 [INFO] Use the following config to build loss
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
  - type: CrossEntropyLoss
D:\env\paddleseg\lib\site-packages\paddle\nn\layer\norm.py:777: UserWarning: When training, we now always track global mean and variance.
  warnings.warn(
2024-08-02 10:33:20 [INFO] [TRAIN] epoch: 1, iter: 10/1000, loss: 1.2764, lr: 0.009919, batch_cost: 0.3724, reader_cost: 0.02240, ips: 10.7403 samples/sec | ETA 00:06:08
2024-08-02 10:33:21 [INFO] [TRAIN] epoch: 1, iter: 20/1000, loss: 0.2465, lr: 0.009829, batch_cost: 0.1303, reader_cost: 0.00000, ips: 30.6941 samples/sec | ETA 00:02:07
2024-08-02 10:33:22 [INFO] [TRAIN] epoch: 1, iter: 30/1000, loss: 0.2122, lr: 0.009739, batch_cost: 0.1300, reader_cost: 0.00000, ips: 30.7799 samples/sec | ETA 00:02:06
2024-08-02 10:33:24 [INFO] [TRAIN] epoch: 1, iter: 40/1000, loss: 0.2306, lr: 0.009648, batch_cost: 0.1301, reader_cost: 0.00010, ips: 30.7349 samples/sec | ETA 00:02:04
2024-08-02 10:33:25 [INFO] [TRAIN] epoch: 1, iter: 50/1000, loss: 0.1755, lr: 0.009558, batch_cost: 0.1303, reader_cost: 0.00000, ips: 30.7027 samples/sec | ETA 00:02:03
2024-08-02 10:33:26 [INFO] [TRAIN] epoch: 1, iter: 60/1000, loss: 0.1643, lr: 0.009467, batch_cost: 0.1300, reader_cost: 0.00000, ips: 30.7676 samples/sec | ETA 00:02:02
2024-08-02 10:33:28 [INFO] [TRAIN] epoch: 2, iter: 70/1000, loss: 0.1220, lr: 0.009377, batch_cost: 0.1393, reader_cost: 0.00931, ips: 28.7197 samples/sec | ETA 00:02:09

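I also tried other custom datasets in several ways (I have not tried COCO-format data yet, only ordinary image segmentation data where the annotations are PNG and the images are JPG) and got the same result. To sum up, I have two suspicions:
1. Older versions of PaddleSeg run the pp_liteseg model without problems, while the new version (the one I git-cloned on August 1) has a problem when reading the images in Annotations. Either a grayscale image is read as a three-channel image, or some image package returns different results on Windows and on Ubuntu.
2. PaddleSeg on Windows is not the same as PaddleSeg on Ubuntu. My tentative guess is that something in the dataset-reading code prevents the configured labels from being matched to the annotation values in the PNG files on Ubuntu.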
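To make the first suspicion concrete: if an annotation is stored as a palette (indexed-color) PNG and a reader expands it to color and then to grayscale, the class indices get replaced by intensity values. The sketch below simulates that with NumPy only. The palette entries and the BT.601 luminance weights are assumptions chosen for illustration, not values read from the optic_disc_seg files, and nothing here confirms that this is what actually happens on the Ubuntu machine.

```python
import numpy as np

# Hypothetical 2-entry palette: index 0 -> black, index 1 -> dark red.
PALETTE = np.array([[0, 0, 0], [128, 0, 0]], dtype=np.uint8)


def expand_palette(indices, palette=PALETTE):
    """Map an (H, W) array of palette indices to an (H, W, 3) RGB array."""
    return palette[indices]


def rgb_to_gray(rgb):
    """Convert RGB to 8-bit grayscale with BT.601 weights (a common convention)."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.rint(rgb.astype(np.float64) @ weights).astype(np.uint8)


indices = np.array([[0, 0, 1], [0, 1, 1]], dtype=np.uint8)  # what training expects
gray = rgb_to_gray(expand_palette(indices))                  # what a wrong read could yield

print(np.unique(indices))  # class indices only
print(np.unique(gray))     # intensities that are no longer valid class indices
```

With this toy palette the foreground class ends up as an intensity far above 1, which is the kind of value the cross-entropy kernel rejects. The specific numbers 89 and 218 from the log would depend on the real palette and the real conversion, which I have not identified.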

Environment

platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python: 3.9.19 (main, Apr 6 2024, 17:57:55) [GCC 11.4.0]
Paddle compiled with cuda: True
NVCC: Build cuda_11.8.r11.8/compiler.31833905_0
cudnn: 8.6
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: NVIDIA GeForce', 'GPU 1: NVIDIA GeForce']
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PaddleSeg: 0.0.0.dev0
PaddlePaddle: 2.6.1
OpenCV: 4.10.0
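The failing Ubuntu setup and the working Windows setup differ in several package versions (PaddleSeg 0.0.0.dev0 vs 2.9.0, PaddlePaddle 2.6.1 vs 2.5.2, OpenCV 4.10.0 vs 4.8.1), so a side-by-side version dump from both machines may help narrow things down. A small sketch using only the standard library; `collect_versions` is a hypothetical helper and the distribution names in the list are guesses that may need adjusting to what `pip list` shows.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages):
    """Return {distribution name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names are assumptions; adjust them per machine.
PACKAGES = ["paddlepaddle-gpu", "paddleseg", "opencv-python", "Pillow", "numpy"]

for name, ver in collect_versions(PACKAGES).items():
    print(f"{name}: {ver}")
```

Running this in both environments and diffing the output would also cover the image libraries (Pillow, NumPy) that the built-in environment report does not list.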

Note: Ubuntu 22.04. PaddleSeg was installed following the tutorial and passes the installation check (sh tests/install/check_predict.sh).

Bug description confirmation

[X] I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

[X] I'd like to help by submitting a PR!
