PaddlePaddle / PaddleSeg

Easy-to-use image segmentation library with an awesome pre-trained model zoo, supporting a wide range of practical tasks in Semantic Segmentation, Interactive Segmentation, Panoptic Segmentation, Image Matting, 3D Segmentation, etc.
https://arxiv.org/abs/2101.06175
Apache License 2.0

Out of memory error occurs when training UNet_3plus using a custom dataset #3673

Open Alex37882388 opened 7 months ago

Alex37882388 commented 7 months ago

Describe the Bug

Following the custom-dataset documentation, I annotated some data and generated train.txt and val.txt, then, starting from unet_3plus_cityscapes_1024x512_160k.yml, modified batch_size, iters, and learning_rate:

_base_: '../_base_/cityscapes.yml'

batch_size: 1
iters: 500

lr_scheduler:
  learning_rate: 0.0025

model:
  type: UNet3Plus
  in_channels: 3
  num_classes: 19
  is_batchnorm: True
  is_deepsup: False
  is_CGM: False

I then launched training with the following command and got an error:

python -m tools.train --config ./configs/unet_3plus/unet_3plus_cityscapes_1024x512_160k.yml --do_eval --save_interval 50 --save_dir output

The error:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/project/PaddleSeg/tools/train.py", line 213, in <module>
    main(args)
  File "/data/project/PaddleSeg/tools/train.py", line 180, in main
    train_dataset = builder.train_dataset
  File "/data/project/PaddleSeg/paddleseg/utils/utils.py", line 274, in __get__
    val = self.func(obj)
  File "/data/project/PaddleSeg/paddleseg/cvlibs/builder.py", line 264, in train_dataset
    dataset = self.build_component(dataset_cfg)
  File "/data/project/PaddleSeg/paddleseg/cvlibs/builder.py", line 72, in build_component
    raise RuntimeError(
RuntimeError: Tried to create a Cityscapes object, but the operation has failed. Please double check the arguments used to create the object.
The error message is:
The dataset is not Found or the folder structure is nonconfoumance.

My directory structure looks like this:

PaddleSeg
    - data
        - dataset
            - origin
                1.jpg
                2.jpg
            - labels
                1.png
                2.png
            train.txt
            val.txt

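For reference, the generic Dataset loader expects each line of train.txt / val.txt to contain an image path and a label path separated by a single space, relative to dataset_root. A hypothetical example matching the structure above:

origin/1.jpg labels/1.png
origin/2.jpg labels/2.png
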
After some debugging, I found that paddleseg/datasets/cityscapes.py hard-codes the training image and label paths to leftImg8bit and gtFine, and does not accept the train_path parameter from the config file.

After modifying paddleseg/datasets/cityscapes.py to load images via train.txt and restarting training, I then hit an out-of-memory error, which persists even after cutting the dataset down to just 10 images:

    def __init__(self, transforms, dataset_root, mode='train', edge=False):
        self.dataset_root = dataset_root
        self.transforms = Compose(transforms)
        self.file_list = list()
        mode = mode.lower()
        self.mode = mode
        self.num_classes = self.NUM_CLASSES
        self.ignore_index = self.IGNORE_INDEX
        self.edge = edge

        if mode not in ['train', 'val', 'test']:
            raise ValueError(
                "mode should be 'train', 'val' or 'test', but got {}.".format(
                    mode))

        if self.transforms is None:
            raise ValueError("`transforms` is necessary, but it is None.")

        # img_dir = os.path.join(self.dataset_root, 'leftImg8bit')
        # label_dir = os.path.join(self.dataset_root, 'gtFine')
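        # Build the file list from {mode}.txt instead of the hard-coded Cityscapes layout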
        dataset_txt = os.path.join(self.dataset_root, f"{mode}.txt")
        with open(dataset_txt, 'r', encoding='utf-8') as file:
            file_list = file.readlines()

        # Strip the trailing newline from each line
        self.file_list = [line.strip().split(" ") for line in file_list]

        # img_dir = os.path.join(self.dataset_root, 'leftImg8bit')
        # label_dir = os.path.join(self.dataset_root, 'gtFine')
        # if self.dataset_root is None or not os.path.isdir(
        #         self.dataset_root) or not os.path.isdir(
        #             img_dir) or not os.path.isdir(label_dir):
        #     raise ValueError(
        #         "The dataset is not Found or the folder structure is nonconfoumance."
        #     )
        #
        # label_files = sorted(
        #     glob.glob(
        #         os.path.join(label_dir, mode, '*',
        #                      '*_gtFine_labelTrainIds.png')))
        # img_files = sorted(
        #     glob.glob(os.path.join(img_dir, mode, '*', '*_leftImg8bit.png')))
        #
        # self.file_list = [
        #     [img_path, label_path]
        #     for img_path, label_path in zip(img_files, label_files)
        # ]
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   relu_ad_func(paddle::experimental::Tensor const&)
1   paddle::experimental::relu(paddle::experimental::Tensor const&)
2   void phi::ReluKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*)
3   void phi::ActivationGPUImpl<float, phi::GPUContext, phi::funcs::CudaReluFunctor<float> >(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*, phi::funcs::CudaReluFunctor<float> const&)
4   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5   phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, paddle::experimental::DataType, unsigned long, bool) const
6   phi::DenseTensor::AllocateFrom(phi::Allocator*, paddle::experimental::DataType, unsigned long)
7   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8   paddle::memory::allocation::Allocator::Allocate(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
12  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
13  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 640.000000MB memory on GPU 0, 11.487427GB memory has been allocated and available memory is only 252.125000MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is `export FLAGS_use_cuda_managed_memory=false`.
 (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)

----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1710989432 (unix time) try "date -d @1710989432" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x3a2dc) received by PID 238300 (TID 0x7fca47a3d440) from PID 238300 ***]

Aborted (core dumped)


That is the full description of my problem training the UNet_3plus model. Is there any way to solve it? Much appreciated.
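
For reference, the number of training images does not change per-step memory; what matters is the activation footprint of a single batch at the configured crop size. A minimal sketch to check that in isolation, assuming a GPU build of PaddlePaddle ≥ 2.4 and that crop_size [1024, 512] means width 1024 and height 512:

import paddle
from paddleseg.models import UNet3Plus

# One training step at batch 1 and the 1024x512 crop, with no dataset involved;
# if this alone exhausts the card, shrinking the dataset cannot help and the
# crop size is the lever to pull.
model = UNet3Plus(in_channels=3, num_classes=19)
x = paddle.randn([1, 3, 512, 1024])
out = model(x)
loss = (out[0] if isinstance(out, (list, tuple)) else out).mean()
loss.backward()
print(paddle.device.cuda.max_memory_allocated() / 1024**2, "MB peak")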


Alex37882388 commented 7 months ago

An update on my progress: the problem above was caused by a configuration error. Specifically, I couldn't find clear documentation explaining how to configure a custom dataset for the unet_3plus network. After consulting mmsegmentation and the READMEs of other models in this project, I found that a config for training on a custom dataset must set train_dataset.type to Dataset rather than the default Cityscapes... a rookie mistake. Here is my complete config file:

batch_size: 1
iters: 500

train_dataset:
  type: Dataset
  dataset_root: "data/dataset"
  train_path: "data/dataset/train.txt"
  num_classes: 2
  transforms:
    - type: ResizeStepScaling
      min_scale_factor: 0.5
      max_scale_factor: 2.0
      scale_step_size: 0.25
    - type: RandomPaddingCrop
      crop_size: [1024, 512]
    - type: RandomHorizontalFlip
    - type: RandomDistort
      brightness_range: 0.4
      contrast_range: 0.4
      saturation_range: 0.4
    - type: Normalize
  mode: train

val_dataset:
  type: Dataset
  dataset_root: "data/dataset"
  val_path: "data/dataset/val.txt"
  num_classes: 2
  transforms:
    - type: Normalize
  mode: val

optimizer:
  type: SGD
  momentum: 0.9
  weight_decay: 4.0e-5

lr_scheduler:
  type: PolynomialDecay
  learning_rate: 0.0025
  end_lr: 0
  power: 0.9

loss:
  types:
    - type: CrossEntropyLoss
  coef: [1]

model:
  type: UNet3Plus
  in_channels: 3
  num_classes: 2
  is_batchnorm: True
  is_deepsup: False
  is_CGM: False

BTW, after resolving that series of issues I still get errors: on Win11 (CPU) it reports insufficient memory, and on Ubuntu 22 with an RTX 3060 it reports insufficient GPU memory. I've already reduced batch_size to the minimum of 1, and the training set has only 10 images and the validation set only 4. What could be going on? Calling on the official team to swoop in; thanks in advance for help of any kind:

terminate called after throwing an instance of 'paddle::memory::allocation::BadAlloc'
  what():

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   relu_ad_func(paddle::experimental::Tensor const&)
1   paddle::experimental::relu(paddle::experimental::Tensor const&)
2   void phi::ReluKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*)
3   void phi::ActivationGPUImpl<float, phi::GPUContext, phi::funcs::CudaReluFunctor<float> >(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*, phi::funcs::CudaReluFunctor<float> const&)
4   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5   phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, paddle::experimental::DataType, unsigned long, bool) const
6   phi::DenseTensor::AllocateFrom(phi::Allocator*, paddle::experimental::DataType, unsigned long)
7   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8   paddle::memory::allocation::Allocator::Allocate(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
12  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
13  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 640.000000MB memory on GPU 0, 11.235474GB memory has been allocated and available memory is only 510.125000MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is `export FLAGS_use_cuda_managed_memory=false`.
 (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)

----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1711101834 (unix time) try "date -d @1711101834" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x67e4a) received by PID 425546 (TID 0x7fa0859ce440) from PID 425546 ***]

Aborted (core dumped)

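One observation, not an official answer: UNet 3+ fuses full-scale feature maps from every encoder level at every decoder stage, so its activation footprint is large even at batch 1, and with --do_eval the val pipeline above applies only Normalize, which feeds validation images at their original resolution. A sketch to inspect what the val dataset actually yields, assuming the paths from the config above:

from paddleseg.datasets import Dataset
from paddleseg.transforms import Normalize

# With only Normalize, val images keep their original resolution, so
# evaluation can OOM even when the training crop fits in memory.
val_ds = Dataset(
    dataset_root='data/dataset',
    val_path='data/dataset/val.txt',
    num_classes=2,
    transforms=[Normalize()],
    mode='val')
sample = val_ds[0]
# Depending on the PaddleSeg version, a sample is a dict or a tuple.
img = sample['img'] if isinstance(sample, dict) else sample[0]
print(img.shape)  # (3, H, W) at the full image resolution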

onlywl9598 commented 7 months ago

I ran into the same problem you did. For now I've worked around it by turning down the batch size, but it only runs normally once the batch size is small enough. I'm using PP-LiteSeg on a 4070 GPU with 12 GB of VRAM.
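
As a rough scaling check (hypothetical numbers, assuming activation memory scales roughly linearly with batch_size × crop area), at batch 1 the crop size becomes the remaining lever:

# The log above shows ~11.5 GB allocated at batch 1 with a 1024x512 crop;
# halving the crop in each dimension cuts activation memory to about a quarter.
observed_gb = 11.5
for w, h in [(1024, 512), (768, 384), (512, 512), (512, 256)]:
    print(f"crop {w}x{h}: ~{observed_gb * (w * h) / (1024 * 512):.1f} GB")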

2711035086 commented 6 days ago

Ugh, same problem here. I've already set batch_size=1 and it still throws this error.