Open Alex37882388 opened 7 months ago
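Progress update: the problem above was caused by a misconfiguration. Specifically, I could not find clear documentation on how to configure the unet_3plus network for a custom dataset. After consulting mmsegmentation and the READMEs of other models in this project, I found that a configuration file for training on a custom dataset must set train_dataset.type to Dataset rather than the default Cityscapes... A rookie mistake. Here is my complete configuration file: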
batch_size: 1
iters: 500

train_dataset:
  type: Dataset
  dataset_root: "data/dataset"
  train_path: "data/dataset/train.txt"
  num_classes: 2
  transforms:
    - type: ResizeStepScaling
      min_scale_factor: 0.5
      max_scale_factor: 2.0
      scale_step_size: 0.25
    - type: RandomPaddingCrop
      crop_size: [1024, 512]
    - type: RandomHorizontalFlip
    - type: RandomDistort
      brightness_range: 0.4
      contrast_range: 0.4
      saturation_range: 0.4
    - type: Normalize
  mode: train

val_dataset:
  type: Dataset
  dataset_root: "data/dataset"
  val_path: "data/dataset/val.txt"
  num_classes: 2
  transforms:
    - type: Normalize
  mode: val

optimizer:
  type: SGD
  momentum: 0.9
  weight_decay: 4.0e-5

lr_scheduler:
  type: PolynomialDecay
  learning_rate: 0.0025
  end_lr: 0
  power: 0.9

loss:
  types:
    - type: CrossEntropyLoss
  coef: [1]

model:
  type: UNet3Plus
  in_channels: 3
  num_classes: 2
  is_batchnorm: True
  is_deepsup: False
  is_CGM: False
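For anyone else setting up `type: Dataset`, the file lists named by `train_path` and `val_path` hold one image/label pair per line, separated by a space, with paths relative to `dataset_root`. Below is a minimal sketch of generating such a list. The function name `write_file_list`, the `images`/`labels` subdirectory names, and the file extensions are assumptions for illustration only; they are not taken from this issue or from PaddleSeg itself.

```python
from pathlib import Path


def write_file_list(dataset_root, image_dir, label_dir, list_name,
                    image_ext=".jpg", label_ext=".png"):
    """Write '<image> <label>' lines, relative to dataset_root.

    Hypothetical helper: directory names and extensions are assumptions.
    Images without a same-named label file are skipped.
    """
    root = Path(dataset_root)
    lines = []
    for image_path in sorted((root / image_dir).glob(f"*{image_ext}")):
        label_path = root / label_dir / (image_path.stem + label_ext)
        if not label_path.exists():
            continue  # no annotation for this image
        lines.append(
            f"{image_path.relative_to(root).as_posix()} "
            f"{label_path.relative_to(root).as_posix()}"
        )
    text = "\n".join(lines) + ("\n" if lines else "")
    (root / list_name).write_text(text, encoding="utf-8")
    return lines
```

Calling `write_file_list("data/dataset", "images", "labels", "train.txt")` would then produce a list matching the `train_path` used in the configuration above, provided the data is laid out that way.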
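BTW, after working through a series of problems, I still get errors: out of memory in the Windows 11 CPU environment, and out of GPU memory in the Ubuntu 22 RTX 3060 environment. I have already reduced batch_size to the minimum of 1, and the training set has only 10 images and the validation set only 4. What could the problem be? I would really appreciate input from the official team; thanks in advance for any help: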
terminate called after throwing an instance of 'paddle::memory::allocation::BadAlloc'
what():
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 relu_ad_func(paddle::experimental::Tensor const&)
1 paddle::experimental::relu(paddle::experimental::Tensor const&)
2 void phi::ReluKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*)
3 void phi::ActivationGPUImpl<float, phi::GPUContext, phi::funcs::CudaReluFunctor<float> >(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*, phi::funcs::CudaReluFunctor<float> const&)
4 float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5 phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, paddle::experimental::DataType, unsigned long, bool) const
6 phi::DenseTensor::AllocateFrom(phi::Allocator*, paddle::experimental::DataType, unsigned long)
7 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8 paddle::memory::allocation::Allocator::Allocate(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
12 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
13 phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)
----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 640.000000MB memory on GPU 0, 11.235474GB memory has been allocated and available memory is only 510.125000MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is `export FLAGS_use_cuda_managed_memory=false`.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1711101834 (unix time) try "date -d @1711101834" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x67e4a) received by PID 425546 (TID 0x7fa0859ce440) from PID 425546 ***]
Aborted (core dumped)
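One possible reading of the error summary above: the failed 640 MB allocation is exactly the size of a single float32 feature map with 320 channels at the full 1024x512 crop, with batch size 1. The 320-channel figure is an assumption here (five concatenated 64-channel decoder branches, as in the UNet 3+ design); the log does not state it. If that reading is right, the cost is driven by resolution rather than by batch size or dataset size, so a smaller `crop_size` might help where `batch_size: 1` did not. A rough sketch of the arithmetic:

```python
def tensor_mib(batch, channels, height, width, bytes_per_elem=4):
    """Size of one dense tensor in MiB (float32 by default)."""
    return batch * channels * height * width * bytes_per_elem / (1024 ** 2)


# Assumed: one 320-channel activation at the full 1024x512 crop.
full_crop = tensor_mib(1, 320, 512, 1024)

# Same assumed tensor if the crop were halved in each dimension.
half_crop = tensor_mib(1, 320, 256, 512)

print(f"full crop: {full_crop} MiB, half crop: {half_crop} MiB")
```

Training keeps many such activations alive for the backward pass, so the total footprint is a multiple of this per-tensor figure. The sketch covers only a single tensor and is not a full memory model.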
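I ran into the same problem you did. For now I worked around it by reducing the batch size, but training only runs normally at that reduced setting. I am using pp-liteseg on a 4070 GPU with 12 GB of memory.
Sigh, same problem here. I have already set batch_size=1 and still get this error.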
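Search before asking
Describe the Bug
Following the custom dataset guide, I annotated some images and generated train.txt and val.txt, then modified batch_size, iters, and learning_rate based on unet_3plus_cityscapes_1024x512_160k.yml:
I then started the training job with the following command and received an error:
The error is as follows:
My directory structure looks like this:
While debugging, I found that paddleseg/datasets/cityscapes.py hard-codes the paths of the original training images and the annotation labels as leftImg8bit and gtFine, and does not accept the train_path parameter from the configuration file:
After modifying paddleseg/datasets/cityscapes.py to load images via train.txt and restarting training, I hit an out-of-GPU-memory error instead. Reducing the dataset to only 10 images still produced the error:
That is my description of the problem with training the UNet_3plus model. Is there any solution? Many thanks.
Environment
Bug description confirmation
Are you willing to submit a PR?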