MengyangPu / EDTER

EDTER: Edge Detection with Transformer, in CVPR 2022
MIT License
268 stars 32 forks

Has anyone managed to run this successfully on their own machine? Could you share how you did it? #32

Open Xie-Muxi-BK opened 1 year ago

Xie-Muxi-BK commented 1 year ago

As the title says...

I followed the author's README exactly, including the project file structure.

Neither the shell scripts under configs nor tools/test.py run successfully.

(screenshot attached)
shuttle999 commented 1 year ago

You need to download his pretrained model and put it in the pretrain folder. That said, I don't know how to train from scratch either. Have you solved it?

shuttle999 commented 1 year ago

Oh, I misread it; the weights in pretrain are just the ViT pretraining weights.

Xie-Muxi-BK commented 1 year ago

I attached the wrong screenshot, but I remember I had already replaced the weights at the time; maybe the right screenshot just wasn't uploaded. I fixed that path issue as soon as I saw the error. Since the BSDS dataset released by the author couldn't be downloaded, I tested with the NYUD dataset instead. On my machine, dist_train.sh fails at the very first step of parsing the configuration arguments. I'm a deep-learning beginner, so I couldn't get any further and gave up.

shuttle999 commented 1 year ago

I trained on the BSDS dataset and it works, so the code should be fine; your environment setup is probably the problem.

eleveneee commented 1 year ago

Hi! I'd like to ask you some questions about EDTER's code. Would that be convenient for you?

yuuy07 commented 1 year ago

Did you manage to get it running?

githublqs commented 1 year ago

Everyone, I downloaded the HED-BSDS dataset and the VOC data. Because my GPU only has 12 GB, I had to reduce the input width/height of the two stages to 160 and 80. Training now runs, but the results are far from those of the pretrained EDTER-BSDS-VOC-StageII.pth, and I changed nothing else. Has anyone run into the same situation? I don't know how to fix it. I'm planning to buy a 3090; I'm not sure whether that would be enough to reproduce the pretrained results.

yuuy07 commented 1 year ago

A 3090 most likely still won't be enough; the model's memory requirements are too high, especially for Stage II. I'd suggest renting GPUs to run it.

githublqs commented 1 year ago

The paper mentions 15-16 GB of GPU memory, so a 3090 should be fine, right? What about two 3090s? That is all the budget I have at the moment.

githublqs commented 1 year ago

A slower speed doesn't matter at all. I've tested it, and EDTER really does give SOTA results, with strong generalization too.

TerryDonneyyds commented 1 year ago

Hello, I have some questions about the multi-scale testing of EDTER. Could you please give me a hand? Greatly appreciated.

MengyangPu commented 1 year ago

Hello. During training, if each GPU processes 1 image of 320x320, each GPU needs about 15 GB; if each GPU processes 4 images of 320x320, each GPU needs roughly 25 GB. With 3090s you may need four cards. When training Stage II, the Stage I parameters are frozen, so the required memory drops slightly.

Xakurain commented 10 months ago

Could you share which versions of the Python packages you are using? I set up the environment following the README, but I still can't get it to run.

hhqweasd commented 10 months ago

@Xakurain This is what my environment looks like after installation, and it runs.

pip list

Package  Version  (Editable project location)
addict 2.4.0
appdirs 1.4.4
certifi 2022.12.7
cityscapesScripts 2.2.1
clip 1.0
coloredlogs 15.0.1
contourpy 1.0.7
cycler 0.11.0
fonttools 4.39.3
ftfy 6.1.1
future 0.18.3
h5py 3.8.0
humanfriendly 10.0
importlib-resources 5.12.0
kiwisolver 1.4.4
matplotlib 3.7.1
mmcv-full 1.2.2
mmsegmentation 0.6.0 (editable: /root/EDTER-main)
numpy 1.24.2
opencv-python 4.7.0.72
packaging 23.0
Pillow 9.5.0
pip 23.0.1
pyparsing 3.0.9
pyquaternion 0.9.9
python-dateutil 2.8.2
PyYAML 6.0
regex 2023.3.23
scipy 1.10.1
setuptools 65.6.3
six 1.16.0
torch 1.6.0+cu101
torchvision 0.7.0+cu101
tqdm 4.65.0
typing 3.7.4.3
wcwidth 0.2.6
wheel 0.38.4
yapf 0.32.0
zipp 3.15.0

conda list

packages in environment at /root/.local/conda/envs/edge:
Name  Version  Build  Channel
_libgcc_mutex 0.1 main https://mirrors.aliyun.com/anaconda/pkgs/main
_openmp_mutex 5.1 1_gnu https://mirrors.aliyun.com/anaconda/pkgs/main
addict 2.4.0 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
ca-certificates 2023.01.10 h06a4308_0 https://mirrors.aliyun.com/anaconda/pkgs/main
certifi 2022.12.7 py38h06a4308_0 https://mirrors.aliyun.com/anaconda/pkgs/main
cityscapesscripts 2.2.1 pypi_0 pypi
clip 1.0 pypi_0 pypi
coloredlogs 15.0.1 pypi_0 pypi
contourpy 1.0.7 pypi_0 pypi
cycler 0.11.0 pypi_0 pypi
fonttools 4.39.3 pypi_0 pypi
ftfy 6.1.1 pypi_0 pypi
future 0.18.3 pypi_0 pypi
h5py 3.8.0 pypi_0 pypi
humanfriendly 10.0 pypi_0 pypi
importlib-resources 5.12.0 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1 https://mirrors.aliyun.com/anaconda/pkgs/main
libffi 3.4.2 h6a678d5_6 https://mirrors.aliyun.com/anaconda/pkgs/main
libgcc-ng 11.2.0 h1234567_1 https://mirrors.aliyun.com/anaconda/pkgs/main
libgomp 11.2.0 h1234567_1 https://mirrors.aliyun.com/anaconda/pkgs/main
libstdcxx-ng 11.2.0 h1234567_1 https://mirrors.aliyun.com/anaconda/pkgs/main
matplotlib 3.7.1 pypi_0 pypi
mmcv-full 1.2.2 pypi_0 pypi
mmsegmentation 0.6.0 dev_0
ncurses 6.4 h6a678d5_0 https://mirrors.aliyun.com/anaconda/pkgs/main
numpy 1.24.2 pypi_0 pypi
opencv-python 4.7.0.72 pypi_0 pypi
openssl 1.1.1t h7f8727e_0 https://mirrors.aliyun.com/anaconda/pkgs/main
packaging 23.0 pypi_0 pypi
pillow 9.5.0 pypi_0 pypi
pip 23.0.1 py38h06a4308_0 https://mirrors.aliyun.com/anaconda/pkgs/main
pyparsing 3.0.9 pypi_0 pypi
pyquaternion 0.9.9 pypi_0 pypi
python 3.8.16 h7a1cb2a_3 https://mirrors.aliyun.com/anaconda/pkgs/main
python-dateutil 2.8.2 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.2 h5eee18b_0 https://mirrors.aliyun.com/anaconda/pkgs/main
regex 2023.3.23 pypi_0 pypi
scipy 1.10.1 pypi_0 pypi
setuptools 65.6.3 py38h06a4308_0 https://mirrors.aliyun.com/anaconda/pkgs/main
six 1.16.0 pypi_0 pypi
sqlite 3.41.1 h5eee18b_0 https://mirrors.aliyun.com/anaconda/pkgs/main
tk 8.6.12 h1ccaba5_0 https://mirrors.aliyun.com/anaconda/pkgs/main
torch 1.6.0+cu101 pypi_0 pypi
torchvision 0.7.0+cu101 pypi_0 pypi
tqdm 4.65.0 pypi_0 pypi
typing 3.7.4.3 pypi_0 pypi
wcwidth 0.2.6 pypi_0 pypi
wheel 0.38.4 py38h06a4308_0 https://mirrors.aliyun.com/anaconda/pkgs/main
xz 5.2.10 h5eee18b_1 https://mirrors.aliyun.com/anaconda/pkgs/main
yapf 0.32.0 pypi_0 pypi
zipp 3.15.0 pypi_0 pypi
zlib 1.2.13 h5eee18b_0 https://mirrors.aliyun.com/anaconda/pkgs/main

Snailgoo commented 9 months ago

Has anyone run into this problem:
tcp_store = TCPStore(hostname, port, world_size, False, timeout)
RuntimeError: connect() timed out. Original timeout was 1800000 ms.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 289) of binary

MengyangPu commented 9 months ago

From your description this looks like a distributed-training or network problem. In lines 20-21 of tools/train.py we set os.environ['MASTER_ADDR']='127.0.0.4' and os.environ['MASTER_PORT']='29504' by default; please confirm that this address and port are not already in use. https://github.com/MengyangPu/EDTER/blob/fc3e182c267653831c49b7ae6a06c04cebc858fd/tools/train.py#L20C2-L21C2
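
For reference, a minimal sketch (not part of the repository) of how one could verify that the default rendezvous address and port quoted above are actually bindable on the training machine before launching; the values mirror the defaults in tools/train.py, and everything else here is illustrative.

```python
import socket

def can_bind(addr: str, port: int) -> bool:
    """Return True if (addr, port) is free, i.e. the TCPStore rendezvous could bind to it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((addr, port))
            return True
        except OSError:
            return False

# Defaults set in tools/train.py (lines 20-21); pick another port if this prints False.
print(can_bind('127.0.0.4', 29504))
```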

MengyangPu commented 9 months ago

Hope this helps.

Snailgoo commented 9 months ago

Hello, is it possible to train with just a single GPU? I set the training command to bash ./tools/dist_train.sh configs/bsds/EDTER_BIMLA_320x320_80k_bsds_bs_8.py 1 --gpu-ids 1 to use only GPU 1, but I still hit the same problem, even after switching to other addresses and ports.

MengyangPu commented 9 months ago

Hello, if you do not want distributed training, please set launcher to none (no distributed training) here: https://github.com/MengyangPu/EDTER/blob/3df1a182a095fe1f52a55695d7bd7ac727641cab/tools/train.py#L57-L61
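
For context, the launcher switch referenced above typically follows the standard mmcv/mmseg pattern sketched below; this is a simplified illustration, not the exact code in tools/train.py.

```python
from mmcv.runner import init_dist

def setup_launcher(args, cfg):
    """Decide between plain single-process training and a distributed launch."""
    if args.launcher == 'none':
        # No process group is created; training runs in a single process.
        return False
    # e.g. '--launcher pytorch', which is what dist_train.sh passes.
    init_dist(args.launcher, **cfg.dist_params)
    return True
```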

Snailgoo commented 9 months ago

Great, thank you very much for the patient guidance; training on the BSDS dataset now runs. One more question: I built my own dataset in the BSDS format, but the test split only has images and no .mat files, and training fails with:
  File "/EDTER/mmseg/datasets/builder.py", line 73, in build_dataset
    dataset = build_from_cfg(cfg, DATASETS, default_args)
  File "/lib/python3.9/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
    raise type(e)(f'{obj_cls.name}: {e}')
ValueError: BSDSDataset: not enough values to unpack (expected 2, got 1)
Are the .mat files mandatory, and how are they generated? If I don't generate them, what do I need to change to get the code running?

MengyangPu commented 9 months ago

Hello, the .mat files are not strictly required; the column is only a placeholder. As long as test.txt has such a column it will work; the data is never actually read.
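
To illustrate the placeholder column, here is a small, hypothetical helper (the directory layout and file names are made up, not the repository's own tooling) that writes a BSDS-style test.txt in which the second column is a dummy .mat entry that is never read:

```python
import os

def write_test_list(image_dir: str, list_path: str, prefix: str = 'test') -> None:
    """Write one 'image_path placeholder.mat' pair per line so each line splits into two fields."""
    with open(list_path, 'w') as f:
        for name in sorted(os.listdir(image_dir)):
            if name.lower().endswith(('.jpg', '.png')):
                f.write(f'{prefix}/{name} {prefix}/placeholder.mat\n')

# Hypothetical usage:
# write_test_list('data/my_dataset/test', 'data/my_dataset/test.txt')
```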

Snailgoo commented 9 months ago

Hello, thanks for your reply! Following that format, I only added a .mat column to test.txt, and training starts, but at the end of 10000 iterations it fails with: FileNotFoundError: [Errno 2] No such file or directory: '/test/10000.mat'

MengyangPu commented 9 months ago

Could you show exactly which line of code raises the error?

Snailgoo commented 9 months ago

Hello, here is the exact error:
2023-11-23 01:26:30,391 - mmseg - INFO - Iter [9980/80000] lr: 8.881e-07, eta: 8:02:39, time: 0.363, data_time: 0.002, memory: 29987, decode.loss_seg: 199.0960, aux_0.loss_seg: 89.2975, aux_1.loss_seg: 84.7263, aux_2.loss_seg: 84.0529, aux_3.loss_seg: 83.4825, aux_4.loss_seg: 83.7665, aux_5.loss_seg: 84.4057, aux_6.loss_seg: 85.0462, aux_7.loss_seg: 85.2248, loss: 879.0984
2023-11-23 01:26:37,711 - mmseg - INFO - Saving checkpoint at 10000 iterations
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 148/148, 6.9 task/s, elapsed: 21s, ETA: 0s
Traceback (most recent call last):
  File "/home/EDTER/./tools/train.py", line 168, in <module>
    main()
  File "/home/EDTER/./tools/train.py", line 157, in main
    train_segmentor(
  File "/home/EDTER/mmseg/apis/train.py", line 108, in train_segmentor
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 67, in train
    self.call_hook('after_train_iter')
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/EDTER/mmseg/core/evaluation/eval_hooks.py", line 30, in after_train_iter
    self.evaluate(runner, results)
  File "/home/EDTER/mmseg/core/evaluation/eval_hooks.py", line 34, in evaluate
    eval_res = self.dataloader.dataset.evaluate(
  File "/home/EDTER/mmseg/datasets/custom.py", line 335, in evaluate
    gt_seg_maps = self.get_gt_seg_maps()
  File "/home/EDTER/mmseg/datasets/custom.py", line 235, in get_gt_seg_maps
    gt_seg_map = mmcv.imread(
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/image/io.py", line 203, in imread
    img_bytes = file_client.get(img_or_path)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/fileio/file_client.py", line 993, in get
    return self.client.get(filepath)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/fileio/file_client.py", line 518, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/EDTER/data/f_v0/test/10000.mat'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3467) of binary: /home/edgedetection/bin/python
Traceback (most recent call last):
  File "/home/edgedetection/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/edgedetection/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ./tools/train.py FAILED

MengyangPu commented 9 months ago

Please comment out this line: https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/core/evaluation/eval_hooks.py#L30 Since I train with the distributed setup myself, I never checked the non-distributed path. Thanks for the details and the feedback.
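
From the traceback above, line 30 of mmseg/core/evaluation/eval_hooks.py is the self.evaluate(runner, results) call inside after_train_iter. A minimal stand-in hook, sketched below, only shows the shape of the suggested change; the real class has more logic, and only the disabled call matters here.

```python
from mmcv.runner import Hook

class NoEvalHook(Hook):
    """Skeleton illustrating the suggested edit: skip per-interval evaluation during training."""

    def __init__(self, interval=10000):
        self.interval = interval

    def after_train_iter(self, runner):
        if not self.every_n_iters(runner, self.interval):
            return
        # self.evaluate(runner, results)  # <- the line suggested above to comment out,
        #                                 #    since the non-distributed evaluate path
        #                                 #    tries to read ground-truth files from disk
```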

Snailgoo commented 9 months ago

Hello, after commenting it out, it still fails at iteration 10000, as follows:
2023-11-23 04:28:09,580 - mmseg - INFO - Saving checkpoint at 10000 iterations
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 148/148, 6.8 task/s, elapsed: 22s, ETA: 0s
2023-11-23 04:28:40,210 - mmseg - INFO - Exp name: EDTER_BIMLA_320x320_80k_bsds_bs_8.py
Traceback (most recent call last):
  File "/home/EDTER/./tools/train.py", line 168, in <module>
    main()
  File "/home/EDTER/./tools/train.py", line 157, in main
    train_segmentor(
  File "/home/EDTER/mmseg/apis/train.py", line 108, in train_segmentor
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 67, in train
    self.call_hook('after_train_iter')
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/base.py", line 153, in after_train_iter
    self.log(runner)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 234, in log
    self._log_info(log_dict, runner)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 154, in _log_info
    f'data_time: {log_dict["data_time"]:.3f}, '
KeyError: 'data_time'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 111) of binary: /home/jovyan/extra_lib/edgedetection/bin/python
Traceback (most recent call last):
  File "/home/edgedetection/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/edgedetection/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ./tools/train.py FAILED

MengyangPu commented 9 months ago

Suggested solutions:
1. Please try distributed training again.
2. If you insist on not using distributed training: at test time the code calls this function, https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/apis/test.py#L16, which is not suitable for testing EDTER. You can modify it by referring to the test function used in the distributed path: https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/apis/test.py#L73
In addition, you also need to modify this function: https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/core/evaluation/eval_hooks.py#L23-L29 referring to its distributed counterpart: https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/core/evaluation/eval_hooks.py#L68-L81

Snailgoo commented 9 months ago

Hello, thank you for the reply and the guidance. I retrained using the distributed setup and it runs, but I still hit a problem; the error is below (I will also try the single-GPU code changes you suggested later):
2023-11-23 08:19:35,045 - mmseg - INFO - Saving checkpoint at 10000 iterations
/home/EDTER/work_dirs/EDTER_BIMLA_320x320_80k_bsds_bs_8
/home/EDTER/work_dirs/EDTER_BIMLA_320x320_80k_bsds_bs_8/10000/mat
/home/EDTER/work_dirs/EDTER_BIMLA_320x320_80k_bsds_bs_8/10000/png
/home/EDTER/work_dirs/EDTER_BIMLA_320x320_80k_bsds_bs_8
/home/EDTER/work_dirs/EDTER_BIMLA_320x320_80k_bsds_bs_8/10000/mat
/home/EDTER/work_dirs/EDTER_BIMLA_320x320_80k_bsds_bs_8/10000/png
[>>>>>>>>>>>>>>>>>>>>>>>                           ] 70/148, 5.7 task/s, elapsed: 12s, ETA: 14s
20.450742483139038
Traceback (most recent call last):
  File "/home/EDTER/./tools/train.py", line 168, in <module>
    main()
  File "/home/EDTER/./tools/train.py", line 157, in main
    train_segmentor(
  File "/home/EDTER/mmseg/apis/train.py", line 108, in train_segmentor
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 67, in train
    self.call_hook('after_train_iter')
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/base.py", line 153, in after_train_iter
    self.log(runner)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 234, in log
    self._log_info(log_dict, runner)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 154, in _log_info
    f'data_time: {log_dict["data_time"]:.3f}, '
KeyError: 'data_time'
[>>>>>>>>>>>>>>>>>>>>>>>>>>>                       ] 82/148, 6.0 task/s, elapsed: 14s, ETA: 11s
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4189 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 4190) of binary: /home/jovyan/extra_lib/edgedetection/bin/python
Traceback (most recent call last):
  File "/home/edgedetection/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/edgedetection/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ./tools/train.py FAILED

MengyangPu commented 9 months ago

File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 154, in _log_info f'data_time: {log_dict["data_time"]:.3f}, ' KeyError: 'data_time'

Hi, from your description this looks like an mmcv problem: File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 154, in _log_info f'data_time: {log_dict["data_time"]:.3f}, ' KeyError: 'data_time'. I'm sorry, I have not run into a similar problem; my environment is python=3.7 and mmcv-full==1.2.2. I found a few similar issues and answers that I hope will help: https://github.com/open-mmlab/mmsegmentation/issues/1502 and https://github.com/lhoyer/DAFormer/issues/7

Snailgoo commented 9 months ago

Hello, thanks for the suggestions. After making the changes, the global-stage training now runs through, but training the local model fails; could you help me figure out what the problem is? Thanks:
  File "/home/EDTER/mmseg/models/decode_heads/local8x8_fuse_head.py", line 36, in forward
    fuse_features = local_features * (scale+1) +shift
RuntimeError: The size of tensor a (624) must match the size of tensor b (625) at non-singleton dimension 3
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2981545) of binary: /home/edgedetection/bin/python

MengyangPu commented 9 months ago

@Snailgoo Hi, I suggest debugging it: set a breakpoint at "/home/EDTER/mmseg/models/decode_heads/local8x8_fuse_head.py", line 36, and check whether the shapes of the individual feature maps match.

Snailgoo commented 9 months ago

Hello, I printed =====local_features=====global_features=====. During training they are torch.Size([4, 128, 320, 320]) and torch.Size([4, 128, 320, 320]); but at checkpoint validation they become torch.Size([1, 128, 456, 624]) and torch.Size([1, 128, 461, 625]). That is, global_features becomes 456, 624, which no longer matches local_features at 461, 625.

MengyangPu commented 9 months ago

@Snailgoo Please confirm whether validation uses test_cfg = dict(mode='slide', crop_size=(320, 320), stride=(280, 280)), and whether the input image size is a multiple of 10, since the decoder is built from deconvolutions. Please keep debugging and step through this function: https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/models/segmentors/encoder_decoder_local8x8.py#L320
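
As a quick way to check the size constraint mentioned above, here is a small sketch that makes both spatial dimensions of a tensor multiples of 10 before slide inference; it assumes padding the input on the bottom/right is acceptable preprocessing for your data, and it is not code from the repository.

```python
import math
import torch
import torch.nn.functional as F

def pad_to_multiple(img: torch.Tensor, multiple: int = 10) -> torch.Tensor:
    """Pad an (N, C, H, W) tensor on the bottom/right so H and W become multiples of `multiple`."""
    h, w = img.shape[-2:]
    new_h = math.ceil(h / multiple) * multiple
    new_w = math.ceil(w / multiple) * multiple
    # F.pad takes (left, right, top, bottom) for the last two dimensions.
    return F.pad(img, (0, new_w - w, 0, new_h - h), mode='replicate')

x = torch.rand(1, 3, 461, 625)    # the validation size reported above
print(pad_to_multiple(x).shape)   # torch.Size([1, 3, 470, 630])
```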

RunyuZhu commented 2 months ago

Sorry to bother you. I went through the Stage I training workflow in your README, hoping to train the model myself and reproduce your work, but running the dist_train.sh script fails with:
ValueError: Unsupported nproc_per_node value: configs/bsds/EDTER_BIMLA_320x320_80k_bsds_aug_bs_8.py
Command used:
cd EDTER
bash ./tools/dist_train.sh configs/bsds/EDTER_BIMLA_320x320_80k_bsds_bs_8.py 1
I set the number of GPUs to 1 because I only have a single card. For this problem, do I need to change the command in dist_train.sh so that training can run normally? Looking forward to your reply, thanks.

hhqweasd commented 2 months ago

You generally need to adapt the code a bit to your own hardware. The publicly released code should be the multi-GPU training code; you could spend some time looking into the relevant material and try to solve this yourself.
