Open Xie-Muxi-BK opened 1 year ago
You need to download his pretrained model and put it in the pretrain folder, but I don't know how to train from scratch either. Have you figured it out?
Oh, I misread; the weights in pretrain are the pretrained ViT weights.
That was the wrong screenshot, but I remember I did change it at the time; maybe the change just wasn't uploaded. I fixed that path problem as soon as I saw the error. Because the BSDS dataset released by the author couldn't be downloaded, I tested with the NYUD dataset instead. On my side, dist_train.sh errored out at the very first step of the configuration arguments; I'm a deep-learning beginner, so I couldn't do anything about it and gave up.
I trained with the BSDS dataset and it trains fine, so the code should be OK; your environment setup is probably the problem.
Hi! I'd like to ask you some questions about EDTER's code. Would that be convenient for you?
Did you get it to run successfully?
Everyone: I downloaded the HED-BSDS dataset and the VOC data. Since my card only has 12 GB, I had to reduce the input height/width of the two stages to 160 and 80. Training runs now, but the results are far worse than those of the pretrained EDTER-BSDS-VOC-StageII.pth, and I didn't change anything else. Has anyone hit the same situation, and how did you solve it? I'm also thinking of buying a 3090, but I'm not sure it could reproduce the pretrained results.
A 3090 probably still won't be enough; the model's VRAM requirements are too high, especially Stage II. I'd suggest renting GPUs.
The paper mentions 15-16 GB of VRAM, so a 3090 should work, right? What about two 3090s? That's all my budget covers at the moment.
A bit slower is fine with me. I've tested it, and EDTER really does give SOTA results, with strong generalization too.
Hello, I have some questions about the multi-scale testing of EDTER. Would you please give me a hand? It would be extremely appreciated.
Hello. During training, if each GPU processes 1 image of 320x320, each GPU needs 15 GB; with 4 images of 320x320 per GPU, each GPU needs about 25 GB. With 3090s, you may need four cards. When training Stage II, the Stage I parameters are frozen, so the required VRAM drops slightly.
Which versions of the Python packages are you all using? I configured the environment following the README, but so far I can't get it to run.
@Xakurain Here is my environment after installation; it runs:
Package | Version | Editable project location
addict 2.4.0
appdirs 1.4.4
certifi 2022.12.7
cityscapesScripts 2.2.1
clip 1.0
coloredlogs 15.0.1
contourpy 1.0.7
cycler 0.11.0
fonttools 4.39.3
ftfy 6.1.1
future 0.18.3
h5py 3.8.0
humanfriendly 10.0
importlib-resources 5.12.0
kiwisolver 1.4.4
matplotlib 3.7.1
mmcv-full 1.2.2
mmsegmentation 0.6.0 /root/EDTER-main
numpy 1.24.2
opencv-python 4.7.0.72
packaging 23.0
Pillow 9.5.0
pip 23.0.1
pyparsing 3.0.9
pyquaternion 0.9.9
python-dateutil 2.8.2
PyYAML 6.0
regex 2023.3.23
scipy 1.10.1
setuptools 65.6.3
six 1.16.0
torch 1.6.0+cu101
torchvision 0.7.0+cu101
tqdm 4.65.0
typing 3.7.4.3
wcwidth 0.2.6
wheel 0.38.4
yapf 0.32.0
zipp 3.15.0
packages in environment at /root/.local/conda/envs/edge:
Name | Version | Build Channel
_libgcc_mutex 0.1 main https://mirrors.aliyun.com/anaconda/pkgs/main
_openmp_mutex 5.1 1_gnu https://mirrors.aliyun.com/anaconda/pkgs/main
addict 2.4.0 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
ca-certificates 2023.01.10 h06a4308_0 https://mirrors.aliyun.com/anaconda/pkgs/main
certifi 2022.12.7 py38h06a4308_0 https://mirrors.aliyun.com/anaconda/pkgs/main
cityscapesscripts 2.2.1 pypi_0 pypi
clip 1.0 pypi_0 pypi
coloredlogs 15.0.1 pypi_0 pypi
contourpy 1.0.7 pypi_0 pypi
cycler 0.11.0 pypi_0 pypi
fonttools 4.39.3 pypi_0 pypi
ftfy 6.1.1 pypi_0 pypi
future 0.18.3 pypi_0 pypi
h5py 3.8.0 pypi_0 pypi
humanfriendly 10.0 pypi_0 pypi
importlib-resources 5.12.0 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1 https://mirrors.aliyun.com/anaconda/pkgs/main
libffi 3.4.2 h6a678d5_6 https://mirrors.aliyun.com/anaconda/pkgs/main
libgcc-ng 11.2.0 h1234567_1 https://mirrors.aliyun.com/anaconda/pkgs/main
libgomp 11.2.0 h1234567_1 https://mirrors.aliyun.com/anaconda/pkgs/main
libstdcxx-ng 11.2.0 h1234567_1 https://mirrors.aliyun.com/anaconda/pkgs/main
matplotlib 3.7.1 pypi_0 pypi
mmcv-full 1.2.2 pypi_0 pypi
mmsegmentation 0.6.0 dev_0
Has anyone run into this problem?
tcp_store = TCPStore(hostname, port, world_size, False, timeout)
RuntimeError: connect() timed out. Original timeout was 1800000 ms.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 289) of binary
From your description, this looks like a distributed-training or network problem. In tools/train.py, lines 20-21, we set by default:
os.environ['MASTER_ADDR'] = '127.0.0.4'
os.environ['MASTER_PORT'] = '29504'
Please confirm that this address and port are not already in use.
https://github.com/MengyangPu/EDTER/blob/fc3e182c267653831c49b7ae6a06c04cebc858fd/tools/train.py#L20C2-L21C2
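A quick way to check whether a port is already in use is to try binding it; this is just a minimal sketch (it binds 127.0.0.1 for portability, since whether secondary loopback addresses such as 127.0.0.4 are bindable is OS-dependent):

```python
import socket

# Try to bind the address/port pair that tools/train.py hard-codes for the
# distributed store; if bind() raises, something else already holds the port.
def port_is_free(addr: str, port: int) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((addr, port))
            return True
        except OSError:
            return False

print(port_is_free('127.0.0.1', 29504))
```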
I hope this helps.
Hello, is it possible to train on a single GPU? I set the training command to bash ./tools/dist_train.sh configs/bsds/EDTER_BIMLA_320x320_80k_bsds_bs_8.py 1 --gpu-ids 1 to use only GPU 1, but I still hit the same problem, even after changing the address and port to other values.
Hello, if you don't want distributed training, please set launcher to none (no distributed training), here: https://github.com/MengyangPu/EDTER/blob/3df1a182a095fe1f52a55695d7bd7ac727641cab/tools/train.py#L57-L61
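For reference, a minimal sketch of what that launcher switch typically looks like; the names follow the standard mmsegmentation tools/train.py template and are assumptions, not the exact EDTER code:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description='Train a segmentor')
    # 'none' skips distributed setup entirely; the other choices call init_dist()
    parser.add_argument('--launcher',
                        choices=['none', 'pytorch', 'slurm', 'mpi'],
                        default='none', help='job launcher')
    return parser.parse_args(argv)

args = parse_args([])                    # no flags given, so launcher == 'none'
distributed = args.launcher != 'none'    # with 'none', init_dist() is never called
print(distributed)
```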
Great, thank you very much for the patient guidance; training on the BSDS dataset runs now. One more question: I built my own dataset in the BSDS format, but the test split only has images and no .mat files, and training fails with:
File "/EDTER/mmseg/datasets/builder.py", line 73, in build_dataset
    dataset = build_from_cfg(cfg, DATASETS, default_args)
File "/lib/python3.9/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
ValueError: BSDSDataset: not enough values to unpack (expected 2, got 1)
Are the .mat files mandatory, and how are they generated? If I don't generate them, what do I need to change to get the code to run?
Hello, the .mat file isn't actually required; it's only a placeholder. As long as test.txt contains that column, the file is never really read.
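As an illustration, a test.txt for a label-free test split could be generated like this; this is a sketch under the assumption that each line is "<image path> <annotation path>" separated by a space, and the placeholder .mat name and directories are made up, never opened:

```python
from pathlib import Path

# Hypothetical image names for a custom test split.
image_names = ['0001.jpg', '0002.jpg', '0003.jpg']

# The second column is only parsed, never read, so any .mat path will do.
lines = [f'imgs/test/{name} groundTruth/test/placeholder.mat'
         for name in image_names]
Path('test.txt').write_text('\n'.join(lines) + '\n')
print(Path('test.txt').read_text())
```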
Hello, thank you for the reply! Following that format, I only added a .mat column to test.txt, and training starts, but at the end of 10000 iterations it fails with: FileNotFoundError: [Errno 2] No such file or directory: '/test/10000.mat'
Could you show exactly which line of code raises the error?
Hello, the full error is as follows:

2023-11-23 01:26:30,391 - mmseg - INFO - Iter [9980/80000] lr: 8.881e-07, eta: 8:02:39, time: 0.363, data_time: 0.002, memory: 29987, decode.loss_seg: 199.0960, aux_0.loss_seg: 89.2975, aux_1.loss_seg: 84.7263, aux_2.loss_seg: 84.0529, aux_3.loss_seg: 83.4825, aux_4.loss_seg: 83.7665, aux_5.loss_seg: 84.4057, aux_6.loss_seg: 85.0462, aux_7.loss_seg: 85.2248, loss: 879.0984
2023-11-23 01:26:37,711 - mmseg - INFO - Saving checkpoint at 10000 iterations
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 148/148, 6.9 task/s, elapsed: 21s, ETA: 0s
Traceback (most recent call last):
  File "/home/EDTER/./tools/train.py", line 168, in <module>
    main()
  File "/home/EDTER/./tools/train.py", line 157, in main
    train_segmentor(
  File "/home/EDTER/mmseg/apis/train.py", line 108, in train_segmentor
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 67, in train
    self.call_hook('after_train_iter')
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/EDTER/mmseg/core/evaluation/eval_hooks.py", line 30, in after_train_iter
    self.evaluate(runner, results)
  File "/home/EDTER/mmseg/core/evaluation/eval_hooks.py", line 34, in evaluate
    eval_res = self.dataloader.dataset.evaluate(
  File "/home/EDTER/mmseg/datasets/custom.py", line 335, in evaluate
    gt_seg_maps = self.get_gt_seg_maps()
  File "/home/EDTER/mmseg/datasets/custom.py", line 235, in get_gt_seg_maps
    gt_seg_map = mmcv.imread(
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/image/io.py", line 203, in imread
    img_bytes = file_client.get(img_or_path)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/fileio/file_client.py", line 993, in get
    return self.client.get(filepath)
  File "/home/edgedetection/lib/python3.9/site-packages/mmcv/fileio/file_client.py", line 518, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/EDTER/data/f_v0/test/10000.mat'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3467) of binary: /home/edgedetection/bin/python
Traceback (most recent call last):
  File "/home/edgedetection/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/edgedetection/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/edgedetection/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./tools/train.py FAILED
Please comment out this line: https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/core/evaluation/eval_hooks.py#L30 I trained with the distributed path myself, so I never checked the non-distributed flow. Thanks for the details and the feedback.
Solutions:
1. Please still try to use distributed training.
2. If you insist on not using it: at test time, this function is called:
https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/apis/test.py#L16
and it is not suitable for testing EDTER. You can modify it with reference to the test function called in the distributed path:
https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/apis/test.py#L73
In addition, you also need to modify this function:
https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/core/evaluation/eval_hooks.py#L23-L29
with reference to its distributed counterpart:
https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/core/evaluation/eval_hooks.py#L68-L81
File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 154, in _log_info
    f'data_time: {log_dict["data_time"]:.3f}, '
KeyError: 'data_time'
Hi, from your description this looks like an mmcv problem:
File "/home/edgedetection/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 154, in _log_info
    f'data_time: {log_dict["data_time"]:.3f}, '
KeyError: 'data_time'
I'm very sorry, I haven't run into a similar problem myself; my environment is python=3.7 and mmcv-full==1.2.2. I found a few similar issues and answers that I hope will help: https://github.com/open-mmlab/mmsegmentation/issues/1502 and https://github.com/lhoyer/DAFormer/issues/7.
Hello, thanks for the suggestions. After modifying as advised, the global-stage training now runs, but training the local model fails. Could you help me see what the problem is? Thanks: File "/home/EDTER/mmseg/models/decode_heads/local8x8_fuse_head.py", line 36, in forward fuse_features = local_features * (scale+1) + shift RuntimeError: The size of tensor a (624) must match the size of tensor b (625) at non-singleton dimension 3 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2981545) of binary: /home/edgedetection/bin/python
@Snailgoo Hi, I suggest using a debugger: stop at "/home/EDTER/mmseg/models/decode_heads/local8x8_fuse_head.py", line 36, and check whether the shapes of the features involved are consistent.
Hello, I printed local_features and global_features. During training both are torch.Size([4, 128, 320, 320]); but during checkpoint validation they become torch.Size([1, 128, 456, 624]) and torch.Size([1, 128, 461, 625]), that is, global_features is 456x624 while local_features is 461x625, so they no longer match.
@Snailgoo Please confirm that validation uses test_cfg = dict(mode='slide', crop_size=(320, 320), stride=(280, 280)), and that the input image height and width are multiples of 10, since the decoder is built from deconvolutions. Please continue in debug mode and inspect this function: https://github.com/MengyangPu/EDTER/blob/3fe76f3d938206ef9dc8b857a9767b8cd3d28fc7/mmseg/models/segmentors/encoder_decoder_local8x8.py#L320
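Since a deconvolution-based decoder only reproduces sizes aligned to its upsampling factor, one way to avoid this kind of off-by-a-few mismatch is to pad the input so both sides are multiples of the required factor (10 here, per the reply above). A small sketch of the padding arithmetic; the function name is illustrative, not part of EDTER:

```python
def pad_to_multiple(h, w, multiple=10):
    """Return (new_h, new_w, pad_h, pad_w): the smallest dimensions >= (h, w)
    that are exact multiples of `multiple`, plus the padding amounts needed.
    With inputs padded this way, a decoder whose deconvolutions upsample by
    factors of `multiple` reproduces the padded size exactly, so the local
    and global feature maps keep identical shapes."""
    new_h = -(-h // multiple) * multiple  # ceiling division via floor of negatives
    new_w = -(-w // multiple) * multiple
    return new_h, new_w, new_h - h, new_w - w

# For the failing case above: a 461x625 input would be padded to 470x630
# (9 rows and 5 columns of padding), after which both branches agree.
print(pad_to_multiple(461, 625))
```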
The model's GPU memory requirements are very high, especially for Stage II; I suggest renting GPUs.
Slower speed is fine. From my own tests, EDTER really is a SOTA result, and it generalizes well.
Hello, during training: with 1 image of 320x320 per GPU, each GPU needs about 15 GB; with 4 images of 320x320 per GPU, each GPU needs about 25 GB. With 3090s you may need four cards. When training Stage II, the Stage I parameters are frozen, so the required memory drops slightly.
Sorry to bother you. I followed the Stage I training procedure in your README, hoping to reproduce your work, but running the dist_train.sh script fails with the following error: ValueError: Unsupported nproc_per_node value: configs/bsds/EDTER_BIMLA_320x320_80k_bsds_aug_bs_8.py. Command used: cd EDTER; bash ./tools/dist_train.sh configs/bsds/EDTER_BIMLA_320x320_80k_bsds_bs_8.py 1. I set the GPU count to 1 because I only have a single card. Do I need to change the command in dist_train.sh so that training can proceed? Hoping for your reply, thanks.
Generally you need to adapt the code to your own hardware. The released code is for multi-GPU training; please spend some time looking up the relevant material and try to solve this yourself.
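The error above means the config path reached --nproc_per_node, which expects an integer, so the two positional arguments were likely swapped somewhere; check the argument order inside EDTER's own dist_train.sh. As a reference, a hedged sketch of the mmsegmentation-style wrapper around torch.distributed.launch; build_launch_cmd is a hypothetical helper, not part of EDTER's script:

```shell
# Sketch (not EDTER's verbatim script) of the usual mmseg-style dist_train.sh logic.
# The GPU count, not the config path, must be the value handed to --nproc_per_node.
build_launch_cmd() {
    config=$1
    gpus=$2
    case "$gpus" in
        ''|*[!0-9]*) echo "error: GPUS must be an integer, got '$gpus'" >&2; return 1;;
    esac
    echo "python -m torch.distributed.launch --nproc_per_node=$gpus ./tools/train.py $config --launcher pytorch"
}

# Single-card example: config first, GPU count second.
build_launch_cmd configs/bsds/EDTER_BIMLA_320x320_80k_bsds_bs_8.py 1
```

If your local script prints "Unsupported nproc_per_node value: configs/...", the quickest check is to echo the launch line it builds and confirm an integer follows --nproc_per_node.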
As the title says...
I strictly followed the author's README for testing, including the project file structure.
Neither the shell scripts under configs nor tools/test.py runs successfully.