cycler2r code environment issue

Chegva commented 1 year ago

Which Python version are you using? Whenever I run train.py I keep getting environment errors.

Chegva commented 1 year ago

        dict(type='RAWNormalize', blc=0, saturate=1024, key='img_raw'),
        dict(
            type='LoadImageFromFile',
            io_backend='disk',
            key='img_rgb',
            flag='color'),
        dict(
            type='Resize',
            keys=['img_raw', 'img_rgb'],
            scale=(1280, 960),
            interpolation='bicubic'),
        dict(type='RescaleToZeroOne', keys=['img_rgb']),
        dict(
            type='Normalize',
            keys=['img_rgb'],
            to_rgb=True,
            mean=[0, 0, 0],
            std=[1, 1, 1]),
        dict(type='ImageToTensor', keys=['img_raw', 'img_rgb']),
        dict(
            type='Collect',
            keys=['img_raw', 'img_rgb'],
            meta_keys=['img_raw_path', 'img_rgb_path'])
    ]))

DATASET = 'bdd100k'
exp_name = 'unpaired_cycler2r_bdd100k_rgb2oneplus_raw'
work_dir = './work_dirs/experiments/unpaired_cycler2r_bdd100k_rgb2oneplus_raw'
gpu_ids = range(0, 1)

2023-01-16 15:39:48,774 - mmgen - INFO - Set random seed to 2021, deterministic: False
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Traceback (most recent call last):
  File "train.py", line 173, in <module>
    main()
  File "train.py", line 169, in main
    meta=meta)
  File "/opt/conda/envs/yzh/lib/python3.7/site-packages/mmgen/apis/train.py", line 104, in train_model
    model.cuda(cfg.gpu_ids[0]), device_ids=cfg.gpu_ids)
  File "/opt/conda/envs/yzh/lib/python3.7/site-packages/torch/nn/modules/module.py", line 638, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/envs/yzh/lib/python3.7/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/opt/conda/envs/yzh/lib/python3.7/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/opt/conda/envs/yzh/lib/python3.7/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/opt/conda/envs/yzh/lib/python3.7/site-packages/torch/nn/modules/module.py", line 552, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/yzh/lib/python3.7/site-packages/torch/nn/modules/module.py", line 638, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
(yzh) root@770dce876de1:/workspace/rho-vision-main#

Chegva commented 1 year ago

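The run above keeps failing with CUDA error: out of memory, yet the GPUs still have plenty of free memory. I really can't find the cause. Could you take a look?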
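One possibility, not confirmed anywhere in this thread: the traceback fails at `model.cuda(cfg.gpu_ids[0])` with `gpu_ids = range(0, 1)`, so the model is placed on GPU 0. That card may be full even when the others are idle. The sketch below reads per-GPU free memory from `nvidia-smi` to check this. The query flags are real `nvidia-smi` options; the helper names `parse_free_memory`, `gpu_with_most_free_memory` and `query_free_memory` are hypothetical.

```python
import subprocess

# Real nvidia-smi query flags; prints one "index, free_MiB" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"]


def parse_free_memory(csv_text):
    """Turn 'index, free_MiB' lines into a {gpu_index: free_MiB} dict."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, mib = (field.strip() for field in line.split(","))
        free[int(index)] = int(mib)
    return free


def gpu_with_most_free_memory(free):
    """Return the index of the GPU that reports the most free memory."""
    return max(free, key=free.get)


def query_free_memory():
    """Run nvidia-smi (needs an NVIDIA driver) and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_free_memory(out.stdout)
```

On a machine with GPUs, `gpu_with_most_free_memory(query_free_memory())` gives a candidate index to expose through `CUDA_VISIBLE_DEVICES` before starting train.py.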

lizhihao6 commented 1 year ago

Hello, and thanks for your interest. What is your GPU configuration? You could try reducing the batch size in the config:

samples_per_gpu=2,
workers_per_gpu=4,
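A minimal sketch of what lowering these values amounts to, using plain dicts rather than the project's actual config loader. It assumes the two settings live under a top-level `data` key, which the thread does not show; `apply_override` is a hypothetical helper that mimics a dotted `key=value` override such as the `--cfg-options` flag in train.py.

```python
import copy


def apply_override(cfg, dotted_key, value):
    """Return a copy of cfg with the nested key 'a.b.c' set to value."""
    result = copy.deepcopy(cfg)
    node = result
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        # Walk down the nesting, creating missing levels on the way.
        node = node.setdefault(name, {})
    node[leaf] = value
    return result


# The two values quoted above; the enclosing `data` key is an assumption.
cfg = {"data": {"samples_per_gpu": 2, "workers_per_gpu": 4}}
smaller = apply_override(cfg, "data.samples_per_gpu", 1)
```

Halving `samples_per_gpu` roughly halves activation memory per step, but it does not shrink the model weights themselves.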

Chegva commented 1 year ago

Solved. It now runs in single-GPU mode.

Chegva commented 1 year ago

When I tried multi-GPU training, I used the following settings:

    parser = argparse.ArgumentParser(description='Train a GAN model')
    parser.add_argument('--config',
        default='/workspace/rho-vision-main/configs/unpaired_cycler2r/unpaired_cycler2r_in_bdd100k_rgb2oneplus_raw_20k.py',
        help='train config file path')
    parser.add_argument('--work-dir', help='the dir to save logs and models')
    parser.add_argument(
        '--resume-from', help='the checkpoint file to resume from')
    parser.add_argument(
        '--no-validate',
        action='store_true',
        help='whether not to evaluate the checkpoint during training')
    group_gpus = parser.add_mutually_exclusive_group()
    group_gpus.add_argument(
        '--gpus',
        default=5,
        type=int,
        help='number of gpus to use '
        '(only applicable to non-distributed training)')
    group_gpus.add_argument(
        '--gpu-ids',
        default=[0,1,2,3,4],
        type=int,
        nargs='+',
        help='ids of gpus to use '
        '(only applicable to non-distributed training)')
    parser.add_argument('--seed', type=int, default=2021, help='random seed')
    parser.add_argument(
        '--deterministic',
        action='store_true',
        help='whether to set deterministic options for CUDNN backend.')
    parser.add_argument(
        '--cfg-options',
        nargs='+',
        action=DictAction,
        help='override some settings in the used config, the key-value pair '
        'in xxx=yyy format will be merged into config file.')
    parser.add_argument(
        '--launcher',
        choices=['none', 'pytorch', 'slurm', 'mpi'],
        default='pytorch',
        help='job launcher')
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()
    if 'LOCAL_RANK' not in os.environ:
        os.environ['LOCAL_RANK'] = str(args.local_rank)

At first I hit KeyError: 'RANK', so I followed advice I found online and added os.environ['RANK']='0'.

Next I got ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set. Following the same approach, I added os.environ['WORLD_SIZE']='5', because I had selected five GPUs with os.environ['CUDA_VISIBLE_DEVICES'] = '1,3,4,5,6'.

Finally I got ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set. Again following advice found online, I added os.environ['MASTER_ADDR'] = 'localhost' and os.environ['MASTER_PORT'] = '5678'.

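The program no longer raises an error, but once it starts the terminal prints nothing at all and it seems to hang. Have you run into this on your side, or is something wrong with my setup? If you can, please let me know.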

lizhihao6 commented 1 year ago

Hello. For multi-GPU training you can refer to mmgeneration, but we have not tested it ourselves.
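A likely explanation for the silent hang, offered as an assumption since the thread never confirms it: with env:// rendezvous, torch.distributed blocks until WORLD_SIZE processes have joined. Hard-coding RANK='0' and WORLD_SIZE='5' inside one script starts only rank 0, which then waits for ranks 1 to 4 that were never launched. A launcher such as `python -m torch.distributed.launch --nproc_per_node=5 train.py` normally starts one process per GPU and sets these variables for each of them. The sketch below only illustrates that per-process environment; `rank_environments` and `missing_ranks` are hypothetical helpers, not part of PyTorch or mmgeneration.

```python
def rank_environments(world_size, master_addr="localhost", master_port=5678):
    """Build one environment dict per process for env:// rendezvous."""
    shared = {
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }
    # On a single node, LOCAL_RANK is the same as the global RANK.
    return [dict(shared, RANK=str(rank), LOCAL_RANK=str(rank))
            for rank in range(world_size)]


def missing_ranks(started_envs, world_size):
    """List the ranks the rendezvous would still be waiting for."""
    started = {int(env["RANK"]) for env in started_envs}
    return sorted(set(range(world_size)) - started)


envs = rank_environments(5)
# Only rank 0 started, as in the hand-written setup described above.
waiting_for = missing_ranks(envs[:1], 5)
```

For single-node training without a launcher, the alternative is to run with `--launcher none`, which is the non-distributed path the `--gpus` and `--gpu-ids` help strings refer to.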

NJUVISION / rho-vision

cycler2r code environment issue #3