StanfordMIMI / skm-tea

Repository for the Stanford Knee MRI Multi-Task Evaluation (SKM-TEA) Dataset
MIT License

Process Group Not Initialized During Multi-GPU Training #20

Closed billzhonggz closed 1 year ago

billzhonggz commented 2 years ago

I tried to run the code with the tools/train_net.py script, using the config configs/baselines-neurips/dicom-track/seg/unet.yaml. Since I don't have a single GPU with 24GB+ VRAM, I tried to run the code on 2 GPUs with 12GB VRAM each.

I set the argument --num-gpus=2 when starting the program, but I got a "process group not initialized" error when the program reached build_sampler.

https://github.com/StanfordMIMI/skm-tea/blob/58ec1454c989c838a25956900252ea4b6eb383dd/skm_tea/data/data_module.py#L78

Here is the stack trace.

Traceback (most recent call last):
  ...(ignore debugger stack)
  File "/path/to/skm-tea/tools/train_net.py", line 139, in <module>
    main(args)
  File "/path/to/skm-tea/tools/train_net.py", line 99, in main
    model = pl_module(cfg, num_parallel=num_gpus, eval_on_cpu=args.eval_on_cpu)
  File "/path/to/skm-tea/skm_tea/engine/modules/module.py", line 243, in __init__
    super().__init__(cfg, num_parallel, eval_on_cpu=eval_on_cpu, **kwargs)
  File "/path/to/skm-tea/skm_tea/engine/modules/module.py", line 38, in __init__
    super().__init__(
  File "/path/to/skm-tea/skm_tea/engine/modules/base.py", line 56, in __init__
    data_loader = self.train_dataloader(cfg)
  File "/path/to/skm-tea/skm_tea/engine/modules/module.py", line 60, in train_dataloader
    return datamodule.train_dataloader(cfg, self.distributed)
  File "/path/to/skm-tea/skm_tea/data/data_module.py", line 78, in train_dataloader
    sampler, is_batch_sampler = build_train_sampler(cfg, dataset, distributed=use_ddp)
  File "/home/junru/miniconda3/envs/pytorch/lib/python3.10/site-packages/meddlr/data/samplers/build.py", line 63, in build_train_sampler
    sampler = DistributedSampler(dataset, shuffle=True) if distributed else None
  File "/home/junru/miniconda3/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/distributed.py", line 67, in __init__
    num_replicas = dist.get_world_size()
  File "/home/junru/miniconda3/envs/pytorch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/home/junru/miniconda3/envs/pytorch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/home/junru/miniconda3/envs/pytorch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
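
For reference, the failure can be reproduced outside of skm-tea: DistributedSampler queries the default process group in its constructor, so it raises this exact error unless init_process_group has been called first. A standalone sketch (my own example, not repository code):

import torch
import torch.distributed as dist
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.zeros(8, 1))

# Without an initialized default process group, the sampler cannot infer
# the world size and raises the RuntimeError shown above.
try:
    DistributedSampler(dataset, shuffle=True)
except RuntimeError as e:
    print(e)

# After initializing a (single-process, for illustration) group, it works.
dist.init_process_group(
    backend="gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
sampler = DistributedSampler(dataset, shuffle=True)
print(len(sampler))  # 8
dist.destroy_process_group()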

To be clear, I changed several lines of the source code to skip some exceptions that I believe come from API changes. Those changes are only about logging and profiling, so I don't think they are related to multiprocessing.

My environment and configuration dump is:

[09/19 15:07:11] skm_tea INFO: Running with pytorch lightning
[09/19 15:07:11] skm_tea INFO: Rank of current process: 0. World size: 1
[09/19 15:07:13] skm_tea INFO: Environment info:
----------------------  ------------------------------------------------------------------------------------
sys.platform            linux
Python                  3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
numpy                   1.22.0
PyTorch                 1.12.1 @/home/junru/miniconda3/envs/pytorch/lib/python3.10/site-packages/torch
PyTorch debug build     False
CUDA available          True
GPU 0                   NVIDIA TITAN Xp
GPU 1                   Quadro M6000
CUDA_HOME               /usr/local/cuda
NVCC                    Build cuda_11.7.r11.7/compiler.31442593_0
Pillow                  9.2.0
torchvision             0.13.1 @/home/junru/miniconda3/envs/pytorch/lib/python3.10/site-packages/torchvision
torchvision arch flags  sm_35, sm_50, sm_60, sm_70, sm_75, sm_80, sm_86
SLURM_JOB_ID            slurm not detected
cv2                     4.5.5
----------------------  ------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

And the full config from the log file is:

# Config for U-Net as implemented in MedSegPy.
# Implementation based on:
#   - Desai et al. Technical considerations for semantic segmentation in MRI. ArXiv 2019.
#   - Desai et al. "Deep learning for medical image segmentation". MICCAI 2018.
MODEL:
  TASKS: ("sem_seg",)
  PARAMETERS:
    INIT:
    - 
      kind: "conv"
      patterns: (".*weight", ".*bias")
      initializers: (("kaiming_normal_", {"nonlinearity":"relu"}), "zeros_")
    - 
      kind: "norm"
      patterns: (".*weight", ".*bias")
      initializers: ("ones_", "zeros_")
    -
      patterns: ("output_block\.weight",)
      initializers: ("xavier_uniform_",)
  META_ARCHITECTURE: "GeneralizedUNet"
  UNET:
    CHANNELS: 32
    NUM_POOL_LAYERS: 5
    DROPOUT: 0.0
    BLOCK_ORDER: ("conv", "relu", "conv", "relu", "bn")
  SEG:
    LOSS_NAME: "FlattenedDiceLoss"
    CLASSES: ("pc", "fc", "men", "tc")
DATASETS:
  TRAIN: ("skmtea_v1_train",)
  VAL: ("skmtea_v1_val",)
  TEST: ("skmtea_v1_test",)
  QDESS:
    ECHO_KIND: "echo1"  # This must be specified - one of ("echo1", "echo2", "echo1-echo2-mc", "rss")
    DATASET_TYPE: "qDESSImageDataset"
    KWARGS: ("orientation", "sagittal")
DATALOADER:
  NUM_WORKERS: 8
  GROUP_SAMPLER:
    BATCH_BY: ("inplane_shape",)
    AS_BATCH_SAMPLER: True
SOLVER:
  OPTIMIZER: "Adam"
  LR_SCHEDULER_NAME: "StepLR"
  GAMMA: 0.9
  STEPS: (2,)  # drop by 0.9x every 2 epochs
  BASE_LR: 1e-3
  MIN_LR: 1e-8
  TRAIN_BATCH_SIZE: 16
  TEST_BATCH_SIZE: 16
  MAX_ITER: 100
  WEIGHT_DECAY: 0.
  CHECKPOINT_PERIOD: 1
  EARLY_STOPPING:
    MONITOR: "val_loss"
    PATIENCE: 12
    MIN_DELTA: 1e-5
DESCRIPTION:
  BRIEF: f"UNet segmentation following parameters used in MedSegPy - input={DATASETS.QDESS.ECHO_KIND}, {SOLVER.MAX_ITER} epochs, {SOLVER.BASE_LR} lr w/ {SOLVER.GAMMA}x decay every {SOLVER.STEPS} epochs, early stopping- T={SOLVER.EARLY_STOPPING.PATIENCE}, delta={SOLVER.EARLY_STOPPING.MIN_DELTA}, bsz={SOLVER.TRAIN_BATCH_SIZE}, qdess args={DATASETS.QDESS.KWARGS}"
  PROJECT_NAME: "skm-tea"
  ENTITY_NAME: "billzhonggz"
  EXP_NAME: f"seg-baseline/unet-medsegpy-{DATASETS.QDESS.ECHO_KIND}-seed={SEED}"
  TAGS: ("seg-baseline", "baseline", "unet-medsegpy", "neurips")
TEST:
  EVAL_PERIOD: 1
  VAL_METRICS:
    SEM_SEG: ("DSC","VOE","CV","DSC_scan","VOE_scan","CV_scan")
  FLUSH_PERIOD: -5
VIS_PERIOD: -100
TIME_SCALE: "epoch"
OUTPUT_DIR: f"results://skm-tea/seg-baseline/unet-{DATASETS.QDESS.ECHO_KIND}-seed={SEED}"
SEED: 9001
VERSION: 1

[09/19 15:07:13] skm_tea INFO: Running with full config:

AUG_TEST:
  UNDERSAMPLE:
    ACCELERATIONS: (6,)
AUG_TRAIN:
  NOISE_P: 0.2
  UNDERSAMPLE:
    ACCELERATIONS: (6,)
    CALIBRATION_SIZE: 24
    CENTER_FRACTIONS: ()
    MAX_ATTEMPTS: 5
    NAME: PoissonDiskMaskFunc
    PRECOMPUTE:
      NUM: -1
      SEED: -1
      USE_MULTIPROCESSING: False
  USE_NOISE: False
CUDNN_BENCHMARK: False
DATALOADER:
  ALT_SAMPLER:
    PERIOD_SUPERVISED: 1
    PERIOD_UNSUPERVISED: 1
  DATA_KEYS: ()
  DROP_LAST: True
  FILTER:
    BY: ()
  GROUP_SAMPLER:
    AS_BATCH_SAMPLER: True
    BATCH_BY: ('inplane_shape',)
  NUM_WORKERS: 8
  PREFETCH_FACTOR: 2
  SAMPLER_TRAIN: 
  SUBSAMPLE_TRAIN:
    NUM_TOTAL: -1
    NUM_TOTAL_BY_GROUP: ()
    NUM_UNDERSAMPLED: 0
    NUM_VAL: -1
    NUM_VAL_BY_GROUP: ()
    SEED: 1000
DATASETS:
  QDESS:
    DATASET_TYPE: qDESSImageDataset
    ECHO_KIND: echo1
    KWARGS: ('orientation', 'sagittal')
  TEST: ('skmtea_v1_test',)
  TRAIN: ('skmtea_v1_train',)
  VAL: ('skmtea_v1_val',)
DESCRIPTION:
  BRIEF: UNet segmentation following parameters used in MedSegPy - input=echo1, 100 epochs, 0.001 lr w/ 0.9x decay every 2 epochs, early stopping- T=12, delta=1e-05, bsz=16, qdess args=orientation-sagittal
  ENTITY_NAME: billzhonggz
  EXP_NAME: seg-baseline/unet-medsegpy-echo1-seed=9001
  PROJECT_NAME: skm-tea
  TAGS: ('seg-baseline', 'baseline', 'unet-medsegpy', 'neurips')
MODEL:
  CASCADE:
    ITFS:
      PERIOD: 0
    RECON_MODEL_NAME: 
    SEG_MODEL_NAME: 
    SEG_NORMALIZE: 
    USE_MAGNITUDE: False
    ZERO_FILL: False
  CS:
    MAX_ITER: 200
    REGULARIZATION: 0.005
  DENOISING:
    META_ARCHITECTURE: GeneralizedUnrolledCNN
    NOISE:
      STD_DEV: (1,)
      USE_FULLY_SAMPLED_TARGET: True
      USE_FULLY_SAMPLED_TARGET_EVAL: None
  DEVICE: cuda
  META_ARCHITECTURE: GeneralizedUNet
  N2R:
    META_ARCHITECTURE: GeneralizedUnrolledCNN
    USE_SUPERVISED_CONSISTENCY: False
  NORMALIZER:
    KEYWORDS: ()
    NAME: TopMagnitudeNormalizer
  PARAMETERS:
    INIT: ({'kind': 'conv', 'patterns': '(".*weight", ".*bias")', 'initializers': '(("kaiming_normal_", {"nonlinearity":"relu"}), "zeros_")'}, {'kind': 'norm', 'patterns': '(".*weight", ".*bias")', 'initializers': '("ones_", "zeros_")'}, {'patterns': '("output_block\\.weight",)', 'initializers': '("xavier_uniform_",)'})
    USE_COMPLEX_WEIGHTS: False
  RECON_LOSS:
    NAME: l1
    RENORMALIZE_DATA: True
    WEIGHT: 1.0
  SEG:
    ACTIVATION: sigmoid
    CLASSES: ('pc', 'fc', 'men', 'tc')
    INCLUDE_BACKGROUND: False
    IN_CHANNELS: None
    LOSS_NAME: FlattenedDiceLoss
    LOSS_WEIGHT: 1.0
    MODEL:
      DYNUNET_MONAI:
        DEEP_SUPERVISION: False
        DEEP_SUPR_NUM: 1
        KERNEL_SIZE: (3,)
        NORM_NAME: instance
        RES_BLOCK: False
        STRIDES: (1,)
        UPSAMPLE_KERNEL_SIZE: (2,)
      UNET_MONAI:
        ACTIVATION: ('prelu', {})
        CHANNELS: ()
        DROPOUT: 0.0
        KERNEL_SIZE: (3,)
        NORM: ('instance', {})
        NUM_RES_UNITS: 0
        STRIDES: ()
        UP_KERNEL_SIZE: (3,)
      VNET_MONAI:
        ACTIVATION: ('elu', {'inplace': True})
        DROPOUT_DIM: 2
        DROPOUT_PROB: 0.5
    USE_MAGNITUDE: True
  TASKS: ('sem_seg',)
  TB_RECON:
    CHANNELS: (16, 32, 64)
    DEC_NUM_CONV_BLOCKS: (2, 3)
    ENC_NUM_CONV_BLOCKS: (1, 2, 3)
    KERNEL_SIZE: (5,)
    MULTI_CONCAT: ()
    ORDER: ('conv', 'relu')
    STRIDES: (2,)
    USE_MAGNITUDE: False
  UNET:
    BLOCK_ORDER: ('conv', 'relu', 'conv', 'relu', 'bn')
    CHANNELS: 32
    DROPOUT: 0.0
    IN_CHANNELS: 2
    NORMALIZE: False
    NUM_POOL_LAYERS: 5
    OUT_CHANNELS: 2
  UNROLLED:
    BLOCK_ARCHITECTURE: ResNet
    CONV_BLOCK:
      ACTIVATION: relu
      NORM: none
      NORM_AFFINE: False
      ORDER: ('norm', 'act', 'drop', 'conv')
    DROPOUT: 0.0
    FIX_STEP_SIZE: False
    KERNEL_SIZE: (3,)
    NUM_EMAPS: 1
    NUM_FEATURES: 256
    NUM_RESBLOCKS: 2
    NUM_UNROLLED_STEPS: 5
    PADDING: 
    SHARE_WEIGHTS: False
    STEP_SIZES: (-2.0,)
  WEIGHTS: 
OUTPUT_DIR: ./results/skm-tea/seg-baseline/unet-echo1-seed=9001/version_001
SEED: 9001
SOLVER:
  BASE_LR: 0.001
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_MONITOR: val_loss
  CHECKPOINT_PERIOD: 1
  EARLY_STOPPING:
    MIN_DELTA: 1e-05
    MONITOR: val_loss
    PATIENCE: 12
  GAMMA: 0.9
  GRAD_ACCUM_ITERS: 1
  LR_SCHEDULER_NAME: StepLR
  MAX_ITER: 100
  MIN_LR: 1e-08
  MOMENTUM: 0.9
  OPTIMIZER: Adam
  STEPS: (2,)
  TEST_BATCH_SIZE: 16
  TRAIN_BATCH_SIZE: 16
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 1000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0
  WEIGHT_DECAY_BIAS: 0.0001
  WEIGHT_DECAY_NORM: 0.0
TEST:
  EVAL_PERIOD: 1
  EXPECTED_RESULTS: []
  FLUSH_PERIOD: -5
  QDESS_EVALUATOR:
    ADDITIONAL_PATHS: ()
  VAL_METRICS:
    RECON: ()
    SEM_SEG: ('DSC', 'VOE', 'CV', 'DSC_scan', 'VOE_scan', 'CV_scan')
TIME_SCALE: epoch
VERSION: 1
VIS_PERIOD: -100

I did some study of the source code of this repository and meddlr. To my understanding, the process group should be initialized in the skm_tea/engine/modules/base.py file, but I saw many TODOs in that file about multiprocessing.

I am trying to fix the bug by adding init_process_group() to the file mentioned above. Could you also investigate the issue, or suggest an environment setup that is guaranteed to work?
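
For reference, the kind of change I am experimenting with looks roughly like the sketch below. The helper name, the default MASTER_ADDR/MASTER_PORT values, and where exactly it should be called inside base.py are my own guesses, not anything the repository defines:

import os

import torch
import torch.distributed as dist

def maybe_init_process_group(rank: int, world_size: int) -> None:
    # Hypothetical helper: initialize the default process group once per
    # process before any DistributedSampler is constructed.
    if world_size > 1 and not dist.is_initialized():
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        backend = "nccl" if torch.cuda.is_available() else "gloo"
        # The default init_method is "env://", which reads the variables above.
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)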

aldiak commented 1 year ago

Hi, have you found a solution to your problem? I am facing similar issues related to multiprocessing: I am getting "Windows fatal exception: access violation".

billzhonggz commented 1 year ago

Hi, have you found a solution to your problem? I am facing similar issues related to multiprocessing: I am getting "Windows fatal exception: access violation".

I bypassed the issue by using a GPU with more VRAM. I tried to fix it before, but it was too much work.

aldiak commented 1 year ago

Alright, if possible, can you share your data loading modules? My program loads the training data, but while precomputing the masks it throws an access violation error related to multiprocessing.


aldiak commented 1 year ago

Here is the error I am getting


aldiak commented 1 year ago

Hi, have you found a solution to your problem? I am facing similar issues related to multiprocessing: I am getting "Windows fatal exception: access violation".

I bypassed the issue by using a GPU with more VRAM. I tried to fix it before, but it was too much work.

Hi, here is my error message:

2022-10-20 09:32:17,513 - Formatting dataset dicts takes 0.05 seconds
2022-10-20 09:32:17,514 - Dropped 0 scans. 86 scans remaining
2022-10-20 09:32:17,515 - Dropped references for 0/86 scans. 86 scans with reference remaining
2022-10-20 09:32:18,152 - Loading D:/files_recon_calib-24/annotations\val.json takes 0.00 seconds
2022-10-20 09:32:18,193 - Formatting dataset dicts takes 0.04 seconds
2022-10-20 09:32:18,193 - Dropped 0 scans. 33 scans remaining
2022-10-20 09:32:18,194 - Dropped references for 0/33 scans. 33 scans with reference remaining
Precomputing masks: 0%| | 0/1 [00:00<?, ?it/s]
Windows fatal exception: access violation
| 1/12 [00:00<00:07, 1.45it/s]

Thread 0x00002bf4 (most recent call first):
  File "C:\Users\Alou\anaconda3\envs\research\lib\threading.py", line 300 in wait
  File "C:\Users\Alou\anaconda3\envs\research\lib\threading.py", line 552 in wait
  File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\tqdm\_monitor.py", line 60 in run
  File "C:\Users\Alou\anaconda3\envs\research\lib\threading.py", line 926 in _bootstrap_inner
  File "C:\Users\Alou\anaconda3\envs\research\lib\threading.py", line 890 in _bootstrap

Current thread 0x00006714 (most recent call first):
  File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\sigpy\mri\samp.py", line 66 in poisson
  File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\meddlr\data\transforms\subsample.py", line 176 in __call__
  File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\skm_tea\data\transform.py", line 527 in _precompute_mask
  File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\skm_tea\data\transform.py", line 153 in precompute_masks
  File "C:\Users\Alou\Downloads\MoDL\Data\data_module.py", line 173 in _make_eval_datasets
  File "C:\Users\Alou\Downloads\MoDL\Data\data_module.py", line 68 in setup
  File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\pytorch_lightning\core\datamodule.py", line 92 in wrapped_fn
  File "data.py", line 44 in __init__
  File "data.py", line 69 in <module>
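
One general precaution I still need to rule out on my side: Windows starts worker processes with spawn rather than fork, so any script that ends up using multiprocessing (for example DataLoader workers, or mask precomputation with multiprocessing enabled) has to guard its entry point. A minimal sketch of what I mean, not a confirmed fix for this crash:

from multiprocessing import freeze_support

def main():
    # build the data module / start training here
    ...

if __name__ == "__main__":
    freeze_support()  # no-op except in frozen Windows executables
    main()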

ad12 commented 1 year ago

Thanks for the question @billzhonggz - apologies for the delay. I am working on a fix in both meddlr and skm-tea that will support multi-gpu training. More information in PR #22

I'll provide an updated command once that PR is merged in

billzhonggz commented 1 year ago

Thanks for the question @billzhonggz - apologies for the delay. I am working on a fix in both meddlr and skm-tea that will support multi-gpu training. More information in PR #22

I'll provide an updated command once that PR is merged in

Thanks for the update! Let me close the issue for now and test it later. I will re-open the issue if I have any follow-up.

ad12 commented 1 year ago

In case it's helpful - adding some pointers below:

# Update meddlr
pip install --upgrade meddlr

# Train with 2 gpus, 4 workers per process for data loading, training/test batch size of 2
python tools/train_net.py --debug --config-file <your-config-file> --num-gpus=2 DATALOADER.NUM_WORKERS 4 SOLVER.TRAIN_BATCH_SIZE 2 SOLVER.TEST_BATCH_SIZE 2

Some tips with smaller GPUs:

  • Make sure your machine has enough RAM to support your dataloader
  • If you run into dataloader issues, try changing the number of workers
  • If the GPU runs out of memory during validation/testing, use cfg.TEST.FLUSH_PERIOD = -1

aldiak commented 1 year ago

Thanks, I'll check it out.
