Closed: hcleung3325 closed this issue 3 years ago.
MANet training doesn't take much memory. Did you turn on `cal_lr_psnr`?
https://github.com/JingyunLiang/MANet/blob/34f90ba8888f4a1dd2a1127b97c2ec3706f06598/codes/options/train/train_stage1.yml#L28
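For reference, that flag sits in the general settings of `train_stage1.yml` (line 28 in the linked file); this is just an excerpt of how it appears, matching the config posted below:

```yaml
# excerpt from train_stage1.yml (general settings)
cal_lr_psnr: False  # calculate lr psnr consumes huge memory; keep False unless needed
```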
Thanks for the reply. No, it stays at False. Here is my full config:
#### general settings
name: 001_MANet_aniso_x4_TMO_40_stage1
use_tb_logger: true
model: blind
distortion: sr
scale: 4
gpu_ids: [1]
kernel_size: 21
code_length: 15

# train
sig_min: 0.7 # 0.7, 0.525, 0.35 for x4, x3, x2
sig_max: 10.0 # 10, 7.5, 5 for x4, x3, x2
train_noise: False
noise_high: 15
train_jpeg: False
jpeg_low: 70

# validation
sig: 1.6
sig1: 6 # 6, 5, 4 for x4, x3, x2
sig2: 1
theta: 0
rate_iso: 0 # 1 for iso, 0 for aniso
test_noise: False
noise: 15
test_jpeg: False
jpeg: 70

pca_path: ./pca_matrix_aniso21_15_x4.pth
cal_lr_psnr: False # calculate lr psnr consumes huge memory

#### datasets
datasets:
  train:
    name: TMO
    mode: GT
    dataroot_GT: ../datasets/HR
    dataroot_LQ: ~

    use_shuffle: true
    n_workers: 8
    batch_size: 4
    GT_size: 192
    LR_size: ~
    use_flip: true
    use_rot: true
    color: RGB
  val:
    name: Set5
    mode: GT
    dataroot_GT: ../../data
    dataroot_LQ: ~

#### network structures
network_G:
  which_model_G: MANet_s1
  in_nc: 3
  out_nc: ~
  nf: ~
  nb: ~
  gc: ~
  manet_nf: 128
  manet_nb: 1
  split: 2

#### path
path:
  pretrain_model_G: ~
  strict_load: true
  resume_state: ~ #../experiments/001_MANet_aniso_x4_DIV2K_40_stage1/training_state/5000.state

#### training settings: learning rate scheme, loss
train:
  lr_G: !!float 2e-4
  lr_scheme: MultiStepLR
  beta1: 0.9
  beta2: 0.999
  niter: 300000
  warmup_iter: -1
  lr_steps: [100000, 150000, 200000, 250000]
  lr_gamma: 0.5
  restarts: ~
  restart_weights: ~
  eta_min: !!float 1e-7

  kernel_criterion: l1
  kernel_weight: 1.0

  manual_seed: 0
  val_freq: !!float 2e7

#### logger
logger:
  print_freq: 200
  save_checkpoint_freq: !!float 2e4
It's strange because MANet is a tiny model and consumes little memory. Do you have any problems testing the model? Can you try to set `manet_nf=32` in training?
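For reference, a minimal sketch of that change in the `network_G` block, assuming everything else stays as in the config posted above:

```yaml
# sketch: network_G with the suggested reduced feature count
network_G:
  which_model_G: MANet_s1
  in_nc: 3
  out_nc: ~
  nf: ~
  nb: ~
  gc: ~
  manet_nf: 32   # reduced from 128 to lower memory usage
  manet_nb: 1
  split: 2
```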
Thanks for the reply. I have tried `manet_nf=32` and it still goes OOM. Is `python train.py --opt options/train/train_stage1.yml` the right command to run?
I think it's a problem with your GPU. Can you train other models normally? Can you test MANet on your GPU?
My GPU is a 2080 Ti with only 11 GB. Do I need a GPU with more memory to train it?
I don't think so. A 2080 Ti should at least be enough when `manet_nf=32`. Can you monitor the GPU usage with `watch -d -n 0.5 nvidia-smi` when you start training the model?
Thanks a lot. The problem is solved. I can run the training now.
Thanks for your code. I tried to train the model with train_stage1.yml and got a CUDA OOM error. I am using a 2080 Ti. I tried reducing the batch size from 16 to 2 and GT_size from 192 to 48, but training still runs out of memory. Is there anything I missed? Thanks.
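For reference, the reductions described above correspond to something like this in the `datasets: train:` block of train_stage1.yml (a sketch; the remaining keys stay as in the full config posted earlier):

```yaml
# sketch: reduced training-patch settings tried in this issue
datasets:
  train:
    batch_size: 2   # reduced from 16
    GT_size: 48     # reduced from 192
```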