BlueAmulet / BasicSR

Fork of Basic Super-Resolution codes for development. Includes ESRGAN, SFT-GAN for training and testing.
Apache License 2.0

'not enough memory' error after backing up a training state #10

Open satoshils opened 3 years ago

satoshils commented 3 years ago

My PC has 32 GB of RAM, and only about 40% of it was in use while training. But after backing up a training state, the log says: RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 24774144 bytes. Buy new RAM! I already set n_workers: 1 and batch_size: 1, and the training still failed while trying to allocate only 24774144 bytes. That's less than 25 MB, and my PC has 32 GB of RAM, so why is that not enough?

My setup:
CPU: AMD Ryzen 3700X (8 cores)
GPU: GeForce 1660 Super
CUDA: cuda_11.0.2_451.48_win10
PyTorch: torch-1.6.0-cp38-cp38-win_amd64

This is the log:

export CUDA_VISIBLE_DEVICES=0
20-08-11 07:32:47.805 - INFO:
  name: debug_newtest
  use_tb_logger: True
  model: sr
  scale: 4
  gpu_ids: [0]
  datasets:[
    train:[
      name: DIV2K
      mode: LRHR
      dataroot_HR: ./data_samples/div2k/div2k_train_hr
      dataroot_LR: ./data_samples/div2k/DIV2K_train_LR
      subset_file: None
      use_shuffle: True
      n_workers: 1
      batch_size: 1
      HR_size: 64
      use_flip: True
      use_rot: True
      phase: train
      scale: 4
      data_type: img
      LR_nc: 3
      HR_nc: 3
    ]
    val:[
      name: val_set5
      mode: LRHR
      dataroot_HR: ./data_samples/div2k/div2k_valid_hr
      dataroot_LR: ./data_samples/div2k/DIV2K_valid_LR
      phase: val
      scale: 4
      data_type: img
      LR_nc: 3
      HR_nc: 3
    ]
  ]
  path:[
    root: ./
    pretrain_model_G: ./experiments/pretrained_models/4x_ArtStation1337_FatalityMKII90000G_05_rebout_02.pth
    experiments_root: ./experiments\debug_newtest
    models: ./experiments\debug_newtest\models
    training_state: ./experiments\debug_newtest\training_state
    log: ./experiments\debug_newtest
    val_images: ./experiments\debug_newtest\val_images
  ]
  network_G:[
    which_model_G: RRDB_net
    norm_type: None
    mode: CNA
    nf: 64
    nb: 23
    in_nc: 3
    out_nc: 3
    gc: 32
    group: 1
    scale: 4
  ]
  train:[
    lr_G: 0.0002
    lr_scheme: MultiStepLR
    lr_steps: [200000, 400000, 600000, 800000]
    lr_gamma: 0.5
    pixel_criterion: l1
    pixel_weight: 1
    val_freq: 8
    manual_seed: 0
    niter: 1000000
    lr_decay_iter: 10
  ]
  logger:[
    print_freq: 2
    save_checkpoint_freq: 8
    backup_freq: 2
  ]
  is_train: True
  batch_multiplier: 1
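The epoch count the trainer reports follows directly from this config: with batch_size: 1 and 800 training images, one epoch is 800 iterations, so niter: 1000000 requires 1250 epochs. A quick sanity check of that arithmetic (my own sketch, not BasicSR's actual code):

```python
import math

# Values taken from the options dump above.
n_train_images = 800
batch_size = 1
niter = 1_000_000

iters_per_epoch = math.ceil(n_train_images / batch_size)  # 800
total_epochs = math.ceil(niter / iters_per_epoch)

print(iters_per_epoch, total_epochs)  # 800 1250
```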

20-08-11 07:32:47.805 - INFO: Random seed: 0
20-08-11 07:32:47.815 - INFO: Dataset [LRHRDataset - DIV2K] is created.
20-08-11 07:32:47.815 - INFO: Number of train images: 800, iters: 800
20-08-11 07:32:47.815 - INFO: Total epochs needed: 1250 for iters 1,000,000
20-08-11 07:32:47.817 - INFO: Dataset [LRHRDataset - val_set5] is created.
20-08-11 07:32:47.817 - INFO: Number of val images in [val_set5]: 100
20-08-11 07:32:47.946 - INFO: Initialization method [kaiming]
20-08-11 07:32:49.037 - INFO: Loading pretrained model for G [./experiments/pretrained_models/4x_ArtStation1337_FatalityMKII90000G_05_rebout_02.pth] ...
20-08-11 07:32:49.205 - INFO: Remove frequency separation.
20-08-11 07:32:49.205 - INFO: Remove feature loss.
20-08-11 07:32:49.206 - INFO: Remove HFEN loss.
20-08-11 07:32:49.207 - INFO: Remove TV loss.
20-08-11 07:32:49.207 - INFO: Remove SSIM loss.
20-08-11 07:32:49.207 - INFO: Remove LPIPS loss.
20-08-11 07:32:49.207 - INFO: Remove GAN loss.
20-08-11 07:32:49.211 - INFO: Model [SRRaGANModel] is created.
20-08-11 07:32:49.211 - INFO: Start training from epoch: 0, iter: 0
20-08-11 07:32:51.501 - INFO: <epoch: 0, iter: 2, lr:2.000e-04> l_g_pix: 7.1235e-02
20-08-11 07:32:51.796 - INFO: Backup models and training states saved.
20-08-11 07:32:52.356 - INFO: <epoch: 0, iter: 4, lr:2.000e-04> l_g_pix: 5.8562e-02
20-08-11 07:32:52.600 - INFO: Backup models and training states saved.
20-08-11 07:32:53.110 - INFO: <epoch: 0, iter: 6, lr:2.000e-04> l_g_pix: 4.0749e-02
20-08-11 07:32:53.568 - INFO: Backup models and training states saved.
20-08-11 07:32:54.078 - INFO: <epoch: 0, iter: 8, lr:2.000e-04> l_g_pix: 2.3442e-02
20-08-11 07:32:54.280 - INFO: Models and training states saved.
20-08-11 07:32:54.576 - INFO: Backup models and training states saved.
Setting up Perceptual loss...
Loading model from: D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\lpips_weights\v0.1\squeeze.pth
...[net-lin [squeeze]] initialized
...Done
Traceback (most recent call last):
  File "./codes/train.py", line 252, in <module>
    main()
  File "./codes/train.py", line 213, in main
    avg_lpips += lpips.calculate_lpips(cropped_sr_img, cropped_gt_img)
  File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\compute_dists.py", line 33, in calculate_lpips
    dist01 = model.forward(img2, img1)
  File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\perceptual_loss.py", line 39, in forward
    return self.model.forward(target, pred)
  File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\dist_model.py", line 116, in forward
    return self.net.forward(in0, in1, retPerLayer=retPerLayer)
  File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\networks_basic.py", line 67, in forward
    feats0[kk], feats1[kk] = util.normalize_tensor(outs0[kk]), util.normalize_tensor(outs1[kk])
  File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\perceptual_loss.py", line 42, in normalize_tensor
    norm_factor = torch.sqrt(torch.sum(in_feat**2, dim=1, keepdim=True))
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 24774144 bytes. Buy new RAM!
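For scale, the failed allocation really is small: 24774144 bytes is about 23.6 MiB, i.e. a single float32 tensor of roughly 6.2 million elements. A helper like the following (my own sketch, not part of BasicSR) puts such sizes in perspective; note that on Windows the DefaultCPUAllocator error can fire when the process hits the commit limit (physical RAM plus page file) even while Task Manager still shows free RAM, so the page-file setting is worth checking.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Memory footprint of a dense tensor (float32 by default)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

failed_alloc = 24774144  # bytes, from the error message above

print(failed_alloc / 2**20)   # ~23.6 MiB
print(failed_alloc // 4)      # ~6.19 million float32 elements

# For comparison, one float32 feature map of shape (1, 64, 256, 256):
print(tensor_bytes((1, 64, 256, 256)) / 2**20)  # 16.0 MiB
```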

satoshils commented 3 years ago

My YAML file is:

name: debug_newtest
use_tb_logger: true
model: sr
scale: 4
gpu_ids:

satoshils commented 3 years ago

Traceback from my last training run:

Traceback (most recent call last):
  File "./codes/train.py", line 252, in <module>
    main()
  File "./codes/train.py", line 133, in main
    model.optimize_parameters(train_gen, current_step)
  File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\SRRaGAN_model.py", line 347, in optimize_parameters
    self.fake_H = self.netG(self.var_L)
  File "C:\Users\Satoshi\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Satoshi\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\parallel\data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\Satoshi\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\architectures\RRDBNet_arch.py", line 46, in forward
    x = self.model(x)
  File "C:\Users\Satoshi\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Satoshi\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
    input = module(input)
  File "C:\Users\Satoshi\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Satoshi\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\conv.py", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "C:\Users\Satoshi\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\conv.py", line 415, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size
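This final error means a 3x3 convolution received an input smaller than its kernel: a conv layer's output size along one dimension is floor((H + 2p - k)/s) + 1, and with the padded input already down to 2x2 a 3x3 kernel cannot fit. A quick check with the standard formula (hypothetical sizes, just to illustrate why a tiny LR image or crop fails):

```python
def conv2d_out(size, kernel=3, padding=0, stride=1):
    """Output spatial size of a 2D convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# A padded input of 2x2 fed to a 3x3 kernel yields a non-positive
# output size, which is exactly the condition PyTorch rejects.
print(conv2d_out(2, kernel=3, padding=0))  # 0 -> invalid

# With padding=1 (typical for 3x3 convs in RRDB-style networks) small
# inputs keep their size, so the 2x2 here suggests one of the LR
# training images is only a couple of pixels on a side.
print(conv2d_out(2, kernel=3, padding=1))  # 2
```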