ChrisChen1023 / HINT

HINT: High-quality INpainting Transformer with Enhanced Attention and Mask-aware Encoding
MIT License

torch.cuda.OutOfMemoryError: CUDA out of memory #8

Open WJDNJSDJ opened 5 months ago

WJDNJSDJ commented 5 months ago

Hello, thank you very much for your contribution. While running this code I hit a CUDA out-of-memory error. I have already set the training batch size to the minimum, but the error still occurs. Please do not hesitate to give me your advice. Thank you very much, and I wish you a happy life and all the best!!

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.08 GiB (GPU 0; 15.70 GiB total capacity; 3.28 GiB already allocated; 6.08 GiB free; 7.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
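The message itself points at one knob worth trying before touching the model: `max_split_size_mb`, which makes the caching allocator split large blocks less aggressively. A minimal sketch, where `128` is an assumed starting value rather than a recommendation from this repo:

```python
# Sketch: configure the CUDA caching allocator before torch touches the GPU.
# "max_split_size_mb:128" is an assumed example value; tune as needed.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable so the allocator sees it

print(torch.cuda.is_available())
```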

ChrisChen1023 commented 5 months ago

Hi,

Thanks for your interest. We train our model on a single A100 GPU, as mentioned in our paper. You could consider reducing some of the hyper-parameters, such as the number of Transformer blocks in each stage; that way you might be able to train your own model. Hope this is helpful.
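If reducing the depth alone is not enough, activation checkpointing is another memory-for-compute trade worth trying. The sketch below is not HINT's actual code; the toy `Block`/`Encoder` classes are stand-ins for the repo's Transformer stages:

```python
# Sketch of activation checkpointing (PyTorch >= 1.11 for use_reentrant).
# Block/Encoder are toy stand-ins, NOT the actual HINT modules.
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = torch.nn.Linear(dim, dim)

    def forward(self, x):
        x = x + self.attn(x, x, x)[0]
        return x + self.mlp(x)

class Encoder(torch.nn.Module):
    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            # Recompute this block's activations during backward instead of
            # caching them: lower peak memory at the cost of extra compute.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

x = torch.randn(2, 16, 64, requires_grad=True)
Encoder()(x).sum().backward()  # gradients flow through the checkpoints
```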

BINAIQIN commented 5 months ago

Thank you very much for your reply despite your busy schedule. With your help, that problem was solved, but now len(self.data_info) == 0 in the Dataset, meaning no data is passed to the dataloader. So far I have 1) checked the paths in the dataset, which look correct to me, and 2) added print("**", len(train_loader)) in the while loop of HINT.py, which prints 0. The running log is as follows:

Cuda is available
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
/home/liu/anaconda3/envs/HINT/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead. warnings.warn(
/home/liu/anaconda3/envs/HINT/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=VGG19_Weights.IMAGENET1K_V1. You can also use weights=VGG19_Weights.DEFAULT to get the most up-to-date weights. warnings.warn(msg)
/home/liu/anaconda3/envs/HINT/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=VGG16_Weights.IMAGENET1K_V1. You can also use weights=VGG16_Weights.DEFAULT to get the most up-to-date weights. warnings.warn(msg)
Loading model from: /home/liu/anaconda3/envs/HINT/lib/python3.8/site-packages/lpips/weights/v0.1/vgg.pth
module 'numpy' has no attribute 'str'. np.str was a deprecated alias for the builtin str. To avoid this error in existing code, use str by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.str_ here. The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
Model configurations:

MODE: 1                          # 1: train, 2: test
MODEL: 2                         # 2: inpaint model
MASK: 3                          # 0: no mask, 1: random block, 2: center mask, 3: external, 4: 50% external + 50% random block, 5: 50% no mask + 25% random block + 25% external, 6: external non-random
SEED: 10                         # random seed
GPU: [0]                         # list of gpu ids
AUGMENTATION_TRAIN: 0            # 1: use augmentation to train landmark predictor, 0: do not use
TRAIN_INPAINT_IMAGE_FLIST: /home/liu/ZZB/HINT-main/script/Train_GT/Train_GT.txt
TEST_INPAINT_IMAGE_FLIST:
TRAIN_MASK_FLIST: /home/liu/ZZB/HINT-main/script/Mask.txt
TEST_MASK_FLIST:
LR: 0.0001                       # learning rate
D2G_LR: 0.1                      # discriminator/generator learning rate ratio
BETA1: 0.9                       # adam optimizer beta1
BETA2: 0.999                     # adam optimizer beta2
WD: 0
LR_Decay: 1
BATCH_SIZE: 4                    # input batch size for training
INPUT_SIZE: 256                  # input image size for training, 0 for original size
MAX_ITERS: 300001                # maximum number of iterations to train the model
MAX_ITERS: 10000                 # maximum number of iterations to train the model
L1_LOSS_WEIGHT: 1                # l1 loss weight
STYLE_LOSS_WEIGHT: 250           # style loss weight
CONTENT_LOSS_WEIGHT: 0.1         # perceptual loss weight
INPAINT_ADV_LOSS_WEIGHT: 0.01    # adversarial loss weight
GAN_LOSS: lsgan                  # nsgan | lsgan | hinge
GAN_POOL_SIZE: 0                 # fake images pool size
SAVE_INTERVAL: 1000              # how many iterations to wait before saving model (0: never)
EVAL_INTERVAL: 0                 # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 100                # how many iterations to wait before logging training status (0: never)

start training...
Training epoch: 1 **** 0
Training epoch: 2 **** 0
Training epoch: 3 **** 0
Training epoch: 4 **** 0
Training epoch: 5 **** 0
/home/liu/ZZB/HINT-main/src/dataset.py:164: FutureWarning: In the future np.str will be defined as the corresponding NumPy scalar.
  return np.genfromtxt(flist, dtype=np.str, encoding='utf-8')
Training epoch: 6 **** 0
Training epoch: 7 **** 0
Training epoch: 8 **** 0
Training epoch: 9 **** 0
Training epoch: 10 **** 0
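One thing worth ruling out: the warning at src/dataset.py:164 shows the file list is loaded with dtype=np.str, an alias NumPy deprecated in 1.20 and removed in 1.24, so on a newer NumPy that load can fail and leave the list empty. A minimal sanity check, mirroring that line but with the builtin str:

```python
# Sketch: load the flist the way dataset.py:164 does, but with the builtin
# str dtype (np.str was removed in NumPy 1.24).
import numpy as np

flist = "/home/liu/ZZB/HINT-main/script/Train_GT/Train_GT.txt"
paths = np.atleast_1d(np.genfromtxt(flist, dtype=str, encoding="utf-8"))
print(len(paths), "entries loaded")
if len(paths):
    print("first entry:", paths[0])
```

If this prints 0 entries, the flist file itself is empty or the path is wrong, independent of the np.str issue.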

ChrisChen1023 commented 5 months ago

> (BINAIQIN's comment above, quoted in full)

Hi,

Is the path you provided pointing to the .flist file?
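For context, the FLIST entries in the config are plain text files with one image path per line (that is what np.genfromtxt in dataset.py reads). A hypothetical helper for producing one; the image directory and extension are assumptions:

```python
# Hypothetical helper (not part of HINT): write an .flist file with one
# absolute image path per line, as TRAIN_INPAINT_IMAGE_FLIST expects.
import glob
import os

image_dir = "/home/liu/ZZB/dataset/train_gt"  # assumed image location
paths = sorted(glob.glob(os.path.join(image_dir, "*.png")))  # assumed extension
with open("/home/liu/ZZB/HINT-main/script/Train_GT/Train_GT.txt", "w") as f:
    f.write("\n".join(paths))
print("wrote", len(paths), "paths")
```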