advimman / lama

🦙 LaMa Image Inpainting, Resolution-robust Large Mask Inpainting with Fourier Convolutions, WACV 2022
https://advimman.github.io/lama-project/
Apache License 2.0

Is learning in progress? #258

Closed by kwanwoo02 1 year ago

kwanwoo02 commented 1 year ago

Hello, I'm not sure whether training is in progress. Is there a way to check this?

I followed this method: readme_create_mydataset

After preparing the dataset that way, I ran the command python3 bin/train.py -cn big-lama location=my_dataset data.batch_size=10.

My server has four A100 40GB GPUs. The log looks like this, and nothing else changes afterwards. I want to know whether training is currently running.

The logs so far are as follows:

[2023-09-25 17:09:54,372][root][INFO] - Make discriminator pix2pixhd_nlayer
[2023-09-25 17:09:54,419][root][INFO] - Make visualizer directory
[2023-09-25 17:09:54,419][root][INFO] - Make evaluator default
[2023-09-25 17:09:55,717][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init called
[2023-09-25 17:09:55,717][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 called
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/torchvision/models/inception.py:83: FutureWarning: The default weight initialization of inception_v3 will be changed in future releases of torchvision. If you wish to keep the old behavior (which leads to long initialization times due to scipy/scipy#11299), please set init_weights=True.
  ' due to scipy/scipy#11299), please set init_weights=True.', FutureWarning)
[2023-09-25 17:09:56,079][saicinpainting.evaluation.losses.fid.inception][INFO] - models.inception_v3 done
[2023-09-25 17:09:56,207][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 patching done
[2023-09-25 17:09:56,260][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 weights downloaded
[2023-09-25 17:09:56,334][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 weights loaded into model
[2023-09-25 17:09:56,338][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init done
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/torch/cuda/__init__.py:143: UserWarning: 
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

[2023-09-25 17:47:11,367][saicinpainting.evaluation.evaluator][INFO] - <class 'saicinpainting.evaluation.evaluator.InpaintingEvaluatorOnline'> init called
[2023-09-25 17:47:11,369][saicinpainting.evaluation.evaluator][INFO] - <class 'saicinpainting.evaluation.evaluator.InpaintingEvaluatorOnline'> init done
[2023-09-25 17:47:11,370][root][INFO] - Make evaluator default
[2023-09-25 17:47:12,600][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init called
[2023-09-25 17:47:12,600][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init done
[2023-09-25 17:47:12,603][saicinpainting.evaluation.evaluator][INFO] - <class 'saicinpainting.evaluation.evaluator.InpaintingEvaluatorOnline'> init called
[2023-09-25 17:47:12,603][saicinpainting.evaluation.evaluator][INFO] - <class 'saicinpainting.evaluation.evaluator.InpaintingEvaluatorOnline'> init done
[2023-09-25 17:47:12,604][saicinpainting.training.trainers.base][INFO] - Discriminator
NLayerDiscriminator(
  (model0): Sequential(
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(2, 2))
    (1): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (model1): Sequential(
    (0): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(2, 2))
    (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (model2): Sequential(
    (0): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(2, 2))
    (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (model3): Sequential(
    (0): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(2, 2))
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (model4): Sequential(
    (0): Conv2d(512, 512, kernel_size=(4, 4), stride=(1, 1), padding=(2, 2))
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (model5): Sequential(
    (0): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(2, 2))
  )
)
Loading weights for net_encoder
[2023-09-25 17:47:13,088][saicinpainting.training.trainers.base][INFO] - BaseInpaintingTrainingModule init done
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Detectron v2 is not installed
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'hydra/overrides': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'trainer/any_gpu_large_ssim_ddp_final': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'evaluator/default_inpainted': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'visualizer/directory': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'optimizers/default_optimizers': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'discriminator/pix2pixhd_nlayer': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'data/abl-04-256-mh-dist': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
[2023-09-25 17:47:15,709][saicinpainting.utils][WARNING] - Setting signal 10 handler <function print_traceback_handler at 0x7f59af68bd90>
[2023-09-25 17:47:15,710][root][INFO] - Make training model default
[2023-09-25 17:47:15,711][saicinpainting.training.trainers.base][INFO] - BaseInpaintingTrainingModule init called
[2023-09-25 17:47:15,711][root][INFO] - Make generator ffc_resnet
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/omegaconf/resolvers/__init__.py:13: UserWarning: The `env` resolver is deprecated, see https://github.com/omry/omegaconf/issues/573
  "The `env` resolver is deprecated, see https://github.com/omry/omegaconf/issues/573"
[2023-09-25 17:47:16,268][root][INFO] - Make discriminator pix2pixhd_nlayer
[2023-09-25 17:47:16,315][root][INFO] - Make visualizer directory
[2023-09-25 17:47:16,315][root][INFO] - Make evaluator default
[2023-09-25 17:47:17,602][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init called
[2023-09-25 17:47:17,602][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 called
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/torchvision/models/inception.py:83: FutureWarning: The default weight initialization of inception_v3 will be changed in future releases of torchvision. If you wish to keep the old behavior (which leads to long initialization times due to scipy/scipy#11299), please set init_weights=True.
  ' due to scipy/scipy#11299), please set init_weights=True.', FutureWarning)
[2023-09-25 17:47:17,981][saicinpainting.evaluation.losses.fid.inception][INFO] - models.inception_v3 done
[2023-09-25 17:47:18,110][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 patching done
[2023-09-25 17:47:18,160][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 weights downloaded
[2023-09-25 17:47:18,235][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 weights loaded into model
[2023-09-25 17:47:18,238][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init done
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/torch/cuda/__init__.py:143: UserWarning: 
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Detectron v2 is not installed
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'hydra/overrides': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'trainer/any_gpu_large_ssim_ddp_final': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'evaluator/default_inpainted': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'visualizer/directory': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'optimizers/default_optimizers': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'discriminator/pix2pixhd_nlayer': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'data/abl-04-256-mh-dist': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
[2023-09-25 17:47:19,440][saicinpainting.utils][WARNING] - Setting signal 10 handler <function print_traceback_handler at 0x7f71bf68dd90>
[2023-09-25 17:47:19,441][root][INFO] - Make training model default
[2023-09-25 17:47:19,441][saicinpainting.training.trainers.base][INFO] - BaseInpaintingTrainingModule init called
[2023-09-25 17:47:19,441][root][INFO] - Make generator ffc_resnet
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/omegaconf/resolvers/__init__.py:13: UserWarning: The `env` resolver is deprecated, see https://github.com/omry/omegaconf/issues/573
  "The `env` resolver is deprecated, see https://github.com/omry/omegaconf/issues/573"
[2023-09-25 17:47:19,963][root][INFO] - Make discriminator pix2pixhd_nlayer
[2023-09-25 17:47:20,010][root][INFO] - Make visualizer directory
[2023-09-25 17:47:20,010][root][INFO] - Make evaluator default
[2023-09-25 17:47:21,418][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init called
[2023-09-25 17:47:21,418][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 called
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/torchvision/models/inception.py:83: FutureWarning: The default weight initialization of inception_v3 will be changed in future releases of torchvision. If you wish to keep the old behavior (which leads to long initialization times due to scipy/scipy#11299), please set init_weights=True.
  ' due to scipy/scipy#11299), please set init_weights=True.', FutureWarning)
[2023-09-25 17:47:21,783][saicinpainting.evaluation.losses.fid.inception][INFO] - models.inception_v3 done
[2023-09-25 17:47:21,914][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 patching done
[2023-09-25 17:47:21,973][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 weights downloaded
Detectron v2 is not installed
[2023-09-25 17:47:22,060][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 weights loaded into model
[2023-09-25 17:47:22,066][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init done
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/torch/cuda/__init__.py:143: UserWarning: 
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'hydra/overrides': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'trainer/any_gpu_large_ssim_ddp_final': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'evaluator/default_inpainted': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'visualizer/directory': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'optimizers/default_optimizers': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'discriminator/pix2pixhd_nlayer': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/hydra/core/default_element.py:127: UserWarning: In 'data/abl-04-256-mh-dist': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
[2023-09-25 17:47:22,552][saicinpainting.utils][WARNING] - Setting signal 10 handler <function print_traceback_handler at 0x7f488a6fcd90>
[2023-09-25 17:47:22,553][root][INFO] - Make training model default
[2023-09-25 17:47:22,553][saicinpainting.training.trainers.base][INFO] - BaseInpaintingTrainingModule init called
[2023-09-25 17:47:22,553][root][INFO] - Make generator ffc_resnet
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/omegaconf/resolvers/__init__.py:13: UserWarning: The `env` resolver is deprecated, see https://github.com/omry/omegaconf/issues/573
  "The `env` resolver is deprecated, see https://github.com/omry/omegaconf/issues/573"
[2023-09-25 17:47:23,111][root][INFO] - Make discriminator pix2pixhd_nlayer
[2023-09-25 17:47:23,167][root][INFO] - Make visualizer directory
[2023-09-25 17:47:23,168][root][INFO] - Make evaluator default
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
[2023-09-25 17:47:24,471][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init called
[2023-09-25 17:47:24,472][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 called
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/torchvision/models/inception.py:83: FutureWarning: The default weight initialization of inception_v3 will be changed in future releases of torchvision. If you wish to keep the old behavior (which leads to long initialization times due to scipy/scipy#11299), please set init_weights=True.
  ' due to scipy/scipy#11299), please set init_weights=True.', FutureWarning)
[2023-09-25 17:47:24,824][saicinpainting.evaluation.losses.fid.inception][INFO] - models.inception_v3 done
[2023-09-25 17:47:24,954][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 patching done
[2023-09-25 17:47:25,004][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 weights downloaded
[2023-09-25 17:47:25,077][saicinpainting.evaluation.losses.fid.inception][INFO] - fid_inception_v3 weights loaded into model
[2023-09-25 17:47:25,082][saicinpainting.evaluation.losses.base_loss][INFO] - FIDscore init done
/home/ubuntu/anaconda3/envs/lama/lib/python3.6/site-packages/torch/cuda/__init__.py:143: UserWarning: 
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

amangupta2303 commented 1 year ago

@kwanwoo02 I think training is running, but it cannot use your GPUs properly because of a GPU architecture mismatch: your log says the installed PyTorch build supports sm_37 through sm_75, while the A100 is sm_80. To fix this, try installing a newer PyTorch build, for example PyTorch 2.0.1 + CUDA 11.8, with toolkit 12.2. It should then be able to use your A100s to the fullest, and you should also see the training progress bar.
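The mismatch reported in the warning can be checked by hand. The sketch below is only an illustration, not PyTorch's own implementation: it assumes that a build can drive a GPU when it ships at least one `sm_XX` target with the same major version as the device, which is a simplification. The helper names are hypothetical; the architecture list and the `sm_80` capability are copied from the log above.

```python
from typing import List, Tuple


def supported_sm_versions(arch_list: List[str]) -> List[int]:
    """Extract the numeric part of every 'sm_XX' entry, ignoring 'compute_XX'."""
    return [int(arch.split("_")[1]) for arch in arch_list if arch.startswith("sm_")]


def build_supports_device(capability: Tuple[int, int], arch_list: List[str]) -> bool:
    """Assumption: a build works if it has a binary for the same major version."""
    major, _minor = capability
    return any(sm // 10 == major for sm in supported_sm_versions(arch_list))


# Values taken from the warning in the log above.
arch_list_from_log = "sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37".split()
a100_capability = (8, 0)  # sm_80

print(build_supports_device(a100_capability, arch_list_from_log))  # False
```

On a real machine the two inputs would come from `torch.cuda.get_device_capability(0)` and `torch.cuda.get_arch_list()`; a result of `False` suggests the PyTorch build needs to be replaced.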

Also, you can always watch your CPU, RAM, or GPU usage if you are unsure whether training is running.
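One way to watch GPU usage from a script is to query `nvidia-smi` and parse its CSV output. This is a minimal sketch under stated assumptions: the function names, the 5% "busy" threshold, and the sample text are hypothetical, and the parser expects plain integers in every field (it does not handle values such as `[N/A]`).

```python
import subprocess
from typing import List, Tuple

# nvidia-smi query returning one CSV row per GPU: index, utilization %, memory used (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_usage(csv_text: str) -> List[Tuple[int, int, int]]:
    """Turn 'index, util, mem' rows into (index, util_percent, mem_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append((int(index), int(util), int(mem)))
    return rows


def looks_busy(rows: List[Tuple[int, int, int]], min_util_percent: int = 5) -> bool:
    """Hypothetical heuristic: at least one GPU is above the utilization threshold."""
    return any(util >= min_util_percent for _index, util, _mem in rows)


def read_gpu_usage() -> List[Tuple[int, int, int]]:
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(
        QUERY, stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return parse_gpu_usage(result.stdout)


# Hypothetical sample output for two GPUs, used here instead of a live query.
sample = "0, 97, 30123\n1, 95, 29876\n"
print(looks_busy(parse_gpu_usage(sample)))  # True
```

Calling `read_gpu_usage()` every few seconds on the training server gives a rough picture: utilization that stays near zero on all GPUs for a long time suggests the job is stalled rather than training.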

senya-ashukha commented 1 year ago

TY @amangupta2303 !