chaofengc / PSFRGAN

PyTorch codes for "Progressive Semantic-Aware Style Transformation for Blind Face Restoration", CVPR2021

Error in training #32

Closed LorenzoAgnolucci closed 3 years ago

LorenzoAgnolucci commented 3 years ago

Hello,

I am trying to train the model from scratch on a custom dataset.

When I run the command:

```
python train.py --gpus 2 --model enhance --name scratch --g_lr 0.0001 --d_lr 0.0004 --beta1 0.5 --gan_mode 'hinge' --lambda_pix 10 --lambda_fm 10 --lambda_ss 1000 --Dinput_nc 22 --D_num 3 --n_layers_D 4 --batch_size 1 --dataset ffhq --dataroot original_test/ --visual_freq 100 --print_freq 10
```

I get this error:

```
----------------- Options ---------------
D_num: 3
Dinput_nc: 22                [default: 3]
Dnorm: in
Gin_size: 512                [default: 512]
Gnorm: spade
Gout_size: 512               [default: 512]
Pimg_size: 512               [default: 512]
Pnorm: bn
batch_size: 1                [default: 16]
beta1: 0.5
checkpoints_dir: ./check_points
continue_train: False
crop_size: 256
d_lr: 0.0004
data_device: cuda:1          [default: None]
dataroot: original_test/     [default: None]
dataset_name: ffhq           [default: single]
debug: False
device: cuda:0               [default: None]
epoch: latest
epoch_count: 1
g_lr: 0.0001
gan_mode: hinge
gpu_ids: [0, 1]              [default: None]
gpus: 2                      [default: 1]
init_gain: 0.02
init_type: normal
input_nc: 3
isTrain: True                [default: None]
lambda_fm: 10.0
lambda_g: 1.0
lambda_pcp: 0.0
lambda_pix: 10.0
lambda_ss: 1000.0
load_iter: 0                 [default: 0]
load_size: 512
lr: 0.0002
lr_decay_gamma: 1
lr_decay_iters: 50
lr_policy: step
max_dataset_size: inf
model: enhance
n_epochs: 100
n_epochs_decay: 100
n_layers_D: 4
name: scratch                [default: experiment_name]
ndf: 64
ngf: 64
niter_decay: 100
no_flip: False
no_strict_load: False
num_threads: 8
output_nc: 3
parse_net_weight: ./pretrain_models/parse_multi_iter_90000.pth
phase: train
preprocess: none
print_freq: 10               [default: 100]
resume_epoch: 0
resume_iter: 0
save_by_iter: False
save_epoch_freq: 5
save_iter_freq: 5000
save_latest_freq: 500
seed: 123
serial_batches: False
suffix:
total_epochs: 50
verbose: False
visual_freq: 100             [default: 400]
----------------- End -------------------
dataset [FFHQDataset] was created
The number of training images = 2513
initialize network with normal
model [EnhanceModel] was created
---------- Networks initialized -------------
[Network G] Total number of parameters : 45.957 M
[Network D] Total number of parameters : 18.872 M
```

```
Start training from epoch: 00000; iter: 0000000
/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run /usr/bin/nvidia-modprobe --help for usage information.

/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run /usr/bin/nvidia-modprobe --help for usage information.

/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run /usr/bin/nvidia-modprobe --help for usage information.

/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run /usr/bin/nvidia-modprobe --help for usage information.

Traceback (most recent call last):
  File "train.py", line 78, in <module>
    train(opt)
  File "train.py", line 39, in train
    model.forward(), timer.update_time('Forward')
  File "/homes/placeholder/PSFR-GAN/models/enhance_model.py", line 93, in forward
    self.real_D_results = self.netD(torch.cat((self.img_HR, self.hr_mask), dim=1), return_feat=True)
  File "/homes/placeholder/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/homes/placeholder/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/homes/placeholder/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/homes/placeholder/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/homes/placeholder/miniconda3/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/homes/placeholder/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/homes/placeholder/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'input'
```

I am using torch==1.5.1 and torchvision==0.6.1. Could you please help me?

chaofengc commented 3 years ago

It seems that there is something wrong with your GPU setup. Please make sure that PyTorch with GPU support runs correctly on your device.
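
For example, a quick sanity check like the one below (a generic PyTorch snippet, not part of this repo) should run without errors and print both GPUs before launching a 2-GPU training run. If the `nvidia-modprobe` errors also show up here, the problem is in the driver/CUDA installation rather than in PSFRGAN:

```python
# Generic multi-GPU sanity check (assumes 2 GPUs, as in `--gpus 2`).
import torch
import torch.nn as nn

print(torch.cuda.is_available())      # should print True
print(torch.cuda.device_count())      # should print 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

# Tiny DataParallel forward pass, mirroring how netD is wrapped during training.
model = nn.DataParallel(nn.Linear(8, 4).cuda(), device_ids=[0, 1])
out = model(torch.randn(4, 8).cuda())
print(out.shape)                      # expected: torch.Size([4, 4])
```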

LorenzoAgnolucci commented 3 years ago

You are right, there was indeed something wrong with my GPU. Thanks!