JiahuiYu / generative_inpainting

DeepFill v1/v2 with Contextual Attention and Gated Convolution, CVPR 2018, and ICCV 2019 Oral
http://jiahuiyu.com/deepfill/

Training on own dataset #488

Open akanksh2kb opened 3 years ago

akanksh2kb commented 3 years ago

Can someone help me with training? I need to know the folder hierarchy of the dataset. Should the masks go in a single folder?

While training I do not get any error, but training does not progress at all.

I tried giving input images of size 256x256.

If I knew the steps to train, it would be really helpful, as I am stuck. I edited the inpaint.yml file for my data:

# =========================== Basic Settings ===========================
# machine info
num_gpus_per_job: 1  # number of gpus each job need
num_cpus_per_job: 4  # number of gpus each job need
num_hosts_per_job: 1
memory_per_job: 32  # number of gpus each job need
gpu_type: 'nvidia-tesla-p100'

# parameters
name: places2_gated_conv_v100  # any name
model_restore: ''  # logs/places2_gated_conv
dataset: 'peak'  # 'tmnist', 'dtd', 'places2', 'celeba', 'imagenet', 'cityscapes'
random_crop: False  # Set to false when dataset is 'celebahq', meaning only resize the images to img_shapes, instead of crop img_shapes from a larger raw image. This is useful when you train on images with different resolutions like places2. In these cases, please set random_crop to true.
val: False  # true if you want to view validation results in tensorboard
log_dir: logs/full_model_celeba_hq_256

gan: 'sngan'
gan_loss_alpha: 1
gan_with_mask: True
discounted_mask: True
random_seed: False
padding: 'SAME'

# training
train_spe: 4000
max_iters: 100000000
viz_max_out: 10
val_psteps: 2000

# data
data_flist:
  # https://github.com/jiahuiyu/progressive_growing_of_gans_tf
  celebahq: [
    'data/celeba_hq/train_shuffled.flist',
    'data/celeba_hq/validation_static_view.flist'
  ]
  # http://mmlab.ie.cuhk.edu.hk/projects/celeba.html, please use random_crop: True
  celeba: [
    'data/celeba/train_shuffled.flist',
    'data/celeba/validation_static_view.flist'
  ]
  # http://places2.csail.mit.edu/, please download the high-resolution dataset and use random_crop: True
  places2: [
    'data/places2/train_shuffled.flist',
    'data/places2/validation_static_view.flist'
  ]
  # http://www.image-net.org/, please use random_crop: True
  imagenet: [
    'data/imagenet/train_shuffled.flist',
    'data/imagenet/validation_static_view.flist',
  ]
  peak: [
    'data/peak/train_shuffled.flist',
    'data/peak/validation_shuffled.flist',
  ]

static_view_size: 30
img_shapes: [256, 256, 3]
height: 128
width: 128
max_delta_height: 32
max_delta_width: 32
batch_size: 16
vertical_margin: 0
horizontal_margin: 0

# loss
ae_loss: True
l1_loss: True
l1_loss_alpha: 1.

# to tune
guided: False
edge_threshold: 0.6

Thanks, akanksh
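On the mask question: the config above has no entry for a mask folder, and the keys height, width, max_delta_height, max_delta_width, vertical_margin and horizontal_margin read like the parameters of a randomly placed rectangular hole. The sketch below illustrates that reading in plain Python. It is an assumption about how those values are used, not the repository's code, and the function names are hypothetical.

```python
import random


def random_bbox(img_h, img_w, hole_h, hole_w, v_margin=0, h_margin=0, rng=random):
    """Pick the top-left corner of a hole_h x hole_w box that fits in the image."""
    max_top = img_h - v_margin - hole_h
    max_left = img_w - h_margin - hole_w
    top = rng.randint(v_margin, max_top)    # randint is inclusive on both ends
    left = rng.randint(h_margin, max_left)
    return top, left, hole_h, hole_w


def bbox_to_mask(img_h, img_w, bbox, max_delta_h=0, max_delta_w=0, rng=random):
    """Return a 0/1 mask (list of rows); the box is shrunk by a random delta."""
    top, left, h, w = bbox
    dh = rng.randint(0, max_delta_h // 2)   # rows trimmed from top and bottom
    dw = rng.randint(0, max_delta_w // 2)   # columns trimmed from left and right
    mask = [[0] * img_w for _ in range(img_h)]
    for y in range(top + dh, top + h - dh):
        for x in range(left + dw, left + w - dw):
            mask[y][x] = 1
    return mask


# Values taken from the inpaint.yml posted above.
example_bbox = random_bbox(256, 256, 128, 128)
example_mask = bbox_to_mask(256, 256, example_bbox, max_delta_h=32, max_delta_w=32)
print("hole pixels:", sum(sum(row) for row in example_mask))
```

If this reading is right, only images need to be listed in the flist files, and the hole is drawn fresh for every batch.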

akanksh2kb commented 3 years ago

Training does not proceed after this:

Trigger callback: Total counts of trainable weights: 9999294.
Total size of trainable weights: 0G 9M 548K 958B (Assuming32-bit data type.)
2021-01-19 14:29:35.994320: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

akanksh2kb commented 3 years ago

Data folder hierarchy:

data --> peak --> ['flist.sh', 'gen_flist.py', 'train_shuffled.flist', 'training_data', 'validation_shuffled.flist']

training_data --> ['training', 'validation']
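For a layout like the one above, each .flist file named in inpaint.yml is presumably a plain text file with one image path per line. A minimal sketch for producing such files follows. The helper names, the accepted extensions and the fixed shuffle seed are assumptions for illustration, and this is not the gen_flist.py from the listing.

```python
import os
import random

# Assumed set of image extensions; adjust to the dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def list_images(folder):
    """Collect image paths under folder, recursively, in sorted order."""
    paths = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTS:
                paths.append(os.path.join(root, name))
    return sorted(paths)


def write_flist(folder, flist_path, shuffle=True, seed=0):
    """Write one image path per line to flist_path; return the number written."""
    paths = list_images(folder)
    if shuffle:
        random.Random(seed).shuffle(paths)
    with open(flist_path, "w") as f:
        for path in paths:
            f.write(path + "\n")
    return len(paths)
```

With the folders above, a call such as `write_flist("data/peak/training_data/training", "data/peak/train_shuffled.flist")` would produce the training list, and the same call on the `validation` folder the validation list. A return value of 0 means no images were found, which is worth ruling out before waiting on a silent training run.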

gourango-modak commented 3 years ago

I am facing the same issue. Can anyone help?

Dominic-ZZ commented 3 years ago

Same issue with tf-gpu 1.6.0:

- weight name: discriminator/sn_patch_gan/conv6/kernel:0, shape: [5, 5, 256, 256], size: 1638400
- weight name: discriminator/sn_patch_gan/conv6/bias:0, shape: [256], size: 256
Trigger callback: Total counts of trainable weights: 9999294.
Total size of trainable weights: 0G 9M 548K 958B (Assuming32-bit data type.)

jiao0805 commented 3 years ago

If you are stuck after the message [Trigger callback: Total counts of trainable weights: 9999294. Total size of trainable weights: 0G 9M 548K 958B (Assuming32-bit data type.)] and see no further error messages, training is probably running fine and simply has not printed anything yet.

Please check these parameters in the "inpaint.yml" file:

train_spe: 4000
val_psteps: 2000

train_spe controls how often the checkpoint is saved. val_psteps controls how often TensorBoard records are written.

If you are training on only one GPU, setting train_spe to 4000 and val_psteps to 2000 means it takes a really long time before you see any output. In my case, it took 2 hours to reach 4000 steps (one train_spe) on my 1080Ti.

So try the following settings to see what happens:

train_spe: 4
val_psteps: 10

It works for me! Good luck!
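The waiting time implied by those settings can be estimated from the one figure reported above (4000 steps in about 2 hours on a single 1080Ti). The following is a back-of-the-envelope sketch with hypothetical helper names; the per-step time will differ on other hardware and batch sizes.

```python
def seconds_per_step(elapsed_seconds, steps):
    """Average wall-clock time of one training step."""
    return elapsed_seconds / steps


def hours_until(step_count, sec_per_step):
    """Hours needed to reach step_count at the given speed."""
    return step_count * sec_per_step / 3600.0


# Figure from the comment above: 4000 steps took about 2 hours.
sec_per_step = seconds_per_step(2 * 3600, 4000)  # about 1.8 s per step

for label, steps in [("val_psteps: 2000", 2000), ("train_spe: 4000", 4000),
                     ("val_psteps: 10", 10), ("train_spe: 4", 4)]:
    minutes = hours_until(steps, sec_per_step) * 60
    print("%-18s -> first output after roughly %.1f minutes" % (label, minutes))
```

At that speed the default values mean about an hour before the first summary and two hours before the first checkpoint, which is why the run looks frozen.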

CCC0621708 commented 3 years ago

I started training. A "Traceback (most recent call last)" error is reported long after CUDA has loaded successfully. What should I do?

Demo122 commented 3 years ago

Training does not proceed after this:

Trigger callback: Total counts of trainable weights: 9999294.
Total size of trainable weights: 0G 9M 548K 958B (Assuming32-bit data type.)
2021-01-19 14:29:35.994320: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

I encountered this as well. Did you solve it?

Demo122 commented 3 years ago

Same issue with tf-gpu 1.6.0:

- weight name: discriminator/sn_patch_gan/conv6/kernel:0, shape: [5, 5, 256, 256], size: 1638400
- weight name: discriminator/sn_patch_gan/conv6/bias:0, shape: [256], size: 256
Trigger callback: Total counts of trainable weights: 9999294.
Total size of trainable weights: 0G 9M 548K 958B (Assuming32-bit data type.)

I encountered this as well. Did you solve it? Please help me.

vonchenplus commented 3 years ago

It takes 2000 steps before the summary is saved, so please be patient. Alternatively, you can open the log to follow the training process.
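One way to tell a slow run from a dead one while waiting is to check whether anything under the log directory is still being written. This is a rough sketch with hypothetical helper names, and it assumes the trainer writes its files under the log_dir set in inpaint.yml.

```python
import os
import time


def newest_mtime(log_dir):
    """Return the most recent modification time of any file under log_dir."""
    newest = None
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def seconds_since_last_write(log_dir, now=None):
    """Seconds since the last file write under log_dir, or None if no files."""
    newest = newest_mtime(log_dir)
    if newest is None:
        return None
    current = time.time() if now is None else now
    return current - newest
```

For example, `seconds_since_last_write("logs/full_model_celeba_hq_256")` returning None means nothing has been written yet. A value that keeps growing well past the expected summary interval suggests the run has stalled and is not merely slow.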