AliaksandrSiarohin / motion-cosegmentation

Reference code for "Motion-supervised Co-Part Segmentation" paper

Having problems training the model #43

Closed adeptflax closed 3 years ago

adeptflax commented 3 years ago

I'm trying to train a 512x512 face swap model. I previously trained a 512x512 first order model for faces (more info about it here). When loading that checkpoint for the segmentation model I got "RuntimeError: The size of tensor a (29) must match the size of tensor b (13) at non-singleton dimension 3" at this line: https://github.com/AliaksandrSiarohin/motion-cosegmentation/blob/c1a71a778aee67c5265cafc34cb48856dfae8829/logger.py#L22 I wrapped the failing copy in a try/except, since only one layer failed to copy; it seemed like it might still work, so I just tried it. I know now that it didn't work: the segmentation module didn't seem to train, and the segmentation map it produced for a face was just black.

I also modified the loading code to read the first order "generator" weights as the "reconstruction_module". I used the face parser, and I had to resize to 128x128 for it. This is the result of a face swap on an image at epoch 7. Screenshot (7)
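
For context, the top-level keys in the two checkpoint formats differ, which is what made the remap necessary. A quick way to see the layout (a sketch; the path is a placeholder):

import torch

# Placeholder path to the trained 512x512 first order checkpoint.
checkpoint = torch.load('first-order-512.pth.tar', map_location='cpu')

# A first order checkpoint stores its networks under keys such as
# 'generator' and 'kp_detector', not 'reconstruction_module' or
# 'segmentation_module', hence the remapping described above.
print(list(checkpoint.keys()))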

What should I do?

Also, to the creator: could you add a license to the project?

AliaksandrSiarohin commented 3 years ago

The config you are using is not correct. The number of keypoints in the first order model checkpoint (cpk) and in this config should be the same.
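
One way to check this (a sketch; the path is a placeholder, and the 'kp.' layer name assumes the KPDetector from first-order-model):

import torch

# Placeholder path to the first order checkpoint.
checkpoint = torch.load('first-order-512.pth.tar', map_location='cpu')

# In first-order-model, KPDetector's final conv is named 'kp' and has one
# output channel per keypoint, so the first dimension of 'kp.weight'
# reveals num_kp.
for name, tensor in checkpoint['kp_detector'].items():
    if name.startswith('kp.'):
        print(name, tuple(tensor.shape))  # e.g. kp.weight -> (num_kp, C, 7, 7)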

adeptflax commented 3 years ago

The visualizer_params.kp_size in the cosegmentation config was 5 and model_params.common_params.num_kp was 10, so it seems you are correct. I'll try again.

AliaksandrSiarohin commented 3 years ago

Visualizer kp_size does not matter. Send me your configs and the full error trace.

adeptflax commented 3 years ago

First order model:

dataset_params:
  root_dir: ../vox-out
  frame_shape: [512, 512, 3]
  id_sampling: True
  augmentation_params:
    flip_param:
      horizontal_flip: True
      time_flip: True
    jitter_param:
      brightness: 0.1
      contrast: 0.1
      saturation: 0.1
      hue: 0.1

model_params:
  common_params:
    num_kp: 10
    num_channels: 3
    estimate_jacobian: True
  kp_detector_params:
     temperature: 0.1
     block_expansion: 32
     max_features: 1024
     scale_factor: 0.125
     num_blocks: 5
  generator_params:
    block_expansion: 64
    max_features: 512
    num_down_blocks: 2
    num_bottleneck_blocks: 6
    estimate_occlusion_map: True
    dense_motion_params:
      block_expansion: 64
      max_features: 1024
      num_blocks: 5
      scale_factor: 0.125
  discriminator_params:
    scales: [1]
    block_expansion: 32
    max_features: 512
    num_blocks: 4
    sn: True

train_params:
  num_epochs: 100
  num_repeats: 75
  epoch_milestones: [60, 90]
  lr_generator: 2.0e-4
  lr_discriminator: 2.0e-4
  lr_kp_detector: 2.0e-4
  batch_size: 6
  scales: [1, 0.5, 0.25, 0.125]
  checkpoint_freq: 5
  transform_params:
    sigma_affine: 0.05
    sigma_tps: 0.005
    points_tps: 5
  loss_weights:
    generator_gan: 0
    discriminator_gan: 1
    feature_matching: [10, 10, 10, 10]
    perceptual: [10, 10, 10, 10, 10]
    equivariance_value: 10
    equivariance_jacobian: 10

reconstruction_params:
  num_videos: 1000
  format: '.mp4'

animate_params:
  num_pairs: 50
  format: '.mp4'
  normalization_params:
    adapt_movement_scale: False
    use_relative_movement: True
    use_relative_jacobian: True

visualizer_params:
  kp_size: 5
  draw_border: True
  colormap: 'gist_rainbow'

motion-cosegmentation:

# Dataset parameters
dataset_params:
  # Path to data, data can be stored in several formats: .mp4 or .gif videos, stacked .png images or folders with frames.
  root_dir: ../vox-out
  # Image shape, needed for stacked .png format.
  image_shape: [512, 512, 3]
  # In the case of VoxCeleb or TaiChi, a single video can be split into many chunks, or there may be several videos for a single person.
  # In this case an epoch can be a pass over different videos (if id_sampling=True) or over different chunks (if id_sampling=False)
  # If the name of the video is '12335#adsbf.mp4', the id is assumed to be 12335
  id_sampling: True
  # Augmentation parameters; see augmentation.py for all possible augmentations
  augmentation_params:
    flip_param:
      horizontal_flip: True
      time_flip: True
    jitter_param:
      brightness: 0.1
      contrast: 0.1
      saturation: 0.1
      hue: 0.1

# Defines model architecture
model_params:
  common_params:
    # Number of segments
    num_segments: 10
    # Number of channels per image
    num_channels: 3
    # Whether to estimate the affine part or use only shifts
    estimate_affine_part: True
  segmentation_module_params:
     # Softmax temperature for shift heatmaps
     temperature: 0.1
     # Number of features multiplier
     block_expansion: 32
     # Maximum allowed number of features
     max_features: 1024
     # Number of blocks in the U-Net. Can be increased or decreased depending on resolution.
     num_blocks: 5
     # Segmentation is predicted on smaller images for better performance;
     # scale_factor=0.25 means that a 256x256 image will be resized to 64x64
     scale_factor: 0.125
  reconstruction_module_params:
    # Number of features multiplier
    block_expansion: 64
    # Maximum allowed number of features
    max_features: 512
    # Number of downsampling blocks in the Johnson architecture.
    # Can be increased or decreased depending on resolution.
    num_down_blocks: 2
    # Number of ResBlocks in the Johnson architecture.
    num_bottleneck_blocks: 6
    # Use visibility map or not
    estimate_visibility: True

# Parameters of training
train_params:
  num_workers: 2
  # Number of training epochs
  num_epochs: 20
  # For better I/O performance when the number of videos is small, the number of epochs can be multiplied by this number.
  # Thus effectively with num_repeats=100 each epoch is 100 times larger.
  num_repeats: 50
  # Learning rates
  lr_segmentation: 2.0e-4
  lr_reconstruction: 2.0e-4
  lr_reconstruction_module: 2.0e-4
  lr_segmentation_module: 2.0e-4

  batch_size: 6
  # Scales for the perceptual pyramid loss. If scales = [1, 0.5, 0.25, 0.125] and the image resolution is 256x256,
  # then the loss will be computed on resolutions 256x256, 128x128, 64x64, 32x32.
  scales: [1, 0.5, 0.25, 0.125]
  # Save checkpoint this frequently. If checkpoint_freq=50, checkpoint will be saved every 50 epochs.
  checkpoint_freq: 1
  # Parameters of transform for equivariance loss
  transform_params:
    # Sigma for affine part
    sigma_affine: 0.05
    # Sigma for deformation part
    sigma_tps: 0.005
    # Number of points in the deformation grid
    points_tps: 5
  loss_weights:
    equivariance: 10
    perceptual: [10, 10, 10, 10, 10]

# Visualization parameters
visualizer_params:
  # Draw keypoints (shifts in affine transformations) of this size, increase or decrease depending on resolution
  kp_size: 5
  # Draw white border around images
  draw_border: True
  # Color map for keypoints
  colormap: 'gist_rainbow'

AliaksandrSiarohin commented 3 years ago

And error trace?

adeptflax commented 3 years ago

I didn't keep the logs of it. I'll rerun the code shortly and post them here.

adeptflax commented 3 years ago

I added print('name', name) before the statement that errors. It failed on the "down.weight" layer.

import torch

def partial_state_dict_load(module, state_dict):
    own_state = module.state_dict()
    for name, param in state_dict.items():
        if name not in own_state:
            continue

        if isinstance(param, torch.nn.Parameter):
            # backwards compatibility for serialized parameters
            param = param.data
        print('name', name)  # debug: report each layer before copying
        own_state[name].copy_(param)
train.py:85: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
Use predefined train-test split.
Training...
name first.conv.weight
name first.conv.bias
name first.norm.weight
name first.norm.bias
name first.norm.running_mean
name first.norm.running_var
name first.norm.num_batches_tracked
name down_blocks.0.conv.weight
name down_blocks.0.conv.bias
name down_blocks.0.norm.weight
name down_blocks.0.norm.bias
name down_blocks.0.norm.running_mean
name down_blocks.0.norm.running_var
name down_blocks.0.norm.num_batches_tracked
name down_blocks.1.conv.weight
name down_blocks.1.conv.bias
name down_blocks.1.norm.weight
name down_blocks.1.norm.bias
name down_blocks.1.norm.running_mean
name down_blocks.1.norm.running_var
name down_blocks.1.norm.num_batches_tracked
name up_blocks.0.conv.weight
name up_blocks.0.conv.bias
name up_blocks.0.norm.weight
name up_blocks.0.norm.bias
name up_blocks.0.norm.running_mean
name up_blocks.0.norm.running_var
name up_blocks.0.norm.num_batches_tracked
name up_blocks.1.conv.weight
name up_blocks.1.conv.bias
name up_blocks.1.norm.weight
name up_blocks.1.norm.bias
name up_blocks.1.norm.running_mean
name up_blocks.1.norm.running_var
name up_blocks.1.norm.num_batches_tracked
name bottleneck.r0.conv1.weight
name bottleneck.r0.conv1.bias
name bottleneck.r0.conv2.weight
name bottleneck.r0.conv2.bias
name bottleneck.r0.norm1.weight
name bottleneck.r0.norm1.bias
name bottleneck.r0.norm1.running_mean
name bottleneck.r0.norm1.running_var
name bottleneck.r0.norm1.num_batches_tracked
name bottleneck.r0.norm2.weight
name bottleneck.r0.norm2.bias
name bottleneck.r0.norm2.running_mean
name bottleneck.r0.norm2.running_var
name bottleneck.r0.norm2.num_batches_tracked
name bottleneck.r1.conv1.weight
name bottleneck.r1.conv1.bias
name bottleneck.r1.conv2.weight
name bottleneck.r1.conv2.bias
name bottleneck.r1.norm1.weight
name bottleneck.r1.norm1.bias
name bottleneck.r1.norm1.running_mean
name bottleneck.r1.norm1.running_var
name bottleneck.r1.norm1.num_batches_tracked
name bottleneck.r1.norm2.weight
name bottleneck.r1.norm2.bias
name bottleneck.r1.norm2.running_mean
name bottleneck.r1.norm2.running_var
name bottleneck.r1.norm2.num_batches_tracked
name bottleneck.r2.conv1.weight
name bottleneck.r2.conv1.bias
name bottleneck.r2.conv2.weight
name bottleneck.r2.conv2.bias
name bottleneck.r2.norm1.weight
name bottleneck.r2.norm1.bias
name bottleneck.r2.norm1.running_mean
name bottleneck.r2.norm1.running_var
name bottleneck.r2.norm1.num_batches_tracked
name bottleneck.r2.norm2.weight
name bottleneck.r2.norm2.bias
name bottleneck.r2.norm2.running_mean
name bottleneck.r2.norm2.running_var
name bottleneck.r2.norm2.num_batches_tracked
name bottleneck.r3.conv1.weight
name bottleneck.r3.conv1.bias
name bottleneck.r3.conv2.weight
name bottleneck.r3.conv2.bias
name bottleneck.r3.norm1.weight
name bottleneck.r3.norm1.bias
name bottleneck.r3.norm1.running_mean
name bottleneck.r3.norm1.running_var
name bottleneck.r3.norm1.num_batches_tracked
name bottleneck.r3.norm2.weight
name bottleneck.r3.norm2.bias
name bottleneck.r3.norm2.running_mean
name bottleneck.r3.norm2.running_var
name bottleneck.r3.norm2.num_batches_tracked
name bottleneck.r4.conv1.weight
name bottleneck.r4.conv1.bias
name bottleneck.r4.conv2.weight
name bottleneck.r4.conv2.bias
name bottleneck.r4.norm1.weight
name bottleneck.r4.norm1.bias
name bottleneck.r4.norm1.running_mean
name bottleneck.r4.norm1.running_var
name bottleneck.r4.norm1.num_batches_tracked
name bottleneck.r4.norm2.weight
name bottleneck.r4.norm2.bias
name bottleneck.r4.norm2.running_mean
name bottleneck.r4.norm2.running_var
name bottleneck.r4.norm2.num_batches_tracked
name bottleneck.r5.conv1.weight
name bottleneck.r5.conv1.bias
name bottleneck.r5.conv2.weight
name bottleneck.r5.conv2.bias
name bottleneck.r5.norm1.weight
name bottleneck.r5.norm1.bias
name bottleneck.r5.norm1.running_mean
name bottleneck.r5.norm1.running_var
name bottleneck.r5.norm1.num_batches_tracked
name bottleneck.r5.norm2.weight
name bottleneck.r5.norm2.bias
name bottleneck.r5.norm2.running_mean
name bottleneck.r5.norm2.running_var
name bottleneck.r5.norm2.num_batches_tracked
name final.weight
name final.bias
name predictor.encoder.down_blocks.0.conv.weight
name predictor.encoder.down_blocks.0.conv.bias
name predictor.encoder.down_blocks.0.norm.weight
name predictor.encoder.down_blocks.0.norm.bias
name predictor.encoder.down_blocks.0.norm.running_mean
name predictor.encoder.down_blocks.0.norm.running_var
name predictor.encoder.down_blocks.0.norm.num_batches_tracked
name predictor.encoder.down_blocks.1.conv.weight
name predictor.encoder.down_blocks.1.conv.bias
name predictor.encoder.down_blocks.1.norm.weight
name predictor.encoder.down_blocks.1.norm.bias
name predictor.encoder.down_blocks.1.norm.running_mean
name predictor.encoder.down_blocks.1.norm.running_var
name predictor.encoder.down_blocks.1.norm.num_batches_tracked
name predictor.encoder.down_blocks.2.conv.weight
name predictor.encoder.down_blocks.2.conv.bias
name predictor.encoder.down_blocks.2.norm.weight
name predictor.encoder.down_blocks.2.norm.bias
name predictor.encoder.down_blocks.2.norm.running_mean
name predictor.encoder.down_blocks.2.norm.running_var
name predictor.encoder.down_blocks.2.norm.num_batches_tracked
name predictor.encoder.down_blocks.3.conv.weight
name predictor.encoder.down_blocks.3.conv.bias
name predictor.encoder.down_blocks.3.norm.weight
name predictor.encoder.down_blocks.3.norm.bias
name predictor.encoder.down_blocks.3.norm.running_mean
name predictor.encoder.down_blocks.3.norm.running_var
name predictor.encoder.down_blocks.3.norm.num_batches_tracked
name predictor.encoder.down_blocks.4.conv.weight
name predictor.encoder.down_blocks.4.conv.bias
name predictor.encoder.down_blocks.4.norm.weight
name predictor.encoder.down_blocks.4.norm.bias
name predictor.encoder.down_blocks.4.norm.running_mean
name predictor.encoder.down_blocks.4.norm.running_var
name predictor.encoder.down_blocks.4.norm.num_batches_tracked
name predictor.decoder.up_blocks.0.conv.weight
name predictor.decoder.up_blocks.0.conv.bias
name predictor.decoder.up_blocks.0.norm.weight
name predictor.decoder.up_blocks.0.norm.bias
name predictor.decoder.up_blocks.0.norm.running_mean
name predictor.decoder.up_blocks.0.norm.running_var
name predictor.decoder.up_blocks.0.norm.num_batches_tracked
name predictor.decoder.up_blocks.1.conv.weight
name predictor.decoder.up_blocks.1.conv.bias
name predictor.decoder.up_blocks.1.norm.weight
name predictor.decoder.up_blocks.1.norm.bias
name predictor.decoder.up_blocks.1.norm.running_mean
name predictor.decoder.up_blocks.1.norm.running_var
name predictor.decoder.up_blocks.1.norm.num_batches_tracked
name predictor.decoder.up_blocks.2.conv.weight
name predictor.decoder.up_blocks.2.conv.bias
name predictor.decoder.up_blocks.2.norm.weight
name predictor.decoder.up_blocks.2.norm.bias
name predictor.decoder.up_blocks.2.norm.running_mean
name predictor.decoder.up_blocks.2.norm.running_var
name predictor.decoder.up_blocks.2.norm.num_batches_tracked
name predictor.decoder.up_blocks.3.conv.weight
name predictor.decoder.up_blocks.3.conv.bias
name predictor.decoder.up_blocks.3.norm.weight
name predictor.decoder.up_blocks.3.norm.bias
name predictor.decoder.up_blocks.3.norm.running_mean
name predictor.decoder.up_blocks.3.norm.running_var
name predictor.decoder.up_blocks.3.norm.num_batches_tracked
name predictor.decoder.up_blocks.4.conv.weight
name predictor.decoder.up_blocks.4.conv.bias
name predictor.decoder.up_blocks.4.norm.weight
name predictor.decoder.up_blocks.4.norm.bias
name predictor.decoder.up_blocks.4.norm.running_mean
name predictor.decoder.up_blocks.4.norm.running_var
name predictor.decoder.up_blocks.4.norm.num_batches_tracked
name down.weight
Traceback (most recent call last):
  File "train.py", line 110, in <module>
    train(config, reconstruction_module, segmentation_module, opt.checkpoint, log_dir, dataset, opt.device_ids)
  File "train.py", line 33, in train
    start_epoch = Logger.load_cpk(checkpoint, reconstruction_module, segmentation_module,
  File "/workspace/motion-cosegmentation/logger.py", line 94, in load_cpk
    load_segmentation_module(segmentation_module, checkpoint)
  File "/workspace/motion-cosegmentation/logger.py", line 35, in load_segmentation_module
    partial_state_dict_load(module, checkpoint['kp_detector'])
  File "/workspace/motion-cosegmentation/logger.py", line 23, in partial_state_dict_load
    own_state[name].copy_(param)
RuntimeError: The size of tensor a (29) must match the size of tensor b (13) at non-singleton dimension 3
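
As a debugging aside, a shape check before copy_ would report the offending layer with both shapes instead of raising inside copy_. A sketch of such a variant (not the repo's code):

def partial_state_dict_load_checked(module, state_dict):
    own_state = module.state_dict()
    for name, param in state_dict.items():
        if name not in own_state:
            continue
        if isinstance(param, torch.nn.Parameter):
            param = param.data
        if own_state[name].shape != param.shape:
            # Here it would report 'down.weight' with sizes 13 vs 29
            # on the last dimensions, then continue loading.
            print('shape mismatch at', name,
                  tuple(param.shape), '->', tuple(own_state[name].shape))
            continue
        own_state[name].copy_(param)
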
AliaksandrSiarohin commented 3 years ago

Have you changed the antialiasing interpolation code?

adeptflax commented 3 years ago

no

adeptflax commented 3 years ago

Could there be a module version problem? I trained the first order model using the latest version of everything, and I'm running motion-cosegmentation with the latest version of everything as well.

adeptflax commented 3 years ago

I'll try to reproduce the problem in a Python script.

adeptflax commented 3 years ago

Is it supposed to copy everything except the last Conv2d layer?

AliaksandrSiarohin commented 3 years ago

Everything should be copied, except the self.segmentation layer.

AliaksandrSiarohin commented 3 years ago

"The training script crashed for some unknown reason right before completion but it mostly seems to work just fine. I changed the sigma to 1.5 as described here." Did you change sigma in motion segmentation code?

adeptflax commented 3 years ago

no

AliaksandrSiarohin commented 3 years ago

You changed it when training the first order model, so try changing it here as well.
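
This also explains the 29 vs 13 in the error: 'down.weight' is the Gaussian kernel of the anti-aliasing downsampler, and its size follows from sigma. A sketch of the arithmetic, assuming the AntiAliasInterpolation2d from first-order-model, where kernel_size = 2 * round(4 * sigma) + 1 and the default sigma is derived from the scale factor:

def kernel_size(sigma):
    # Kernel width used by AntiAliasInterpolation2d (assumed formula).
    return 2 * round(4 * sigma) + 1

scale = 0.125                        # scale_factor from both configs
default_sigma = (1 / scale - 1) / 2  # 3.5 for the freshly built module
print(kernel_size(default_sigma))    # 29 -> size of the new 'down.weight'
print(kernel_size(1.5))              # 13 -> size in a checkpoint trained with sigma 1.5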

adeptflax commented 3 years ago

I'll try that

adeptflax commented 3 years ago

That worked! \O/