NVlabs / stylegan2-ada-pytorch

StyleGAN2-ADA - Official PyTorch implementation
https://arxiv.org/abs/2006.06676

Mapping module has only 2 layers? #151

Open gegogi opened 2 years ago

gegogi commented 2 years ago

Describe the bug
I am trying to convert my model to a format compatible with the sefa (https://github.com/genforce/sefa) layer-naming convention. During the conversion I found that my model has only 2 layers in the latent mapping module (z -> w). From the paper I expected this to be 8, so the conversion fails. I trained the model by transfer learning, starting from pretrained/transfer-learning-source-nets/ffhq-res256-mirror-paper256-noaug.pkl, which itself has 8 mapping layers as expected, so it is strange that my model ended up with only 2. You can see there are only two fc* layers in the log below.

(base) C:\Users\kyungk\Projects\genforce>python convert_model.py stylegan2ada_pth --source_model_path=..\stylegan2-ada-pytorch\trained\female_face_256_3.pkl --verbose_log --save_test_image
========================================
Loading source weights from `..\stylegan2-ada-pytorch\trained\female_face_256_3.pkl` ...
Successfully loaded!
--------------------
Converting source weights (G) to target ...
G_vars.keys()=['mapping.fc0.bias', 'mapping.fc0.weight', 'mapping.fc1.bias', 'mapping.fc1.weight', 'mapping.w_avg', 'synthesis.b128.conv0.affine.bias', 'synthesis.b128.conv0.affine.weight', 'synthesis.b128.conv0.bias', 'synthesis.b128.conv0.noise_const', 'synthesis.b128.conv0.noise_strength', 'synthesis.b128.conv0.resample_filter', 'synthesis.b128.conv0.weight', 'synthesis.b128.conv1.affine.bias', 'synthesis.b128.conv1.affine.weight', 'synthesis.b128.conv1.bias', 'synthesis.b128.conv1.noise_const', 'synthesis.b128.conv1.noise_strength', 'synthesis.b128.conv1.resample_filter', 'synthesis.b128.conv1.weight', 'synthesis.b128.resample_filter', 'synthesis.b128.torgb.affine.bias', 'synthesis.b128.torgb.affine.weight', 'synthesis.b128.torgb.bias', 'synthesis.b128.torgb.weight', 'synthesis.b16.conv0.affine.bias', 'synthesis.b16.conv0.affine.weight', 'synthesis.b16.conv0.bias', 'synthesis.b16.conv0.noise_const', 'synthesis.b16.conv0.noise_strength', 'synthesis.b16.conv0.resample_filter', 'synthesis.b16.conv0.weight', 'synthesis.b16.conv1.affine.bias', 'synthesis.b16.conv1.affine.weight', 'synthesis.b16.conv1.bias', 'synthesis.b16.conv1.noise_const', 'synthesis.b16.conv1.noise_strength', 'synthesis.b16.conv1.resample_filter', 'synthesis.b16.conv1.weight', 'synthesis.b16.resample_filter', 'synthesis.b16.torgb.affine.bias', 'synthesis.b16.torgb.affine.weight', 'synthesis.b16.torgb.bias', 'synthesis.b16.torgb.weight', 'synthesis.b256.conv0.affine.bias', 'synthesis.b256.conv0.affine.weight', 'synthesis.b256.conv0.bias', 'synthesis.b256.conv0.noise_const', 'synthesis.b256.conv0.noise_strength', 'synthesis.b256.conv0.resample_filter', 'synthesis.b256.conv0.weight', 'synthesis.b256.conv1.affine.bias', 'synthesis.b256.conv1.affine.weight', 'synthesis.b256.conv1.bias', 'synthesis.b256.conv1.noise_const', 'synthesis.b256.conv1.noise_strength', 'synthesis.b256.conv1.resample_filter', 'synthesis.b256.conv1.weight', 'synthesis.b256.resample_filter', 'synthesis.b256.torgb.affine.bias', 'synthesis.b256.torgb.affine.weight', 'synthesis.b256.torgb.bias', 'synthesis.b256.torgb.weight', 'synthesis.b32.conv0.affine.bias', 'synthesis.b32.conv0.affine.weight', 'synthesis.b32.conv0.bias', 'synthesis.b32.conv0.noise_const', 'synthesis.b32.conv0.noise_strength', 'synthesis.b32.conv0.resample_filter', 'synthesis.b32.conv0.weight', 'synthesis.b32.conv1.affine.bias', 'synthesis.b32.conv1.affine.weight', 'synthesis.b32.conv1.bias', 'synthesis.b32.conv1.noise_const', 'synthesis.b32.conv1.noise_strength', 'synthesis.b32.conv1.resample_filter', 'synthesis.b32.conv1.weight', 'synthesis.b32.resample_filter', 'synthesis.b32.torgb.affine.bias', 'synthesis.b32.torgb.affine.weight', 'synthesis.b32.torgb.bias', 'synthesis.b32.torgb.weight', 'synthesis.b4.const', 'synthesis.b4.conv1.affine.bias', 'synthesis.b4.conv1.affine.weight', 'synthesis.b4.conv1.bias', 'synthesis.b4.conv1.noise_const', 'synthesis.b4.conv1.noise_strength', 'synthesis.b4.conv1.resample_filter', 'synthesis.b4.conv1.weight', 'synthesis.b4.resample_filter', 'synthesis.b4.torgb.affine.bias', 'synthesis.b4.torgb.affine.weight', 'synthesis.b4.torgb.bias', 'synthesis.b4.torgb.weight', 'synthesis.b64.conv0.affine.bias', 'synthesis.b64.conv0.affine.weight', 'synthesis.b64.conv0.bias', 'synthesis.b64.conv0.noise_const', 'synthesis.b64.conv0.noise_strength', 'synthesis.b64.conv0.resample_filter', 'synthesis.b64.conv0.weight', 'synthesis.b64.conv1.affine.bias', 'synthesis.b64.conv1.affine.weight', 'synthesis.b64.conv1.bias', 'synthesis.b64.conv1.noise_const', 
'synthesis.b64.conv1.noise_strength', 'synthesis.b64.conv1.resample_filter', 'synthesis.b64.conv1.weight', 'synthesis.b64.resample_filter', 'synthesis.b64.torgb.affine.bias', 'synthesis.b64.torgb.affine.weight', 'synthesis.b64.torgb.bias', 'synthesis.b64.torgb.weight', 'synthesis.b8.conv0.affine.bias', 'synthesis.b8.conv0.affine.weight', 'synthesis.b8.conv0.bias', 'synthesis.b8.conv0.noise_const', 'synthesis.b8.conv0.noise_strength', 'synthesis.b8.conv0.resample_filter', 'synthesis.b8.conv0.weight', 'synthesis.b8.conv1.affine.bias', 'synthesis.b8.conv1.affine.weight', 'synthesis.b8.conv1.bias', 'synthesis.b8.conv1.noise_const', 'synthesis.b8.conv1.noise_strength', 'synthesis.b8.conv1.resample_filter', 'synthesis.b8.conv1.weight', 'synthesis.b8.resample_filter', 'synthesis.b8.torgb.affine.bias', 'synthesis.b8.torgb.affine.weight', 'synthesis.b8.torgb.bias', 'synthesis.b8.torgb.weight']
tf_var_name=Dense0/weight
dst_var_name=Dense0/weight
    Converting `mapping.fc0.weight` to `mapping.dense0.weight`.
tf_var_name=Dense0/bias
dst_var_name=Dense0/bias
    Converting `mapping.fc0.bias` to `mapping.dense0.bias`.
tf_var_name=Dense1/weight
dst_var_name=Dense1/weight
    Converting `mapping.fc1.weight` to `mapping.dense1.weight`.
tf_var_name=Dense1/bias
dst_var_name=Dense1/bias
    Converting `mapping.fc1.bias` to `mapping.dense1.bias`.
tf_var_name=Dense2/weight
Traceback (most recent call last):
  File "convert_model.py", line 77, in <module>
    main()
  File "convert_model.py", line 66, in main
    convert_stylegan2ada_pth_weight(src_weight_path=args.source_model_path,
  File "C:\Users\kyungk\Projects\genforce\converters\stylegan2ada_pth_converter.py", line 189, in convert_stylegan2ada_pth_weight
    assert tf_var_name in official_tf_to_pth_var_mapping
AssertionError
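
For reference, the mapping layer count can also be read directly from the pickle before attempting any conversion. A minimal sketch using the repo's own legacy loader (run from the stylegan2-ada-pytorch root; the pickle path is a placeholder for the model above):

    # Minimal sketch: check how many mapping layers a stylegan2-ada-pytorch
    # pickle contains, using the repo's own loader. The path is a placeholder.
    import dnnlib
    import legacy

    with dnnlib.util.open_url('trained/female_face_256_3.pkl') as f:
        G = legacy.load_network_pkl(f)['G_ema']

    print('mapping layers:', G.mapping.num_layers)  # 2 here, 8 for the FFHQ source net
    print([n for n, _ in G.named_parameters() if n.startswith('mapping.fc')])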
PDillis commented 2 years ago

If you used the 'auto' config, it uses 2 mapping layers by default. You could see this at the beginning of training when you transferred from FFHQ256: the generated images must have looked 'pinkish' and had some odd expressions, because the code was adapting the 8 mapping layers of the pretrained model to the 2 mapping layers of the new one you trained.
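
For reference, the mapping depth comes from the cfg_specs table in train.py. An abbreviated excerpt, paraphrased from memory (check your local copy for the exact values): 'auto' pins map=2, while 'stylegan2' and the paper configs use map=8.

    # Abbreviated excerpt of cfg_specs in train.py, paraphrased from memory;
    # check your local copy for the exact values. Note map=2 for 'auto'.
    cfg_specs = {
        'auto':      dict(ref_gpus=-1, kimg=25000, mb=-1, mbstd=-1, fmaps=-1, lrate=-1,    gamma=-1, ema=-1, ramp=0.05, map=2),
        'stylegan2': dict(ref_gpus=8,  kimg=25000, mb=32, mbstd=4,  fmaps=1,  lrate=0.002, gamma=10, ema=10, ramp=None, map=8),
    }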

MoemaMike commented 2 years ago

I also noticed today that the 'auto' config seems to be hard-coded to 2 mapping layers. I thought all the 'auto' spec values were configured dynamically, but apparently not for the mapping layers? Wondering why; it seems like a documentation error at least, since I was under the impression that 'auto' was recommended for custom datasets and that all the config values would be determined dynamically if 'auto' was chosen. As I am training custom datasets, I have been using 'auto' exclusively. Now that I have noticed this, I am wondering whether my model could benefit from a higher number of mapping layers. There seems to be no specific override for the map spec value, so I would either have to use one of the non-auto configs or modify a local copy of the code to expose it specifically, is that correct? I can do that if the extra layers would be of benefit. Any guidance on the benefit (or not) of extra mapping layers? FWIW, I am training a custom dataset at 2048x2048 resolution.

If I decide to add mapping layers, should I start training from scratch rather than resume training from a pkl snapshot created with only 2 layers?

MoemaMike commented 2 years ago

Hmm... so I just tried switching to the 'stylegan2' cfg, which specifies 8 mapping layers, but now I get an OOM on Colab Pro+. Not sure whether the OOM is because of the extra layers or some other side effect of switching from 'auto' to the 'stylegan2' config. Sigh.

File "/content/stylegan2-ada-pytorch/training/networks.py", line 168, in forward x = bias_act.bias_act(x, b, act=self.activation, gain=act_gain, clamp=act_clamp) File "/content/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 88, in bias_act return _bias_act_cuda(dim=dim, act=act, alpha=alpha, gain=gain, clamp=clamp).apply(x, b) File "/content/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 153, in forward y = _plugin.bias_act(x, b, _null_tensor, _null_tensor, _null_tensor, 0, dim, spec.cuda_idx, alpha, gain, clamp) RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.78 GiB total capacity; 13.95 GiB already allocated; 214.75 MiB free; 14.12 GiB reserved in total by PyTorch)

MoemaMike commented 2 years ago

FWIW, I modified my local copy so that the 'auto' cfg is set to map=8, and I did not (yet) get the OOM I got with cfg='stylegan2', so the extra memory use seems to be a side effect of some other value in the 'stylegan2' config besides map=8... thankfully.

PDillis commented 2 years ago

cfg='stylegan2' is a bit of a heavy setting, as it also uses mb=32, which is a lot more than the 'auto' setting; the latter most likely gives you mb=2 due to your image size, unless you modified that part. The original StyleGAN paper tested different numbers of mapping layers (Table 4), and their results are generally better with 8, but that's not always the case (also remember this is an academic paper aiming for SOTA results, so the difference is actually negligible for regular users). When I tried with 2 mapping layers, the final model was more 'expressive' than the one with 8 mapping layers, but again, this could have been a fluke.
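
If you want to separate the two effects, train.py exposes a --batch override, so you can keep cfg='stylegan2' (and its 8 mapping layers) while forcing a small minibatch. A hedged example with placeholder paths:

    # Hedged example: keep cfg=stylegan2 but override the minibatch size, to see
    # whether mb=32 (rather than the extra mapping layers) is what triggers the
    # OOM. Output dir and dataset path are placeholders.
    python train.py --outdir=training-runs --data=datasets/custom.zip --gpus=1 \
        --cfg=stylegan2 --batch=4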

I've seen Aydao use 4 mapping layers with 1024 neurons each, which, among other things, gives the final model greater capacity than the vanilla network. You can read a summary of the changes in Gwern's blog. In any case, 2 mapping layers should be fine, as long as you are happy with the final model.

MoemaMike commented 2 years ago

Thanks... I will experiment with different mapping settings. What do you mean by the 2-layer model being more 'expressive'? As in more painterly, less photorealistic, perhaps? Anyway, I will see how they compare on my datasets. As for the 4 mapping layers with 1024 neurons, I assume you mean instead of the 512 shown here? That is, 4 layers of 1024 as opposed to 2-8 layers of 512?

mapping.fc0 262656 - [2, 512] float32
mapping.fc1 262656 - [2, 512] float32

MoemaMike commented 2 years ago

@PDillis As an aside, looking at Aydao's page I see he has a mod called StyleGAN surgery which says it supports non-square images, which is of interest to me. But it seems that must be for TensorFlow, not PyTorch. Are you aware of any stylegan2-ada-pytorch mods that support non-square aspect ratios?

[EDIT] Found this mod, which may support non-square images: https://pythonrepo.com/repo/eps696-stylegan2ada

PDillis commented 2 years ago

Many don't train a rectangular model; what is usually more common is to resize the images to a square, train the model, generate new images, and then resize those back to the non-square format you started with. This way you avoid training a larger model (which is slower), and the resulting images are still quite good. If you still want a rectangular model, Vadim Epstein's repository is a good resource.
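
A minimal sketch of that workflow with Pillow (the filenames, sizes, and aspect ratio below are placeholders, not anything specific to this repo):

    # Minimal Pillow sketch of the squash-train-stretch workflow described above.
    # Filenames and sizes are placeholders.
    from PIL import Image

    def rect_to_square(src, dst, size=1024):
        # Squash a rectangular source image into a square before building the training set.
        Image.open(src).resize((size, size), Image.LANCZOS).save(dst)

    def square_to_rect(src, dst, width=1280, height=768):
        # Stretch a generated square image back to the original aspect ratio.
        Image.open(src).resize((width, height), Image.LANCZOS).save(dst)

    rect_to_square('raw/photo_1280x768.jpg', 'train/photo.png')
    square_to_rect('generated/seed0042.png', 'out/seed0042.png')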

Regarding a more 'expressive' model: remember that GANs tend to learn the 'mean' image of your dataset best. What I meant was that the model with 2 FC layers ended up able to generate more modes, or more different types of images, than the 8 FC one, but again, it might have been a fluke. Your experiments will tell you which model you prefer in the end. If you want to try 4 FC layers with 1024 neurons each, for example, you set it here:

    # In train.py (setup_training_loop_kwargs), where num_layers is normally set to spec.map:
    args.G_kwargs.mapping_kwargs.num_layers = 4
    args.G_kwargs.mapping_kwargs.layer_features = 1024

(You can also modify the latent dimensions of Z and W, the number of layers in W, etc., all of which mapping_kwargs takes; the full list of kwargs can be found here.)
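
As a quick sanity check before a full run, you can construct a Generator directly with the modified mapping_kwargs and inspect it. A hedged sketch (constructor arguments follow training/networks.py in this repo, though names may differ between versions):

    # Hedged sanity check: build a Generator with the modified mapping_kwargs and
    # confirm the mapping network looks as intended before training. Run from the
    # repo root so training.networks is importable.
    import torch
    from training.networks import Generator

    G = Generator(z_dim=512, c_dim=0, w_dim=512, img_resolution=256, img_channels=3,
                  mapping_kwargs=dict(num_layers=4, layer_features=1024))
    print(G.mapping)                              # expect fc0..fc3, hidden width 1024
    w = G.mapping(torch.randn(2, G.z_dim), None)  # c=None is fine when c_dim=0
    print(w.shape)                                # expect [2, G.num_ws, 512]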

MoemaMike commented 2 years ago

very good, thank you Diego

johndpope commented 2 years ago

For rectangular images, see https://github.com/eps696/stylegan2ada: non-square aspect ratio support (auto-picked from the dataset; resolution must be divisible by 2**n, such as 512x256, 1280x768, etc.)