PDillis / stylegan3-fun

Modifications of the official PyTorch implementation of StyleGAN3. Let's easily generate images and videos with StyleGAN2/2-ADA/3!

Bug in conditioning of discriminator? #29

Open DEBIHOOD opened 1 year ago

DEBIHOOD commented 1 year ago

I'm pretty sure I wouldn't get any support in the official SG3 repo, because it looks abandoned; the issues there mostly remain silently unanswered. I noticed that you provide some community support by answering issues for its users, which I think is an important contribution to the StyleGAN community, so props to you. I think this is the only place where this problem might be unraveled, and I thought you could shed some light on it.

Recently I tried to train a conditional model, and I'm super hyped about it: I've been playing with SG for quite a while already, and this is the first time I tried a conditional model. The power that conditioning provides is just super cool to me. Also, it turned out SG supports multiple labels out of the box, which was quite unexpected, and I'm even more hyped to try that out.

The SG/SG2/SG3 papers don't seem to have even a single word about conditioning, but the code supports it. I tried to find something related to it in the papers, but no luck.

Describe the bug

Everything related to the bug is already described here: https://github.com/NVlabs/stylegan3/issues/209

Thanks a lot in advance.

PDillis commented 1 year ago

You're right; sadly, the conditional models haven't really been that well documented in the papers. They're more interested in the unconditional ones and their respective datasets.

Tied to this, to answer your question: all of the tests they've done change the number of layers of the mapping network $f$ in the Generator (G.mapping). The lower value of 2 layers for $f$ comes from the StyleGAN2-ADA paper, page 35:

[image: excerpt from the StyleGAN2-ADA paper, page 35]

The StyleGAN2 paper basically said that everyone before them focused too much on the Discriminator, so that's why they mostly focused on the Generator.

Regarding the default value, this is set in the definition of the MappingNetwork here. When you set --map-depth=2, this is only passed to the Generator's mapping network in train.py here. When defining the Discriminator, a mapping network is only built if a conditional model is being defined here, and it then falls back to the default value of 8 layers.
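To make the mechanics concrete, here is a minimal, self-contained sketch of how the default slips in (the names and the kwargs flow paraphrase train.py and training/networks_stylegan2.py from memory, so treat them as approximate):

```python
# Sketch of the kwargs flow in train.py (paraphrased, not verbatim).
DEFAULT_MAP_DEPTH = 8  # MappingNetwork declares num_layers=8 as its default.

def build_mapping_kwargs(map_depth):
    g_mapping_kwargs = {'num_layers': map_depth}  # --map-depth only reaches G
    d_mapping_kwargs = {}                         # D's mapping_kwargs stays empty
    return g_mapping_kwargs, d_mapping_kwargs

g_kw, d_kw = build_mapping_kwargs(map_depth=2)
print(g_kw.get('num_layers', DEFAULT_MAP_DEPTH))  # -> 2 (Generator mapping)
print(d_kw.get('num_layers', DEFAULT_MAP_DEPTH))  # -> 8 (conditional D mapping)
```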

So, if you want to test out different values for the number of mapping layers in a conditional Discriminator, you need to add a command-line option for it in train.py (such as --map-depth-d or something), as sketched below. I can add it for you later today, but let me know if this answers the question you had.
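A sketch of what that option could look like (the flag name --map-depth-d is hypothetical; c.D_kwargs.mapping_kwargs matches how train.py wires the Discriminator's kwargs, but treat this as an untested sketch):

```python
# Hypothetical addition to train.py, next to the existing --map-depth option:
@click.option('--map-depth-d', help='Mapping network depth of the conditional D  [default: 8]',
              metavar='INT', type=click.IntRange(min=1))

# ...and later in main(), where the other kwargs are assembled:
if opts.map_depth_d is not None:
    # Only takes effect for conditional models, since the Discriminator
    # builds a MappingNetwork only when c_dim > 0.
    c.D_kwargs.mapping_kwargs.num_layers = opts.map_depth_d
```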

DEBIHOOD commented 1 year ago

Thanks, no, huge thanks! I've got it working as it should! I wasn't expecting to get it working, nor to get an answer as to why it doesn't work by default, but here we are: in less than 24 hours it was solved! You've made my day :) Here is a small one-line fix: https://github.com/PDillis/stylegan3-fun/pull/30. I decided the discriminator is better off having the same number of conditional mapping layers as the generator, to make G and D more equal, since numerous GAN findings (like l4rz's scaling-up of SG2) imply that two networks that cannot outperform each other are the key to equilibrium and good results.
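(For anyone reading along without opening the PR: a guess at the essence of such a one-line fix in train.py is making D inherit G's mapping depth, along these lines; check the actual diff for the exact change.)

```python
# Guessed shape of the fix, not the PR's literal diff: give the conditional
# Discriminator's mapping network the same depth as the Generator's.
c.D_kwargs.mapping_kwargs.num_layers = c.G_kwargs.mapping_kwargs.num_layers
```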


You can see that now everything is correct:

Generator            Parameters  Buffers  Output shape      Datatype
---                  ---         ---      ---               ---
mapping.embed        3072        -        [64, 512]         float32
mapping.fc0          524800      -        [64, 512]         float32
mapping.fc1          262656      -        [64, 512]         float32
mapping              -           512      [64, 10, 512]     float32
synthesis.b4.conv1   10529       32       [64, 16, 4, 4]    float32
synthesis.b4.torgb   8259        -        [64, 3, 4, 4]     float32
synthesis.b4:0       256         16       [64, 16, 4, 4]    float32
synthesis.b4:1       -           -        [64, 3, 4, 4]     float32
synthesis.b8.conv0   10529       80       [64, 16, 8, 8]    float32
synthesis.b8.conv1   10529       80       [64, 16, 8, 8]    float32
synthesis.b8.torgb   8259        -        [64, 3, 8, 8]     float32
synthesis.b8:0       -           16       [64, 16, 8, 8]    float32
synthesis.b8:1       -           -        [64, 3, 8, 8]     float32
synthesis.b16.conv0  10529       272      [64, 16, 16, 16]  float32
synthesis.b16.conv1  10529       272      [64, 16, 16, 16]  float32
synthesis.b16.torgb  8259        -        [64, 3, 16, 16]   float32
synthesis.b16:0      -           16       [64, 16, 16, 16]  float32
synthesis.b16:1      -           -        [64, 3, 16, 16]   float32
synthesis.b32.conv0  10529       1040     [64, 16, 32, 32]  float32
synthesis.b32.conv1  10529       1040     [64, 16, 32, 32]  float32
synthesis.b32.torgb  8259        -        [64, 3, 32, 32]   float32
synthesis.b32:0      -           16       [64, 16, 32, 32]  float32
synthesis.b32:1      -           -        [64, 3, 32, 32]   float32
synthesis.b64.conv0  10529       4112     [64, 16, 64, 64]  float32
synthesis.b64.conv1  10529       4112     [64, 16, 64, 64]  float32
synthesis.b64.torgb  8259        -        [64, 3, 64, 64]   float32
synthesis.b64:0      -           16       [64, 16, 64, 64]  float32
synthesis.b64:1      -           -        [64, 3, 64, 64]   float32
---                  ---         ---      ---               ---
Total                926840      11632    -                 -

Discriminator  Parameters  Buffers  Output shape      Datatype
---            ---         ---      ---               ---
b64.fromrgb    64          16       [64, 16, 64, 64]  float32
b64.skip       256         16       [64, 16, 32, 32]  float32
b64.conv0      2320        16       [64, 16, 64, 64]  float32
b64.conv1      2320        16       [64, 16, 32, 32]  float32
b64            -           16       [64, 16, 32, 32]  float32
b32.skip       256         16       [64, 16, 16, 16]  float32
b32.conv0      2320        16       [64, 16, 32, 32]  float32
b32.conv1      2320        16       [64, 16, 16, 16]  float32
b32            -           16       [64, 16, 16, 16]  float32
b16.skip       256         16       [64, 16, 8, 8]    float32
b16.conv0      2320        16       [64, 16, 16, 16]  float32
b16.conv1      2320        16       [64, 16, 8, 8]    float32
b16            -           16       [64, 16, 8, 8]    float32
b8.skip        256         16       [64, 16, 4, 4]    float32
b8.conv0       2320        16       [64, 16, 8, 8]    float32
b8.conv1       2320        16       [64, 16, 4, 4]    float32
b8             -           16       [64, 16, 4, 4]    float32
mapping.embed  96          -        [64, 16]          float32
mapping.fc0    272         -        [64, 16]          float32
mapping.fc1    272         -        [64, 16]          float32
b4.mbstd       -           -        [64, 17, 4, 4]    float32
b4.conv        2464        16       [64, 16, 4, 4]    float32
b4.fc          4112        -        [64, 16]          float32
b4.out         272         -        [64, 16]          float32
b4             -           -        [64, 1]           float32
---            ---         ---      ---               ---
Total          27136       288      -                 -

I added a key argument --cond-D-nofix for backward compatibility, so that models trained before this fix (hence having the wrong number of conditioning mapping layers in the discriminator) can be resumed for more training without any issue. One interesting finding, though: if I initialize without the fix with --map-depth=2, then stop training, apply the fix, and finally resume from the network with 2 mapping layers in Gen. and 8 mapping layers in Disc. (i.e. initialized the old, wrong way), then everything just starts without any error; the number of conditional mapping layers in Disc. decreases, fakes0000.png looks normal, and training for a few more kimg also shows no problem.

But if I go the opposite way, starting from a 2-conditional-mapping-layer Disc. and applying --cond-D-nofix, so that on the next resume the network is forced to spawn 6 new randomly initialized mapping layers in the Discriminator, no errors occur either, but a discoloration (an increase in the saturation of the images) is observable in fakes0000.png.
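(Both behaviors are consistent with how the training loop resumes: misc.copy_params_and_buffers(..., require_all=False) copies tensors by name and ignores mismatches. A simplified stand-in for that logic, assuming this is the resume path being hit:)

```python
import torch

def copy_matching(src: torch.nn.Module, dst: torch.nn.Module):
    # Simplified stand-in for misc.copy_params_and_buffers(require_all=False):
    # copy tensors whose names (and shapes) match; silently skip the rest.
    src_tensors = dict(src.named_parameters())
    with torch.no_grad():
        for name, tensor in dst.named_parameters():
            if name in src_tensors and src_tensors[name].shape == tensor.shape:
                tensor.copy_(src_tensors[name])

# Shrinking D's mapping (8 -> 2): mapping.fc2..fc7 have no destination, so the
# extra layers are simply dropped and training resumes cleanly.
# Growing it (2 -> 8): mapping.fc2..fc7 exist only in the destination, so they
# keep their fresh random init, which plausibly explains the fakes0000.png
# discoloration until those layers adapt.
```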

I think I won't close this issue right away, because I want to do some minor testing so I can share the results here.

Big thanks again for the explanations and your help 👍

PDillis commented 1 year ago

No problem, let me know how the tests go then. I'll ask l4rz and others if they have done any tests setting the number of mapping layers for a conditional Discriminator, but from what I can tell, only the Generator's mapping has been changed, while the Discriminator's has been kept at 8. For example, in StyleGAN2-ADA, they test this with CIFAR-10 (low res, but still a test):

[images: mapping-depth ablation tables from the StyleGAN2-ADA paper]

@pbaylies has a conditional WikiArt model and, as far as I can tell, he set the number of mapping layers in D to the default value of 8, but I'll ask just in case.

Lastly, I do think that adding an argument for setting the number of mapping layers in D (defaulting to 8) is the better and clearer choice. It will let users know what the default value is (many don't touch it), and give them the flexibility to easily change or experiment with it. But this all depends on how your results go, to see if it's worth it.

pbaylies commented 1 year ago

I trained a conditional WikiArt model on StyleGAN2 using the defaults (see https://github.com/pbaylies/stylegan2-ada; I also have other PyTorch checkpoints that never got released, I guess), and @justinpinkney trained an unconditional StyleGAN3 WikiArt model using the defaults (see https://github.com/justinpinkney/awesome-pretrained-stylegan3 for that one and a few others). Note that you should be able to do model surgery and retrain to make a checkpoint conditional, but it probably makes more sense to train from scratch if you can.

Unrelated: there's also a Stable Diffusion WikiArt model derived from the same dataset I used (but with more captions added, etc.), here: https://huggingface.co/valhalla/sd-wikiart-v2

DEBIHOOD commented 1 year ago

I have some results. Sorry that it took quite a bit of time for them to arrive; it takes some mental effort to start training SG on my main (and only) PC when my GPU completely doesn't suit the needs of SG2 training. Also, I initially wanted to do these experiments on a different dataset, some flowers dataset (not Oxford Flowers 102), and I wanted to add more conditional information by specifying the color of the flowers as a binary value. The original dataset didn't have that, so halfway through adding this info manually, I deleted the whole folder by accident. It will be my lesson not to delete folders with SHIFT+DEL 😶‍🌫️. Either way, I think this dataset wouldn't have worked out that great, because, for example, there are no red sunflowers in the dataset, and SG conditioning is very bad at these types of things.

Long story short: the number of fully connected layers of the conditional mapping doesn't matter that much. 2 FC layers in the discriminator don't do a much worse job of fitting the generator than the 8 FC layers that SG2 and SG3 initialize by default.

It didn't affect training times much at all:

- 8 FC layers in Disc, 10000 kimg: 2 days 9 hours 26 minutes
- 2 FC layers in Disc, 10000 kimg: 2 days 10 hours 5 minutes

I suspect the longer time for the network with fewer layers is because I used --snap=10 for the 2-FC-layer run, but I noticed that saving that often is pointless (the final folder weighed 50 GB), so for the next run (8 FC layers in Disc) I used --snap=50.

The difference in memory footprint is also quite minor (numbers taken from log.txt, since it usually differs from what Task Manager reports):

- 8 FC layers in Disc: 1.34-1.35 gpumem
- 2 FC layers in Disc: 1.32-1.33 gpumem

Size of the .pkl files:

- 8 FC layers in Disc: 157 MB
- 2 FC layers in Disc: 151 MB

Oh, and the command-line prompts were:

python train.py --cfg=stylegan2 --cbase=4096 --cond=1 --map-depth=2 --fp32=1 --gamma=0.3 --batch=16 --gpus=1 --metrics=none --snap=10 (2 FC layers in Disc)

python train.py --cfg=stylegan2 --cbase=4096 --cond=1 --map-depth=2 --fp32=1 --gamma=0.3 --batch=16 --gpus=1 --metrics=none --snap=50 --cond-d-nofix=1 (8 FC layers in Disc)

No x-mirroring, because of the nature of the dataset. I force all layers to full precision with --fp32=1 because my GTX 1060 6GB (like all Pascal cards) doesn't have tensor cores, so aside from the reduced memory usage, mixed precision gives a pretty noticeable loss in training performance. No metrics, so as not to lose time on them during training; I compute them at the end.

For the reason I mentioned at the beginning, I needed some other dataset, so I made a quick mix from the leftover flowers dataset, some weapons images collected by someone and gladly shared through torrent, and a collection of coloring-book images that someone scraped from a site that has lots of them, acquired the same way as the weapons dataset. Peek at the reals: [image: reals sample grid]. Yes, it's 64x64, but with a 1060 it's pretty much the luxury :)

Enough talk, let's finally look at the FAKES!

2 FC layers in Discriminator, 10000 kimg: [image: fakes010000-2FCdisc]
8 FC layers in Discriminator, 10000 kimg: [image: fakes010000-8FCdisc]

Ouch! Flowers: fine. Weapons: SAD. Coloring book: ... Quite expected considering how small this network is; --cbase=4096 is doing its thing. The other way around, it would have taken 10 eternities to finish, but I think it's good enough for the test. Of course, a bigger batch and a much bigger config could take it miles away from where it is currently.

I've also computed the fid50k metric on these models, because just looking at the fakes doesn't give enough of a clue about how the models differ:

- 8 FC layers in Disc: 12.80 fid50k_full
- 2 FC layers in Disc: 12.94 fid50k_full

As another way of testing how well the models did, I came up with the idea of averaging lots of images (512 in this case) of reals and fakes separately, and comparing them visually (see the sketch below): [images: flowers, weapons, coloringbook]. 8 FC layers always seems to be quite a bit smoother and less aliased, for every class.
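For anyone who wants to reproduce that averaging comparison, a minimal sketch (folder layout and file names here are made up; it assumes same-size RGB images dumped per class):

```python
import numpy as np
from pathlib import Path
from PIL import Image

def average_images(folder, limit=512):
    # Accumulate in float64 to avoid uint8 overflow, then take the mean.
    paths = sorted(Path(folder).glob('*.png'))[:limit]
    acc = None
    for p in paths:
        img = np.asarray(Image.open(p).convert('RGB'), dtype=np.float64)
        acc = img if acc is None else acc + img
    return Image.fromarray((acc / len(paths)).astype(np.uint8))

# Hypothetical paths: ~512 reals and ~512 fakes per class, compared side by side.
average_images('reals/flowers').save('avg_reals_flowers.png')
average_images('fakes/flowers').save('avg_fakes_flowers.png')
```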

The Generator had 2 mapping layers in both of these runs, so the discriminator either follows the generator's number of FC layers (with the fix) or uses the default 8 (with --cond-d-nofix=1). One important outcome this experiment has shown is that 8 FC layers in the discriminator doesn't break the model or constrain it in any way; it works fine with 2 FC layers in GEN and 8 FC layers in DISC. I can assume it did a slightly better job because it reduced the strain on the convolution layers in DISC, with most of the classification work carried by the FC layers, which might be better for this kind of thing, so the model fitted a bit better.

About letting the user choose the number of FC layers in Disc: since the difference between 8 and 2 turned out not to be that big, I can't think of any situation where the user might want something other than the default 8 or just following the number of FC layers in GEN. But it's totally up to you if you want this to be an option the user can control; I just think it might add more confusion, because of how stylegan reacts to what the model originally had versus the setting it was resumed with, as I pointed out in the previous message.

Also, in the process I noticed a bug: using gen_images.py and visualizer.py on conditional models with truncation != 1.0 breaks the results, making it look like the output is something averaged between all classes. But that's totally unrelated to this, just a fun finding.
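My guess at why (worth verifying against the code): truncation lerps each w toward the mapping network's single, class-agnostic running average w_avg, so with psi < 1 every class gets pulled toward one global mean, which would look exactly like something averaged between all classes. A known workaround is truncating toward a per-class centroid instead; here is an untested sketch assuming the usual G.mapping/G.synthesis interfaces:

```python
import torch

@torch.no_grad()
def class_centroid(G, class_idx, n=10_000, device='cuda'):
    # Estimate a per-class average w by mapping many z's under a fixed label.
    z = torch.randn(n, G.z_dim, device=device)
    c = torch.zeros(n, G.c_dim, device=device)
    c[:, class_idx] = 1
    return G.mapping(z, c).mean(dim=0, keepdim=True)  # [1, num_ws, w_dim]

@torch.no_grad()
def generate_truncated(G, z, c, w_centroid, psi=0.7):
    # Truncate toward the class centroid instead of the global G.mapping.w_avg.
    w = G.mapping(z, c)          # [N, num_ws, w_dim]
    w = w_centroid.lerp(w, psi)  # centroid + psi * (w - centroid)
    return G.synthesis(w)
```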

PDillis commented 1 year ago

Thanks a lot for your detailed response @DEBIHOOD! Since you have already trained two networks with different numbers of mapping layers, would it be possible for you to plot the spectral analysis (by running avg_spectra.py), or to use e.g. the FFT available in visualizer.py, as a means of comparison?

Given that the source of frequency bias comes from the Discriminator, it'd be interesting to see how the number of mapping layers affects this.
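In case it's useful, the workflow from the official stylegan3 README is roughly the following (the paths, the .npz names, and the --mean/--std values are placeholders; the stats step prints the values to plug into calc, so double-check the flags against the script):

python avg_spectra.py stats --source=datasets/mydataset.zip
python avg_spectra.py calc --source=datasets/mydataset.zip --dest=tmp/training-data.npz --mean=... --std=...
python avg_spectra.py calc --network=network-snapshot.pkl --dest=tmp/fakes.npz --mean=... --std=...
python avg_spectra.py heatmap tmp/training-data.npz
python avg_spectra.py slices tmp/training-data.npz tmp/fakes.npz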

DEBIHOOD commented 1 year ago

Sure, here it is.

[image: dataset heatmap]

[image: 2fc_disc heatmap]

[image: 8fc_disc heatmap]

[image: slices]

As you can see, there is some strange spike at the end of the 8fc_disc graph.

I didn't really read these papers, and I don't exactly remember what they were up to with this spectral analysis in the StyleGAN3 paper, so I can't really say anything about these results 🥴 I'd probably need to peek into it a bit.

EDIT

Here are the command prompts that were used, just in case: [image: prompt history]. Don't pay too much attention to what's happening in the last few prompts; I was just experimenting with the other 8fc_disc model that I accidentally stopped at 10200 kimg instead of 10000 kimg. I didn't use this 10200-kimg model in any of the tests, for consistency with the other 2fc_disc model. I tried to give it a shot here, because I thought maybe the spike was just some fluctuation.

[image: 8fc_disc_plus200kimg heatmap]

[images: slices+, slices2]