NVlabs / stylegan3

Official PyTorch implementation of StyleGAN3

Can we get more useful pretrained models? #42

Closed WyattAutomation closed 2 years ago

WyattAutomation commented 2 years ago

I'm all out of spare $40k ML rigs and colab pro is going to take months to train this. A lot of us simply don't have the money to train this from scratch.

With the chip shortage making this particularly tough, is it out of the question that we will see something other than just pretrained face models? The only way for us plebs to really use any of this code right now is to transfer learn onto an existing model, or wait on a handful of benevolent randos to share one; retraining from scratch isn't viable without an unreasonably expensive piece of hardware (at the moment, at least).

A full body model trained on humanoid characters or photos, a landscape model, a generic art model--any of those would be more creatively useful than cat pic and portrait models. I really want to dig into StyleGAN3 and explore what it can do, but I'm struggling to transfer learn anything other than faces onto it. Portrait models aren't working for transferring anything other than other portraits (expected, but I at least wanted to try).

I've only recently gotten into StyleGAN; maybe I'll be able to use it in a couple of years..

WyattAutomation commented 2 years ago

Wanted to follow up and show a little effort here on my part so I don't look like I'm just whining--sorry for being complacent, this work is incredible and I appreciate that it was open sourced at all, so above all else thank you for that.

Anyway:

I went ahead and began an attempt at fine-tuning the stylegan3-r-metfacesu-1024x1024 model on Colab with a single P100 (16 GB VRAM). I prepared a small sample of Gwern's cropped "Figures" dataset, about 20k images, resized to 1024x1024 while keeping the aspect ratio and padding with a black background to prevent distortion of the figures.
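For reference, the per-image padding step was roughly like this (just a sketch using Pillow; the function name and paths are mine, not part of the repo):

from PIL import Image

def pad_to_square(src_path, dst_path, size=1024):
    # Fit the image inside size x size while keeping its aspect ratio,
    # then paste it centered onto a black square canvas.
    img = Image.open(src_path).convert("RGB")
    img.thumbnail((size, size), Image.LANCZOS)
    canvas = Image.new("RGB", (size, size), (0, 0, 0))   # black background
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    canvas.save(dst_path)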

I added the processed dataset to a zip archive, and it is now being used for training.

Here is the command I used for training:

!python train.py --outdir=./results --cfg=stylegan3-r --data=./datasets/squared-1024-20k.zip \
--gpus=1 --batch=32 --batch-gpu=4 --gamma=32.0 --kimg=5000 --snap=1 --metrics=None --dlr=0.0025 \
--resume=/content/drive/MyDrive/colab-sg3-FineTune/stylegan3/results/00001-stylegan3-r-squared-1024-20k-gpus1-batch32-gamma32/network-snapshot-000052.pkl

These are the initial fakes, the model loads and generates images that look as expected:

[image: fakes_init]

Unfortunately, after several hours the faces from the portrait model did not take on properties of the smaller faces in the dataset (also expected). They just morph over time into noisy hints at character bodies--this is after 16 and 24 kimg respectively:

[image: fakes000016]

[image: fakes000024]

This is the latest result at 56 kimg, after resuming once the Colab session closed at the 24+ hour mark. There is some resemblance to the input dataset forming, and I am going to let it keep running, but I am worried this isn't going to improve since it never appeared to really "make use" of any fundamental features (specifically, faces) from the portraits the original model was generating:

[image: fakes000004]

I hope I am wrong in assuming the dataset is too different from the pretrained model, and that the fine-tuning yields acceptably impressive results after a couple of weeks, but it's going to take another 10 days or so to determine that--failure is not fast here, so it is tough to have to start over after weeks or more. I have another model that I am attempting to train from scratch on 80k images at 512x512, but it is not generating anything resembling the dataset yet.

I am also aware that I probably need to clean up the dataset; perhaps semi-cropping to leave just the upper half of the character body and make the input more homogeneous would be where to start, but if that fails then it's another burnt couple of weeks.
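Something like this is what I mean by semi-cropping (purely illustrative):

from PIL import Image

def crop_upper_half(src_path, dst_path):
    # Keep only the top half of the figure crop to bias the dataset
    # toward upper-body shots.
    img = Image.open(src_path)
    img.crop((0, 0, img.width, img.height // 2)).save(dst_path)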

Anyway, thanks for your work. I am going to look into Lambda Labs cloud or some other service, but it seems like transfer learning would be a better option if we had more diverse or generic models.

Citing Gwern's danbooru2019Figures dataset here:

@misc{danbooru2019Figures,
  author       = {Gwern Branwen and Anonymous and Danbooru Community},
  title        = {Danbooru2019: A Large-Scale Anime Character Illustration Dataset},
  howpublished = {\url{https://www.gwern.net/Crops#figures}},
  url          = {https://www.gwern.net/Crops#figures},
  type         = {dataset},
  year         = {2020},
  month        = {May},
  timestamp    = {2020-05-31},
  note         = {Accessed: DATE}
}

WyattAutomation commented 2 years ago

It's only at 80 kimg total, and I'm not sure when a reasonable fine-tuning run should stop, but... maybe I spoke too soon?

I certainly don't have stellar results thus far but it keeps improving. I'll go ahead and close this issue and follow up if/when it turns out the pretrained models work just fine for a broader range of domains than I thought...

[image: fakes000028]

XuZhengzhuo commented 2 years ago

If we fine-tune StyleGAN2 with limited images (~300), about 50 kimg is enough. However, StyleGAN3-R seems not to have converged after training 60 kimg, and no satisfactory results are obtained even at 100 kimg. What is the proper kimg for a limited fine-tune dataset?

WyattAutomation commented 2 years ago

I wanted to follow up here and share some decent low/no-budget results that I was able to dial in with training StyleGAN3, good enough for what I needed them to do--reliably generate "realistic-enough" low-res images of full-body humans that other low-cost models for super-resolution, pose transfer, style transfer, and 3D reconstruction can easily latch onto for further enhancement.

It took less than 2 days of fine-tuning the ffhqu-256 model to get these results, with a single P100 GPU on Colab Pro and a dataset of about 23k images (then ~50k in a later training). Two things matter:

- Homogenize and clean the living daylights out of your dataset if you want specific results fast and cheap. This model seems better suited to a purpose-driven pipeline than to open-ended creativity, so come up with a situation where having "a generator of clean images of object x" is useful.

- Try to limit the amount of guesswork in calculating the right gamma value for your model during training. I probably just got lucky on this, but I set my gamma to 3.28, very loosely based on some details I found in the SG2-ADA paper, and that seemed to fix most of the issues I was having. The model that produced the results here failed after only ~40 kimg at gamma=2, gamma=1, and gamma=8, but took flight indefinitely at gamma=3.28 and has only improved since restarting at that setting--i.e.:

"R1 regularization Karras et al. postulated that the best choice for the R1 regularization weightγ is highly dependent on the dataset. (In Figure 24), considering γ∈{0.001,0.002,0.005,...,20,50,100}. Although the optimal γ does vary wildly, from 0.01 to 10, it seems to scale almost linearly with the resolution of the dataset. In practice, we have found that a good initial guess is given by γ0=0.0002·N/M, where N=w×h is the number of pixels and M is the minibatch size. Nevertheless, the optimal value of γ tends to vary depending on the dataset, so we recommend experimenting with different values in the range γ∈[γ0/5,γ0·5]"

..anyway, below is a description of the dataset I prepped, and one process that actually did appear to provide a worthwhile return on the time invested in training this finicky unicorn (copied from another comment after I finally dialed in a reliable sweet spot):

... I took about a week putting together the dataset for the training that generated these results: u2net/rembg to remove the background of the highest-confidence bounding-box-cropped 'person' YOLOv4 detection in each image from a collection of fashion datasets I put together, with each cutout alpha-blended and scaled, keeping aspect ratio, onto a 256x256 white background, so that about 95% of the images have a single full-body person in the center of a square white image (N=~50k). The dataset still needs to be manually cleaned of remaining images with just shirts/clothes etc. in them, or ones with bad alpha blending/segmasking, but there were maybe 200-500 bad images in the whole dataset (and it still seems to generate them a disproportionate amount? idk..)
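For anyone curious, the per-image step was roughly along these lines (just a sketch: the bounding box is assumed to come from whatever YOLOv4 wrapper you use, and rembg is the u2net-based background remover):

from PIL import Image
from rembg import remove  # u2net-based background removal

def figure_on_white_square(img, person_box, size=256):
    # Crop the detected person, strip the background, and alpha-blend the
    # cutout centered onto a white size x size canvas.
    crop = img.crop(person_box)               # person_box from your YOLOv4 detection
    cutout = remove(crop)                     # RGBA cutout with background removed
    cutout.thumbnail((size, size), Image.LANCZOS)
    canvas = Image.new("RGB", (size, size), (255, 255, 255))
    offset = ((size - cutout.width) // 2, (size - cutout.height) // 2)
    canvas.paste(cutout, offset, mask=cutout.split()[-1])   # alpha channel as mask
    return canvas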

The 256x256 models are quite fast, like an order of magnitude faster to transfer learn than 512, but the results are meh. I have this dataset in 512 and 1024 too (most original image resolutions were much larger than 1024, so the 512 and 256 versions are just downscales from the 1024), but I have been working on another experiment attempting to train on multiple domains in SG2-ADA PyTorch (southern hip hop + gangster rap album art with the parental advisory logo, + anime combat scenes with firearms, + hand-picked images from Gwern's Figures Crops dataset, N=~60k). I want to take what I learn from the multi-domain experiment and apply it to training one of the later StyleGANs on depth images, then use something like pixel2style2pixel to train again for translation into depth-registered RGB image generation, so the logical place to start was getting a full-body human generator working. If I had the GPU I'd give the 1024 faces model from SG3 a whirl on the dataset I created, but it was such a boondoggle--weeks of training to get unusable results that promptly exploded. Not fun lol:

Anyway, below are an interpolation and a mosaic after fine-tuning the SG3 ffhqu-256 model for an evening on Colab Pro (about 7 hours). It looks okay from a distance, and setting the trunc to .6 makes it much more bearable. (The interpolation video was from an earlier attempt, only after around 90-100 kimg on the dataset I prepped, before I cleaned it a lot better and added more images; but the trunc 0.5 in that video made it at least bearable, so that's an OK sign I suppose.) The image mosaic is at 420 or so kimg with trunc at 1.0 (or whatever the default is for the progress images). I followed some basic arithmetic from the paper on calculating gamma and set it to 3.2, and that appears to have been the magic sauce for this model--it kept it from exploding after 100 kimg or so (which it did about 4 times before it remained stable up through 400 kimg). I'm considering just getting this model as good as it can get and then using an upscaler on the outputs for generating 2D-3D skeletons or 2D-3D 'neural texture' rendering etc.:

Interpolation at about 100 kimg per the previous description, trunc at .6:

[video: 380kimg]

Another run with a cleaner/larger dataset, up to about 400 kimg. The results would probably look much better if I did another output at a lower trunc (might try that and post a result here). I think the default is 1 for the progress images during training for SG(?), which is what this is (not sure if that trunc default is correct, but it's certainly higher than .5 in the progress images, so it's likely this one would look way better when truncated properly).

[image: fakes000384]
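If I do re-render these at a lower truncation, it'd be something like the below, using the repo's gen_images.py (the snapshot path is just a placeholder):

!python gen_images.py --network=<path-to-network-snapshot.pkl> --seeds=0-63 --trunc=0.6 --outdir=out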

....

And the latest, after about 600 kimg, is below. I'll share some of the 3D reconstruction results soon, as those actually turned out a good bit better and show the viability of this model as a baseline generator for a bigger pipeline:

[image: fakes000432]

WyattAutomation commented 2 years ago

The particularly bad failures in the last set of results are mostly due to not tightening the confidence threshold for YOLO's detection of the largest "person" in an image enough to avoid detecting clothing as a 'person', but also to not manually cleaning up U2NET masking failures that regularly occurred during background removal.
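Something along the lines of this filter on the detections would have helped (illustrative only; the detection tuples are whatever your YOLOv4 wrapper returns):

def best_person_box(detections, min_conf=0.8):
    # detections: iterable of (class_name, confidence, box) tuples.
    # Keep only confident 'person' hits and return the highest-confidence box.
    persons = [d for d in detections if d[0] == "person" and d[1] >= min_conf]
    return max(persons, key=lambda d: d[1])[2] if persons else None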

One of the weirder phenomena was that for a good while it started generating lizard/frog heads on the erroneous outputs. I made a point of saving several checkpoints at these results; should be fun to turn them into 3D assets...

..No lizards in the training set, no lizards in the model used for fine-tuning, but lizards snuck into the output somehow. Seems to be the best I can get out of this thing, but I suppose it could be worse?..

[image: img_1_1641279173582]

OO-ooo-OO commented 2 years ago

Omg. Late at night. Trying to find a decent GAN myself. Skimming through this. I laughed my ass off hard at the last photo. As if my last couple of hours was building up to this punchline.

PS: I don't know if this whole post is a joke. But just in case it wasn't: I assume you noticed the pretrained GAN you used turned your photos into anime girls, 'cause that's what it was trained to do.

PPS: Even after hours of trying to find a simple way to execute a GAN (preferably in GitHub itself), I still don't know how. This GAN / Python / code project stuff is hard to understand, especially if you're hesitant to run it on your own computer.

nuclearsugar commented 2 years ago

Thanks for sharing your experience regarding gamma values. I've found it very difficult to find much detail on this vital topic, especially when using smaller datasets (500 to 2000 images).

For posterity, here are some of my own findings when fine-tuning using the FFHQ-512 model as a starting point:

The StyleGAN3-Fun fork is very useful since it allows for both MirrorX and MirrorY, which (if your dataset allows for it) will expand the number of images used for training. The fork also adds resume-kimg support and a bunch of other interesting tools: https://github.com/PDillis/stylegan3-fun

I was exploring the various options of StyleGAN3 and researching what exactly FreezeD does, which led me to stumble across this paper explaining that you can freeze the lower layers of the discriminator to improve fine-tuning and help avoid mode collapse on small datasets. I hadn't seen anyone really talking about this online, so I tried it out, freezing the first 4 layers: https://arxiv.org/abs/2002.10964

And the results were surprising: it definitely helped with overfitting, along with the huge bonus that it cut the training time in half, which makes sense since only 4 of the 8 discriminator layers are being retrained. So better results, less self-repetition, and training in half the time... Hell yeah!
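For anyone who wants to try the same thing: if I remember right, the stock train.py (and the StyleGAN3-Fun fork) exposes this as --freezed, so a fine-tune run would look roughly like the below (everything in angle brackets is a placeholder for your own paths/values):

!python train.py --outdir=./results --cfg=stylegan3-t --data=<your-dataset.zip> \
--gpus=1 --batch=32 --gamma=<your-gamma> --freezed=4 --resume=<ffhq-512-snapshot.pkl>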