crowsonkb / k-diffusion

Karras et al. (2022) diffusion models for PyTorch

Confirm JSON config for FFHQ-1024? #103

tin-sely commented 3 months ago

I'm planning on using this config for FFHQ-1024 and just wanted to double-check that it's correct.

(screenshot attached)
{
  "model": {
    "type": "image_transformer_v2",
    "input_channels": 3, 
    "input_size": [1024, 1024],
    "patch_size": [4, 4],
    "depths": [2, 2, 2, 2, 2], 
    "widths": [128, 256, 384, 768, 1024],
    "self_attns": [
      {"type": "shifted-window", "d_head": 64, "window_size": 7}, 
      {"type": "shifted-window", "d_head": 64, "window_size": 7},
      {"type": "shifted-window", "d_head": 64, "window_size": 7},
      {"type": "global", "d_head": 64},
      {"type": "global", "d_head": 64}
    ],
    "loss_config": "karras",
    "loss_weighting": "soft-min-snr", 
    "dropout_rate": [0.0, 0.0, 0.0, 0.0, 0.1], 
    "mapping_dropout_rate": 0.1,
    "augment_prob": 0.12, 
    "sigma_data": 0.5, 
    "sigma_min": 1e-3,
    "sigma_max": 1e3, 
    "sigma_sample_density": {
      "type": "cosine-interpolated" 
    }
  },
  "dataset": {
    "type": "huggingface", 
    "location": "nelorth/oxford-flowers", 
    "image_key": "image" 
  },
  "optimizer": {
    "type": "adamw",
    "lr": 5e-4, 
    "betas": [0.9, 0.95], 
    "eps": 1e-8, 
    "weight_decay": 1e-2 
  },
  "lr_sched": {
    "type": "constant", 
    "warmup": 0.0 
  },
  "ema_sched": {
    "type": "inverse", 
    "power": 0.75, 
    "max_value": 0.9999 
  }
}
stefan-baumann commented 3 months ago

The type for those self-attention blocks should be "neighborhood" unless you actually want to use Swin (shifted-window) attention, and we used a mapping_dropout_rate of 0. Apart from that, the config matches what we used.

And to answer your other two questions: 1) Something else iirc 2) Yes
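
For reference, a minimal sketch of the model fields affected by the correction above, showing only the changed entries. It assumes the neighborhood attention spec takes a "kernel_size" parameter as in the repo's example configs; that name is an assumption and worth verifying against the files in configs/ before training:

{
  "model": {
    "self_attns": [
      {"type": "neighborhood", "d_head": 64, "kernel_size": 7},
      {"type": "neighborhood", "d_head": 64, "kernel_size": 7},
      {"type": "neighborhood", "d_head": 64, "kernel_size": 7},
      {"type": "global", "d_head": 64},
      {"type": "global", "d_head": 64}
    ],
    "mapping_dropout_rate": 0.0
  }
}

The remaining fields would stay as in the posted config; note that the dataset entry above still points at nelorth/oxford-flowers, so it would need to point at an FFHQ-1024 dataset for the intended training run.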