dome272 / Wuerstchen

Official implementation of Würstchen: Efficient Pretraining of Text-to-Image Models
https://arxiv.org/abs/2306.00637
MIT License

Checkpoints missing #8

Open nicolas-dufour opened 1 year ago

nicolas-dufour commented 1 year ago

Hi, the only checkpoint available on Hugging Face is Stage C. Where can we find the other stages?

Thanks

Lime-Cakes commented 1 year ago

The latest commit on Hugging Face is for v3 (slightly different from the paper).

If you download via the Hugging Face API, you'll need to specify a revision. For example, to download the v1 Stage A (VQGAN) checkpoint, use `checkpoint_path = hf_hub_download(repo_id="dome272/wuerstchen", filename="vqgan_f4_v1_500k.pt", revision="c9a8af033966c756941168f2a537595a15e0c1a8")`
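As a slightly fuller sketch of the call above (the repo id, filename, and revision are from this thread; the helper name is illustrative):

```python
from huggingface_hub import hf_hub_download

REPO_ID = "dome272/wuerstchen"
# Commit pinning the pre-v3 (v1) weights, as given above.
V1_REVISION = "c9a8af033966c756941168f2a537595a15e0c1a8"

def download_v1_stage_a() -> str:
    """Download the v1 Stage A (VQGAN) checkpoint and return its local cache path."""
    return hf_hub_download(
        repo_id=REPO_ID,
        filename="vqgan_f4_v1_500k.pt",
        revision=V1_REVISION,
    )

# Usage: checkpoint_path = download_v1_stage_a()
```

Pinning the revision matters here because the repo's default branch has moved on to the v3 weights, so an unpinned download would fetch different files.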

I believe the full v3 will be released soon. It should be preferable, since it uses a different Stage B conditioning that fixes the variable-resolution issue mentioned in the paper, so you may want to stick with the new version once it's released.

nicolas-dufour commented 1 year ago

Oh thanks! Any info on what changed compared to the paper?

Lime-Cakes commented 1 year ago

> Oh thanks! Any info on what changed compared to the paper?

Based on what I heard in the EleutherAI diffusion reading group, the following changes have been made:

- Stage A: a VQGAN is still used, but quantization is removed, making it a pseudo-VAE.
- Stage B: uses an LDM (U-Net) instead of Paella. Conditioning is injected by concatenation instead of cross-attention.
- Other changes: training at different aspect ratios.

As a result, v3 doesn't have the issue with decoding at different resolutions (mentioned in Section 5, Discussion, of the paper).
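The conditioning change can be illustrated with a toy shape check (all shapes here are hypothetical, not the actual Würstchen dimensions): instead of attending over conditioning tokens via cross-attention, the conditioning is brought to the latent resolution and concatenated along the channel axis before entering the U-Net.

```python
import numpy as np

# Hypothetical shapes, for illustration only.
batch, lat_ch, cond_ch, h, w = 1, 4, 16, 32, 32

latent = np.zeros((batch, lat_ch, h, w))  # Stage B latent
cond = np.zeros((batch, cond_ch, h, w))   # conditioning, resized to (h, w)

# Concat-style conditioning: the U-Net simply sees extra input channels.
unet_input = np.concatenate([latent, cond], axis=1)
print(unet_input.shape)  # (1, 20, 32, 32)
```

Because the conditioning travels as ordinary input channels, the U-Net has no resolution-dependent attention pattern tied to it, which is consistent with the variable-resolution fix described above.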

However, I'd highly recommend waiting for the release notes. My notes may be inaccurate or outdated; it's possible I misunderstood certain things or missed important changes.