CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

[Reproduction issue] Semantic image synthesis and layout-to-image cannot be reproduced #120

Open gene-rative opened 1 year ago

gene-rative commented 1 year ago

Can you provide inference scripts for semantic image synthesis and layout-to-image synthesis? I tried to use data loaders from the taming-transformers repo but got random noise outputs. The evaluation results are far from those reported in the paper. Thanks!

wangqiang9 commented 1 year ago

same question

shunk031 commented 1 year ago

Thank you very much for publishing your excellent research results. I am also interested in reproducing the layout-to-image model as well. Is there any reproduction code available? Thank you in advance for your consideration.

ZGCTroy commented 1 year ago

Also waiting for the release of the pretrained layout-to-image model trained from scratch on COCO and the dataset code. Thanks!!

Feanor007 commented 1 year ago

Also waiting for the semantic synthesis training pipeline

otamic commented 1 year ago

Hi,

I managed to train the semantic image synthesis model. I first collected the flickr data according to the README of the taming-transformers repo and used sflckr.py as the training dataset.

Then I wrote my yaml config file based on the provided one:

model:
  base_learning_rate: 1.0e-06
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.0015
    linear_end: 0.0205
    log_every_t: 100
    timesteps: 1000
    loss_type: l1
    first_stage_key: image
    cond_stage_key: segmentation
    image_size: 64
    channels: 3
    concat_mode: true
    cond_stage_trainable: true

    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 10000 ]
        cycle_lengths: [ 10000000000000 ]
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 64
        in_channels: 6
        out_channels: 3
        model_channels: 128
        attention_resolutions:
        - 32
        - 16
        - 8
        num_res_blocks: 2
        channel_mult:
        - 1
        - 4
        - 8
        num_heads: 8

    first_stage_config:
      target: ldm.models.autoencoder.VQModelInterface
      params:
        embed_dim: 3
        n_embed: 8192
        ckpt_path: models/first_stage_models/vq-f4/model.ckpt
        ddconfig:
          double_z: false
          z_channels: 3
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.SpatialRescaler
      params:
        n_stages: 2
        in_channels: 182
        out_channels: 3

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 12
    num_workers: 5
    wrap: False
    train:
      target: ldm.data.flickr.FlickrSegTrain  #  PUT YOUR DATASET 
      params:
        size: 256
    validation:
      target: ldm.data.flickr.FlickrSegEval  #  PUT YOUR DATASET 
      params:
        size: 256

lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 5000
        max_images: 8
        increase_log_steps: False

  trainer:
    benchmark: True

And last, I ran python main.py --base <config_above>.yaml -t --gpus 0, to train the model.

It did work. Here is a result coming from my training process:

[images: original_conditioning_gs-045000_e-000082_b-000044 (conditions), samples_gs-045000_e-000082_b-000044 (samples)]

By the way, I noticed that the released config yaml file doesn't load a checkpoint in the first stage config:

first_stage_config:
  target: ldm.models.autoencoder.VQModelInterface
  params:
    embed_dim: 3
    n_embed: 8192
    ckpt_path: models/first_stage_models/vq-f4/model.ckpt  # this line is missing 
    ddconfig:
      double_z: false

I wonder whether this is the reason inference fails.

YorkNishi999 commented 1 year ago

@otamic I saw your fantastic results. I am struggling with how to run inference (testing) with the pretrained model to generate landscape images from segmentation maps. Could you share your inference (test) code, if possible?

otamic commented 1 year ago

@YorkNishi999

This is my inference code, which mostly comes from log_images in ddpm:

import torch
import numpy as np

from scripts.sample_diffusion import load_model
from omegaconf import OmegaConf
from torch.utils.data import Dataset, DataLoader
from torchvision.utils import save_image
from einops import rearrange

from ldm.data.flickr import FlickrSegEval

def ldm_cond_sample(config_path, ckpt_path, dataset, batch_size):
    config = OmegaConf.load(config_path)
    model, _ = load_model(config, ckpt_path, None, None)

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    x = next(iter(dataloader))

    seg = x['segmentation']

    with torch.no_grad():
        seg = rearrange(seg, 'b h w c -> b c h w')
        condition = model.to_rgb(seg)

        seg = seg.to('cuda').float()
        seg = model.get_learned_conditioning(seg)

        samples, _ = model.sample_log(cond=seg, batch_size=batch_size, ddim=True,
                                      ddim_steps=200, eta=1.)

        samples = model.decode_first_stage(samples)

    save_image(condition, 'cond.png')
    save_image(samples, 'sample.png')

if __name__ == '__main__':

    config_path = 'models\ldm\semantic_synthesis256\config.yaml'
    ckpt_path = 'models\ldm\semantic_synthesis256\model.ckpt'

    dataset = FlickrSegEval(size=256)

    ldm_cond_sample(config_path, ckpt_path, dataset, 4)

Note that one line is missing in the config file, as I described above.

I simply picked some segmentations from the dataset to generate images; you may want to make changes to suit your needs.
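
If you want to feed your own masks instead, here is a rough, untested sketch of how the 'segmentation' tensor could be built from a single-channel label-map PNG (the file name and helper function are placeholders; it assumes integer class ids in [0, 182), matching the in_channels of the config above):

import numpy as np
import torch
from PIL import Image

def load_onehot_segmentation(path, n_labels=182, size=256):
    # nearest-neighbour resize keeps the integer class ids intact
    mask = Image.open(path).resize((size, size), Image.NEAREST)
    labels = torch.from_numpy(np.array(mask)).long()            # (H, W) class ids
    onehot = torch.nn.functional.one_hot(labels, n_labels)      # (H, W, 182)
    return onehot.unsqueeze(0).float()                          # (1, H, W, 182), like x['segmentation']

seg = load_onehot_segmentation('my_mask.png')                   # placeholder path
# then continue as in the script above: rearrange to 'b c h w', to_rgb, get_learned_conditioning, sample_log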

YorkNishi999 commented 1 year ago

@otamic I am very grateful that you shared your code!

I used your code and generated images, but they are low quality. Just to make sure: you first train the model starting from ckpt_path = 'models\ldm\semantic_synthesis256\model.ckpt', and then run inference to generate images from the semantic maps. Am I correct?

My generated image is here:

[generated image]

otamic commented 1 year ago

@YorkNishi999

In fact, models\ldm\semantic_synthesis256\model.ckpt refers to the pretrained model downloaded from Pretrained LDMs when I wrote this code.

To test your own trained model, just change the path to something like logs/xxxx/checkpoints/last.ckpt after a training process. (So you are right.)

This is a result tested with the downloaded model:

[condition and sample images from the downloaded model]

And my trained model:

[condition and sample images from my trained model]

It works fine here. So I wonder whether you just haven't trained your model long enough.

JeanJulesBigeard commented 1 year ago

Works perfectly on my side, thanks @otamic !

YorkNishi999 commented 1 year ago

@otamic Thank you for sharing your experiments!! I will retry it with some training..

YorkNishi999 commented 1 year ago

@otamic I got good results after tracking down my bugs (it was my fault).

Thank you again for your kindness!

[generated image]

SerdarHelli commented 1 year ago

@otamic Wow, that's nice. Can you share your dataloader code? I want to be sure about something. I will write my own :D

otamic commented 1 year ago

@SerdarHelli

I think you mean the dataset class in the config file:

data:
  ...
  params:
    ...
    train:
      target: ldm.data.flickr.FlickrSegTrain  #  PUT YOUR DATASET 
      ...
    validation:
      target: ldm.data.flickr.FlickrSegEval  #  PUT YOUR DATASET 
      ...

If so, I used the code from sflckr.py as described above. There is an Examples class in the script:

class Examples(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv="data/sflckr_examples.txt",
                         data_root="data/sflckr_images",
                         segmentation_root="data/sflckr_segmentations",
                         size=size, random_crop=random_crop, interpolation=interpolation)

And I added my own dataset classes, referring to my own data (collected according to this), like so:

class FlickrSegTrain(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv='data/flickr/flickr_train.txt',
                         data_root='data/flickr/flickr_images',
                         segmentation_root='data/flickr/flickr_segmentations',
                         size=size, random_crop=random_crop, interpolation=interpolation)

class FlickrSegEval(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv='data/flickr/flickr_eval.txt',
                         data_root='data/flickr/flickr_images',
                         segmentation_root='data/flickr/flickr_segmentations',
                         size=size, random_crop=random_crop, interpolation=interpolation)

That's all I did. (They are only very small changes, so I didn't post them earlier.)

At this point, I believe I have written down everything needed to reproduce the semantic synthesis result.

SerdarHelli commented 1 year ago

Yes, thanks @otamic. https://github.com/CompVis/taming-transformers/blob/master/taming/data/sflckr.py is actually what I was looking for. I knew they had written it, but I hadn't checked it out :D

mmash98 commented 1 year ago

@otamic I have trained semantic synthesis 256 on Cityscapes with the same config you shared, but I am getting this image (sample attached) as a result. Do you have any idea why this happens?

SerdarHelli commented 1 year ago


I think you should check your config. For example, is your conditioning stage input channel count 182? How many labels does your dataset have?

SerdarHelli commented 1 year ago


@mmash98 replied: "@SerdarHelli I have changed it as well; in my case it is 35."

I see. Did you check your batches? I don't know; maybe you didn't train enough.

otamic commented 1 year ago

@mmash98

Could you try a smaller batch size, such as 4? If that doesn't help, I have no other ideas.

SerdarHelli commented 1 year ago


I think he just didn't train enough. At 5k steps, I am getting the same results.

SerdarHelli commented 1 year ago

Guys, in addition, should we train our own VQGAN? I think we should train the VQGAN on our own data if our domain is very different.

Edit: I am getting worse results with LDM + VQ-f4 than with a GAN for semantic image synthesis. Probably I should train more, or my data is too limited for LDM. Maybe LDM is not good on limited data.

Also, you can train on Colab. I can share the code.

Kai-0515 commented 1 year ago

@otamic Hey, may I ask a question? I followed your yaml and inference.py to train on DeepFashion, whose semantic maps have 24 categories, and I changed 182 to 24. But my results are strange, as shown below. Is there anything else to pay attention to, or did I do something wrong? Looking forward to your reply, thanks so much!

[screenshot of the strange results]
otamic commented 1 year ago

@Kai-0515

I think you didn't successfully load the pretrained first stage model. Check that the missing line I mentioned has been added, and make sure the ckpt file actually exists. I had similar results myself, which is how I found the missing line.
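
If you want a quick way to confirm the vq-f4 weights are actually loaded, an untested sketch (reusing model and dataloader as in my inference script above) is to round-trip one batch through the first stage only; if the checkpoint was not loaded, the reconstruction will look like noise:

import torch
from einops import rearrange
from torchvision.utils import save_image

x = next(iter(dataloader))['image']                        # (B, H, W, 3), values in [-1, 1]
x = rearrange(x, 'b h w c -> b c h w').to('cuda').float()
with torch.no_grad():
    z = model.encode_first_stage(x)      # encode with the vq-f4 autoencoder only
    rec = model.decode_first_stage(z)    # and decode straight back, no diffusion involved
save_image((rec + 1.0) / 2.0, 'first_stage_rec.png')       # map [-1, 1] to [0, 1] for saving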

Kai-0515 commented 1 year ago

@otamic You're right! Thanks very much for your quick reply!

SerdarHelli commented 1 year ago

[attached image]

Unlike the GAN methods, the condition is converted to an RGB image in LDM. So your categories must be correct, otherwise you will feed the model the wrong condition. Also, make sure about the autoencoder (VQ or KL).
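
For example, a tiny untested check (paths are placeholders) to make sure the class ids in your masks stay below the in_channels you set for the SpatialRescaler:

import glob
import numpy as np
from PIL import Image

# the highest class id over all masks must be < cond_stage_config.params.in_channels
max_id = max(np.array(Image.open(p)).max() for p in glob.glob('data/my_segmentations/*.png'))
print('highest class id:', max_id)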

mauerflitzer commented 1 year ago

Has anybody trained a model for the layout2image task yet? I'm not quite sure what my bounding box input is supposed to look like, and what a proper configuration would be. Thank you so much for any input. I know the layout2img-openimages256 config exists, but I'm not sure what the input is supposed to be.

mauerflitzer commented 1 year ago

@otamic do I understand it correctly that you train everything from scratch, the whole model except for the vq-f4? Is it also possible to skip training the unet and vae and only train the conditioning part?

otamic commented 1 year ago

@mauerflitzer

You are correct about my training. In my opinion, training only the conditioning part is not possible with LDM. First, how would you supervise that training? Second, the UNet structures of the conditional and unconditional models are different: in this case, the number of channels at the UNet input is doubled when the condition is concatenated. What you describe sounds more like classifier-guided diffusion, which is a different way of conditioning.
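
As a toy illustration (not code from the repo) of why the semantic-synthesis UNet above has in_channels: 6 while an unconditional one has 3:

import torch

z_t = torch.randn(1, 3, 64, 64)    # noisy vq-f4 latent of the image
c   = torch.randn(1, 3, 64, 64)    # SpatialRescaler output: segmentation mapped to 3 x 64 x 64
unet_in = torch.cat([z_t, c], dim=1)
print(unet_in.shape)               # torch.Size([1, 6, 64, 64]) -> matches in_channels: 6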

mauerflitzer commented 1 year ago

@otamic I thought about freezing the UNet and VAE weights, taking a released checkpoint of 1.4 or maybe 1.5, swapping out the conditioning part for a new one, and then training just that.

otamic commented 1 year ago

@mauerflitzer

Sorry, I don't understand what you mean by a checkpoint of 1.4 or 1.5. If the conditioning parts (τ_θ) work in the same way, I think you can just try it, although I intuitively suspect it might not work.

mmash98 commented 1 year ago

@otamic Have you tested layout-to-image with bounding boxes? I'm trying to find the attention block for it, but haven't succeeded yet.

otamic commented 1 year ago

@mmash98

If I had succeeded in reproducing the layout to image results, I would have posted here. But the truth is, since I first posted here, I've been occupied with other stuff. I do have an interest in testing that, but I can't pick this up until next month. Hopefully someone will share his work by then.

stillbetter commented 1 year ago

class FlickrSegTrain(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv='data/flickr/flickr_train.txt',
                         data_root='data/flickr/flickr_images',
                         segmentation_root='data/flickr/flickr_segmentations',
                         size=size, random_crop=random_crop, interpolation=interpolation)

Thanks for the code. I want to know what the txt files look like; I just want to adapt this to my own dataset.

otamic commented 1 year ago

@stillbetter

like this

stillbetter commented 1 year ago

@stillbetter

like this

@stillbetter

like this

Great! But I ran into another problem when running the code as above. The error is below:

Data

train, FlickrSegTrain, 18000
validation, FlickrSegEval, 2000
accumulate_grad_batches = 1
Setting learning rate to 4.80e-05 = 1 (accumulate_grad_batches) * 4 (num_gpus) * 12 (batchsize) * 1.00e-06 (base_lr)
Summoning checkpoint.
logs/2022-11-21T22-57-47_configgit/checkpoints/last.ckpt

Traceback (most recent call last):
  File "main.py", line 724, in <module>
    trainer.fit(model, data)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 454, in fit
    self.data_connector.attach_data(
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 87, in attach_data
    self.attach_datamodule(model, datamodule=datamodule)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 125, in attach_datamodule
    if is_overridden(method, datamodule):
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/utilities/model_helpers.py", line 42, in is_overridden
    is_overridden = instance_attr.__code__ is not super_attr.__code__
AttributeError: 'functools.partial' object has no attribute '__code__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 726, in <module>
    melk()
  File "main.py", line 707, in melk
    trainer.save_checkpoint(ckpt_path)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 391, in save_checkpoint
    _checkpoint = self.dump_checkpoint(weights_only)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 276, in dump_checkpoint
    'state_dict': model.state_dict(),
AttributeError: 'NoneType' object has no attribute 'state_dict'

I would appreciate it if you have any idea about this.

stillbetter commented 1 year ago


Solved. It was just an incompatible pytorch-lightning version.

GiannisPikoulis commented 1 year ago

Hello, thank you for your comments and code. I have a question about the LDM 'segmentation'-conditional pipeline. As I understand it, the conditioning information in this case is concatenated with the noisy encoded input that is fed into the UNet. How is this done? The input image is first fed through the VQGAN and transformed into a latent representation. Does the segmentation map have to be encoded into a latent too? Does this mean we have to train a separate VQGAN just for segmentation maps? Also, what is the purpose of the SpatialRescaler module in this case? As I understand it, HxWx3 input images are transformed into hxwx3 latents with the pretrained VQGAN, while segmentation maps are just interpolated through two downsampling stages (scale=0.5) to reach the same dimensionality and then fed through a 1x1 convolution to turn the 182 input channels into just 3. Am I correct?

Please @otamic, I would appreciate any explanation or comment on this.

otamic commented 1 year ago

@GiannisPikoulis

In my opinion, the purpose of the condition stage model, or SpatialRescaler in this case, is to map the segmentation to the same dimensions the input image is mapped to. Then these two intermediate representations can be concatenated and fed to the UNet.

In section 2.2 of this paper, about conditional DMs, it says "The only modification that needs to be made is to inject c as an extra input to the neural network function approximators." Interpolation plus concatenation is just one way to inject the condition. I think it would still work if you changed how the SpatialRescaler maps the segmentation, or used the 'crossattn' conditioning mechanism that LDM also uses.
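
A rough, untested functional equivalent of what the configured SpatialRescaler (n_stages: 2, in_channels: 182, out_channels: 3) does, just to make the mechanism concrete: the segmentation is never passed through the VQGAN, only downsampled and projected:

import torch
import torch.nn.functional as F

class TinyRescaler(torch.nn.Module):
    def __init__(self, in_ch=182, out_ch=3, n_stages=2):
        super().__init__()
        self.n_stages = n_stages
        self.proj = torch.nn.Conv2d(in_ch, out_ch, kernel_size=1)  # learned 1x1 projection

    def forward(self, seg):                      # seg: (B, 182, 256, 256) one-hot segmentation
        for _ in range(self.n_stages):
            seg = F.interpolate(seg, scale_factor=0.5, mode='bilinear')
        return self.proj(seg)                    # (B, 3, 64, 64), same spatial size as the latent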

otamic commented 1 year ago

Hi guys, I have made an attempt to train the layout-to-image model.

First, prepare the data. I chose to use the COCO dataset, and the folder structure is like this. Then the dataset classes look like:

from taming.data.annotated_objects_coco import AnnotatedObjectsCoco

class COCOTrain(AnnotatedObjectsCoco):
    def __init__(self, size):
        super().__init__(data_path='YOUR DATA PATH/coco',
                         split='train',
                         keys=['image', 'objects_bbox'],
                         no_tokens=8192,
                         target_image_size=size,
                         min_object_area=0.00001,
                         min_objects_per_image=2,
                         max_objects_per_image=30,
                         crop_method='center',
                         random_flip=False,
                         use_group_parameter=True,
                         encode_crop=True)

class COCOValidation(AnnotatedObjectsCoco):
    def __init__(self, size):
        super().__init__(data_path='YOUR DATA PATH/coco',
                         split='validation',
                         keys=['image', 'objects_bbox'],
                         no_tokens=8192,
                         target_image_size=size,
                         min_object_area=0.00001,
                         min_objects_per_image=2,
                         max_objects_per_image=30,
                         crop_method='center',
                         random_flip=False,
                         use_group_parameter=True,
                         encode_crop=True)

And config file:

model:
  base_learning_rate: 2.0e-06
  target: ldm.models.diffusion.ddpm.Layout2ImgDiffusion
  params:
    linear_start: 0.0015
    linear_end: 0.0205
    log_every_t: 100
    timesteps: 1000
    loss_type: l1
    first_stage_key: image
    cond_stage_key: objects_bbox
    image_size: 64
    channels: 3
    conditioning_key: crossattn
    cond_stage_trainable: true

    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 10000 ]
        cycle_lengths: [ 10000000000000 ]
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 64
        in_channels: 3
        out_channels: 3
        model_channels: 128
        attention_resolutions:
        - 8
        - 4
        - 2
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 3
        - 4
        num_head_channels: 32
        use_spatial_transformer: true
        transformer_depth: 3
        context_dim: 512
    first_stage_config:
      target: ldm.models.autoencoder.VQModelInterface
      params:
        ckpt_path: models/first_stage_models/vq-f4/model.ckpt
        embed_dim: 3
        n_embed: 8192
        monitor: val/rec_loss
        ddconfig:
          double_z: false
          z_channels: 3
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.BERTEmbedder
      params:
        n_embed: 512
        n_layer: 16
        vocab_size: 8192
        max_seq_len: 92
        use_tokenizer: false
    monitor: val/loss_simple_ema
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 8
    wrap: false
    num_workers: 5
    train:
      target: ldm.data.coco.COCOTrain
      params:
        size: 256
    validation:
      target: ldm.data.coco.COCOValidation
      params:
        size: 256

lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 5000
        max_images: 8
        increase_log_steps: False

  trainer:
    benchmark: True

Lastly: I don't know the difference between coordinates_bbox in ddpm.py and objects_bbox here, so I simply replaced all coordinates_bbox with objects_bbox in ddpm.py. (Can someone explain the difference?) Also make sure this font exists. Then use python main.py --base <config_above>.yaml -t --gpus 0, to train the model.

Note: my remote machine broke down before I could complete training. I'm not sure whether I can reconnect to that machine, and it will take some days to train again elsewhere (I'm also not sure whether I can find other machines). So I decided to post this here first, hoping someone will train the model and report the results. From my last training run, the model did try to do the job, but the results are not great because of the lack of training time. The model structure described above differs from the one used for the COCO dataset in the original work (described in Table 15 of the LDM paper), but I don't think it matters.
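
For whoever manages to finish a training run: an untested sketch of how inference might look, adapted directly from the semantic-synthesis inference script earlier in this thread. The paths are placeholders, it assumes the dataset classes above were placed in ldm/data/coco.py as the config expects, and I have not been able to verify the objects_bbox handling myself:

import torch
from omegaconf import OmegaConf
from torch.utils.data import DataLoader
from torchvision.utils import save_image

from scripts.sample_diffusion import load_model
from ldm.data.coco import COCOValidation    # the dataset class defined above

config = OmegaConf.load('PATH/TO/layout2img_config.yaml')   # placeholder paths
model, _ = load_model(config, 'PATH/TO/last.ckpt', None, None)

batch_size = 4
loader = DataLoader(COCOValidation(size=256), batch_size=batch_size, shuffle=True)
batch = next(iter(loader))

with torch.no_grad():
    bbox_tokens = batch['objects_bbox'].to('cuda')          # tokenized (class, bbox) sequences
    cond = model.get_learned_conditioning(bbox_tokens)      # BERTEmbedder, injected via cross-attention
    samples, _ = model.sample_log(cond=cond, batch_size=batch_size, ddim=True,
                                  ddim_steps=200, eta=1.)
    samples = model.decode_first_stage(samples)

save_image((samples + 1.0) / 2.0, 'layout_samples.png')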

mauerflitzer commented 1 year ago

Hi @otamic, I was taking a very similar approach, but I train on the Cityscapes dataset and only train the BERTEmbedder; for everything else I use the weights from Stable Diffusion 1.5. I haven't trained enough yet (maybe 7 hours), but I already get results that are somewhat recognizable and resemble the intended layout.

GiannisPikoulis commented 1 year ago


@otamic So, is my understanding correct? The segmentation maps are not transformed into latents through a VQGAN; they are just downsampled to match the dimensions of the input image latents.

pokameng commented 1 year ago

@otamic Hello! Your work is great! I have trained the semantic image synthesis model on COCO, but the results are bad. I want to know how the flickr dataset was constructed. Can you share the dataset with me, or explain how to construct it? Thanks!!!

AlexofNTU commented 1 year ago

@otamic Thanks for your explanation; I can finally run semantic image synthesis. However, I would like to produce fixed results corresponding to the same random seed. I have tried different ways of fixing the random seed and made sure that the random numbers in numpy and torch are exactly the same after assigning the same seed. I also followed the post and Colab example below, which generate the same images in Stable Diffusion by fixing the random seed, and that works very well:

https://huggingface.co/CompVis/stable-diffusion-v1-4/discussions/15

https://colab.research.google.com/github/pcuenca/diffusers-examples/blob/main/notebooks/stable-diffusion-seeds.ipynb#scrollTo=9af32168

However, the results I generate are still very different every time. mask: [mask1] results: [flickr1, flickr2, flickr3]
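
Concretely, the kind of seeding I tried is, in simplified sketch form, something like this, placed right before the sampling call from the inference script above:

import random
import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)            # CPU RNG
torch.cuda.manual_seed_all(seed)   # all CUDA RNGs

samples, _ = model.sample_log(cond=seg, batch_size=batch_size, ddim=True,
                              ddim_steps=200, eta=1.)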

ustczhouyu commented 1 year ago

Hello, I would like to ask about the difference between unconditional and conditional LDM. After the model is trained, does unconditional sampling generate images randomly, rather than based on a given image? So if I want to generate a normal image from a flawed image (without any annotations in the inference phase), should I use a conditional LDM? @shunk031 @AlexofNTU @mauerflitzer @mmash98 @Feanor007

dydxdt commented 1 year ago

Has anyone implemented the code for super-resolution? Thanks very much!

xingshuojing commented 1 year ago


@otamic Hi, the training results look really nice. How many epochs did you train for? I trained for 200 epochs, but the results are a bit poor.

mia01 commented 8 months ago

@mauerflitzer @otamic When you trained Layout2Image, I noticed the BERTEmbedder was used for the conditioning stage. Can you explain this? It confused me: I thought the BERTEmbedder was for caption conditioning only. How does it work in this case? Thanks.

mia01 commented 6 months ago

@mauerflitzer Did you get any good results from your training? I have been trying to train layout-to-image from scratch, but the training is taking too long. I used Colab Pro+ but keep getting disconnected. Is using a checkpoint for 1.5 feasible? This is my result after 9 epochs: [attached image]

SahadevPoudel commented 3 weeks ago


@AlexofNTU Hi, did you solve this issue?