Open gene-rative opened 1 year ago
same question
Thank you very much for publishing your excellent research results. I am also interested in reproducing the layout-to-image model. Is there any reproduction code available? Thank you in advance for your consideration.
Also waiting for the release of the pretrained layout-to-image model trained from scratch on COCO and the dataset code. Thanks!!
Also waiting for the semantic synthesis training pipeline
Hi,
I managed to train the semantic image synthesis model. I first collected the Flickr data according to the README of the taming-transformers repo, and used sflckr.py as the training dataset.
Then I wrote my yaml config file, based on the existing yaml config file:
model:
  base_learning_rate: 1.0e-06
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.0015
    linear_end: 0.0205
    log_every_t: 100
    timesteps: 1000
    loss_type: l1
    first_stage_key: image
    cond_stage_key: segmentation
    image_size: 64
    channels: 3
    concat_mode: true
    cond_stage_trainable: true
    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 10000 ]
        cycle_lengths: [ 10000000000000 ]
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 64
        in_channels: 6
        out_channels: 3
        model_channels: 128
        attention_resolutions:
        - 32
        - 16
        - 8
        num_res_blocks: 2
        channel_mult:
        - 1
        - 4
        - 8
        num_heads: 8
    first_stage_config:
      target: ldm.models.autoencoder.VQModelInterface
      params:
        embed_dim: 3
        n_embed: 8192
        ckpt_path: models/first_stage_models/vq-f4/model.ckpt
        ddconfig:
          double_z: false
          z_channels: 3
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.SpatialRescaler
      params:
        n_stages: 2
        in_channels: 182
        out_channels: 3
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 12
    num_workers: 5
    wrap: False
    train:
      target: ldm.data.flickr.FlickrSegTrain # PUT YOUR DATASET
      params:
        size: 256
    validation:
      target: ldm.data.flickr.FlickrSegEval # PUT YOUR DATASET
      params:
        size: 256
lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 5000
        max_images: 8
        increase_log_steps: False
  trainer:
    benchmark: True
And last, I ran
python main.py --base <config_above>.yaml -t --gpus 0,
to train the model.
It did work. Here is a result coming from my training process:
conditions
samples
By the way, I found that the original yaml config file doesn't load a ckpt in first_stage_config:
first_stage_config:
  target: ldm.models.autoencoder.VQModelInterface
  params:
    embed_dim: 3
    n_embed: 8192
    ckpt_path: models/first_stage_models/vq-f4/model.ckpt # this line is missing
    ddconfig:
      double_z: false
I wonder whether this is the reason for failing in inference.
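One way to catch this before sampling is to look up ckpt_path in the loaded config and confirm that the file is really there. The sketch below is only an illustration: it works on a plain nested dict (for example what OmegaConf.to_container returns for the loaded config), and the helper name first_stage_ckpt_status is made up here and is not part of the ldm code base.

```python
import os


def first_stage_ckpt_status(config: dict) -> str:
    """Report whether the first stage checkpoint is configured and present.

    Hypothetical helper, not part of ldm. `config` is a plain nested dict
    with the same layout as the yaml file above.
    """
    params = (config.get("model", {})
                    .get("params", {})
                    .get("first_stage_config", {})
                    .get("params", {}))
    ckpt_path = params.get("ckpt_path")
    if ckpt_path is None:
        return "missing ckpt_path"
    if not os.path.isfile(ckpt_path):
        return "file not found: " + ckpt_path
    return "ok"


if __name__ == "__main__":
    # Same situation as the config without the ckpt_path line.
    cfg = {"model": {"params": {"first_stage_config": {
        "target": "ldm.models.autoencoder.VQModelInterface",
        "params": {"embed_dim": 3, "n_embed": 8192}}}}}
    print(first_stage_ckpt_status(cfg))  # missing ckpt_path
```

If the status is anything other than "ok", the first stage autoencoder is probably running with random weights, which would fit the noisy outputs reported in this thread.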
@otamic I saw your fantastic results. I am struggling with how to run inference (test) with the pretrained model to generate landscape images from segmentation images. Could you share your inference (test) code?
@YorkNishi999
This is my inference code, which mostly comes from log_images in ddpm:
import torch
import numpy as np
from scripts.sample_diffusion import load_model
from omegaconf import OmegaConf
from torch.utils.data import Dataset, DataLoader
from torchvision.utils import save_image
from einops import rearrange
from ldm.data.flickr import FlickrSegEval


def ldm_cond_sample(config_path, ckpt_path, dataset, batch_size):
    config = OmegaConf.load(config_path)
    model, _ = load_model(config, ckpt_path, None, None)

    # Take one randomly shuffled batch of segmentation maps from the dataset.
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    x = next(iter(dataloader))
    seg = x['segmentation']

    with torch.no_grad():
        seg = rearrange(seg, 'b h w c -> b c h w')
        # RGB visualization of the label map, saved next to the samples.
        condition = model.to_rgb(seg)

        seg = seg.to('cuda').float()
        seg = model.get_learned_conditioning(seg)

        samples, _ = model.sample_log(cond=seg, batch_size=batch_size, ddim=True,
                                      ddim_steps=200, eta=1.)
        samples = model.decode_first_stage(samples)

    save_image(condition, 'cond.png')
    save_image(samples, 'sample.png')


if __name__ == '__main__':
    # Raw strings keep the Windows backslashes from being read as escapes.
    config_path = r'models\ldm\semantic_synthesis256\config.yaml'
    ckpt_path = r'models\ldm\semantic_synthesis256\model.ckpt'

    dataset = FlickrSegEval(size=256)

    ldm_cond_sample(config_path, ckpt_path, dataset, 4)
Note that there's one line missing in the config file, as I described above.
I simply picked some segmentations from the dataset to generate images; you may want to change that part to suit your needs.
@otamic I am very grateful to you for sharing your code!
I used your code and generated images, but they are low quality. I want to make sure I understand: you first train the model starting from ckpt_path = 'models\ldm\semantic_synthesis256\model.ckpt', and then you run inference (generate) images from the semantic images. Am I correct?
My generated image is here:
@YorkNishi999
In fact, models\ldm\semantic_synthesis256\model.ckpt refers to the pretrained model downloaded from Pretrained LDMs when I wrote this code.
To test your own trained model, just change the path to something like logs/xxxx/checkpoints/last.ckpt after a training process. (So you are right.)
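If you have several runs under logs/, a small helper can pick the most recent last.ckpt for you. This is a sketch that only assumes the logs/<run>/checkpoints/last.ckpt layout mentioned above; the function name newest_last_ckpt is made up for illustration.

```python
import glob
import os
from typing import Optional


def newest_last_ckpt(log_root: str = "logs") -> Optional[str]:
    """Return the most recently modified <log_root>/*/checkpoints/last.ckpt.

    Hypothetical helper; returns None when no run has a last.ckpt yet.
    """
    pattern = os.path.join(log_root, "*", "checkpoints", "last.ckpt")
    candidates = glob.glob(pattern)
    if not candidates:
        return None
    # The file with the latest modification time belongs to the latest run.
    return max(candidates, key=os.path.getmtime)
```

The returned path can then be passed as ckpt_path to ldm_cond_sample in place of the downloaded model.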
This is a result tested on the downloaded model:
condition
sample
And my trained model:
condition
sample
It works fine here. So I wonder whether you just haven't trained your model long enough.
Works perfectly on my side, thanks @otamic !
@otamic Thank you for sharing your experiments!! I will retry it with some training..
@otamic I got the fine results after looking for my bugs (it is my fault).
Thank you again for your kindness!
@otamic Wow, that's nice. Can you share your dataloader code? I want to be sure about something. I will write my own :D
@SerdarHelli
I think you mean the dataset class in the config file:
data:
  ...
  params:
    ...
    train:
      target: ldm.data.flickr.FlickrSegTrain # PUT YOUR DATASET
      ...
    validation:
      target: ldm.data.flickr.FlickrSegEval # PUT YOUR DATASET
      ...
If so, I used the code from sflckr.py as described above. There is an Examples class in the script:
class Examples(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv="data/sflckr_examples.txt",
                         data_root="data/sflckr_images",
                         segmentation_root="data/sflckr_segmentations",
                         size=size, random_crop=random_crop, interpolation=interpolation)
And I added dataset classes pointing to my own data (collected according to this), like this:
class FlickrSegTrain(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv='data/flickr/flickr_train.txt',
                         data_root='data/flickr/flickr_images',
                         segmentation_root='data/flickr/flickr_segmentations',
                         size=size, random_crop=random_crop, interpolation=interpolation)


class FlickrSegEval(SegmentationBase):
    def __init__(self, size=None, random_crop=False, interpolation="bicubic"):
        super().__init__(data_csv='data/flickr/flickr_eval.txt',
                         data_root='data/flickr/flickr_images',
                         segmentation_root='data/flickr/flickr_segmentations',
                         size=size, random_crop=random_crop, interpolation=interpolation)
That's all I have done. (These are only a few changes, so I didn't post them before.)
To this point, I believe I have written everything needed to reproduce the semantic synthesis result.
Yes, thanks @otamic. https://github.com/CompVis/taming-transformers/blob/master/taming/data/sflckr.py This is the one I was looking for, actually. I know they wrote it, but I didn't check it out :D
@otamic I have trained semantic synthesis 255 on cityscapes with the same config you shared, but I'm getting this image as a result. Do you have any idea why this can happen?
I think you should check your config. For example, is the input channel count of your condition stage still 182? How many labels does your dataset have?
@SerdarHelli I have changed it as well; in my case it is 35.
I see. Did you check your batches? I don't know, maybe you haven't trained long enough.
@mmash98
Could you try a smaller batch size, such as 4? If it can't help, I have no other idea.
I think he didn't train long enough. At 5k steps, I am getting the same results.
Guys, in addition, should we train our own VQGAN? I think we should train the VQGAN on our own data if our domain is very different.
Edit: I am getting worse results with ldm + vq-f4 than with a GAN for semantic image synthesis. Probably I should train more. Or my data is too limited for ldm. Maybe ldm is not good on limited data.
Also, you can train on Colab. I can share code.
@otamic Hey, may I ask a question? I followed your yaml and inference.py to train on DeepFashion images, whose semantic maps have 24 categories, and I changed 182 to 24. But my results are strange, as shown below. Is there anything else I should pay attention to, or did I do something wrong? Looking forward to your reply, thanks so much!
@Kai-0515
I think you didn't successfully load the pretrained first stage model. Check that the missing line I mentioned is added, and make sure there is the ckpt file. I actually did have similar results, which is how I found the missing line.
@otamic You're right! Thanks very much for your quick reply!
Unlike the GAN methods, the condition is converted to an RGB image in ldm. So your categories must be correct, otherwise you will give a wrong cond. Also, you must be sure about the autoencoder (vq, kl).
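Several of the problems above came down to the label ids in the segmentation maps not matching in_channels of the SpatialRescaler (182 for the Flickr setup, 24 or 35 in the other posts). A small check along these lines may help. It is a sketch on plain Python lists; check_label_ids and one_hot are hypothetical helpers, and in practice the ids would be read from your own segmentation images.

```python
def check_label_ids(label_ids, n_labels):
    """Return the sorted label ids that fall outside [0, n_labels)."""
    return sorted({i for i in label_ids if i < 0 or i >= n_labels})


def one_hot(label_id, n_labels):
    """One-hot vector for a single pixel, as the condition stage expects."""
    if not 0 <= label_id < n_labels:
        raise ValueError(f"label id {label_id} outside [0, {n_labels})")
    return [1.0 if k == label_id else 0.0 for k in range(n_labels)]


if __name__ == "__main__":
    # Made-up pixel labels for a dataset declared with 24 categories.
    pixels = [0, 3, 23, 24, 255]
    print(check_label_ids(pixels, n_labels=24))  # [24, 255]
```

Any id reported here (an "ignore" value such as 255, or ids that start at 1 instead of 0) would need remapping before the one-hot encoding, or in_channels has to be raised to cover it.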
Has anybody trained a model for the layout2image task yet? I'm not quite sure what my bounding box input is supposed to look like, and what a proper configuration would be. Thank you so much for any input. I know the layout2img-openimages256 config exists, but I'm not sure what the input is supposed to look like.
@otamic do I understand it correctly that you train everything from scratch, the whole model except for the vq-f4? Is it also possible to skip training the unet and vae and only train the conditioning part?
@mauerflitzer
You are correct about my training. In my opinion, training only the conditioning part is not possible with LDM. First, how would this training be supervised? Second, the UNet structures of the conditional and the unconditional model are different. In this case, the number of channels at the UNet input is doubled when conditioned. What you describe sounds more like Classifier Guided Diffusion, which is a different kind of conditioning.
@otamic I thought about freezing the unet and vae weights and taking a released checkpoint of 1.4 or maybe 1.5 and then swap out the conditioning part for the new one and start training on that.
@mauerflitzer
Sorry, I don't understand what you mean by the checkpoint of 1.4 or 1.5. If the conditioning parts (τ_θ) work in the same way, I think you can just try it, although my intuition is that it might not work.
@otamic have you tested layout to image with bounding boxes? I'm trying to find the attention block for it, but haven't succeeded yet.
@mmash98
If I had succeeded in reproducing the layout to image results, I would have posted here. But the truth is, since I first posted here, I've been occupied with other stuff. I do have an interest in testing that, but I can't pick this up until next month. Hopefully someone will share his work by then.
Thanks for the FlickrSegTrain code. I would like to know what the txt file looks like, since I want to adapt it to my own dataset.
@stillbetter like this
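For anyone building the list for their own data, here is a minimal sketch. It assumes that SegmentationBase in sflckr.py reads the txt file as one relative image filename per line and looks up the segmentation under the same name with .jpg replaced by .png; please check this against your copy of sflckr.py. The function name write_file_list is made up for illustration.

```python
import os


def write_file_list(image_dir, seg_dir, out_txt):
    """Write one image filename per line, keeping only images with a mask.

    Assumption: images are <name>.jpg in image_dir and masks are
    <name>.png in seg_dir.
    """
    names = []
    for name in sorted(os.listdir(image_dir)):
        if not name.endswith(".jpg"):
            continue
        mask_name = name.replace(".jpg", ".png")
        if os.path.isfile(os.path.join(seg_dir, mask_name)):
            names.append(name)
    with open(out_txt, "w") as f:
        f.write("\n".join(names))
    return names
```

For the dataset classes above this would be called as write_file_list('data/flickr/flickr_images', 'data/flickr/flickr_segmentations', 'data/flickr/flickr_train.txt'), after splitting the images into train and eval sets however you prefer.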
Great! But I ran into another problem when running the code as you posted it above. The error is below:
Data
train, FlickrSegTrain, 18000
validation, FlickrSegEval, 2000
accumulate_grad_batches = 1
Setting learning rate to 4.80e-05 = 1 (accumulate_grad_batches) * 4 (num_gpus) * 12 (batchsize) * 1.00e-06 (base_lr)
Summoning checkpoint. logs/2022-11-21T22-57-47_configgit/checkpoints/last.ckpt
Traceback (most recent call last):
  File "main.py", line 724, in <module>
    trainer.fit(model, data)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 454, in fit
    self.data_connector.attach_data(
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 87, in attach_data
    self.attach_datamodule(model, datamodule=datamodule)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 125, in attach_datamodule
    if is_overridden(method, datamodule):
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/utilities/model_helpers.py", line 42, in is_overridden
    is_overridden = instance_attr.__code__ is not super_attr.__code__
AttributeError: 'functools.partial' object has no attribute '__code__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 726, in <module>
    melk()
  File "main.py", line 707, in melk
    trainer.save_checkpoint(ckpt_path)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 391, in save_checkpoint
    _checkpoint = self.dump_checkpoint(weights_only)
  File "/home/huangzhiwei/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 276, in dump_checkpoint
    'state_dict': model.state_dict(),
AttributeError: 'NoneType' object has no attribute 'state_dict'
I would appreciate it if you have any idea about this.
Solved. Just because of the pytorch-lightning incompatible version.
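A side note on the log above: the learning rate that is actually set is not the base_learning_rate from the config. The "Setting learning rate" line shows it being multiplied by accumulate_grad_batches, the number of GPUs and the batch size. A minimal sketch of that arithmetic (the function name is made up here):

```python
def scaled_learning_rate(base_lr, batch_size, num_gpus, accumulate_grad_batches=1):
    """Effective learning rate as reported in the 'Setting learning rate' log line."""
    return accumulate_grad_batches * num_gpus * batch_size * base_lr


if __name__ == "__main__":
    # Values from the log: 1 accumulation step, 4 GPUs, batch size 12, base_lr 1e-6.
    print(f"{scaled_learning_rate(1.0e-06, batch_size=12, num_gpus=4):.2e}")  # 4.80e-05
```

So, when this scaling is active, changing the batch size or the number of GPUs also changes the effective learning rate, which is worth keeping in mind when comparing runs from different posts.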
Hello, thank you for your comments and code. I have a question about the LDM 'segmentation'-conditional pipeline. As I understand it, the conditional info in this case is concatenated with the noisy encoded input that is fed into the UNet model. How is this done? I mean, the input is first fed through the VQGAN model and is transformed to a latent representation. Does the segmentation map have to be encoded to a latent too? Does this mean we have to train a separate VQGAN just for segmentation maps? Also, what is the purpose of the SpatialRescaler module in this case? As I understand, HxWx3 input images are transformed to hxwx3 latents with the pretrained VQGAN, but segmentation maps are just interpolated through two downsampling stages (scale=0.5) in order to reach the same dimensionality and are also fed through a 1x1 convolution in order to turn the 182 input channels to just 3. Am I correct?
Please @otamic, I would appreciate any explanation or comment on this.
@GiannisPikoulis
In my opinion, the purpose of the condition stage model (SpatialRescaler in this case) is to map the segmentation to the same dimensions that the input image is mapped to. Then these two intermediate representations can be concatenated and fed to the UNet.
Section 2.2 of this paper, about the conditional DM, says: "The only modification that needs to be made is to inject c as a extra input to the neural network function approximators". 'Interpolation and concatenation' is just one way to inject the condition. I think it will still work if you change how SpatialRescaler maps the segmentation, or if you use the 'crossattn' conditioning mechanism that ldm uses.
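The shape bookkeeping behind this can be written down in a few lines. The sketch below is plain Python arithmetic, not the ldm implementation. It assumes a downsampling factor of 4 for the vq-f4 first stage (consistent with resolution: 256 and image_size: 64 in the config) and two 0.5x rescaling stages followed by a 1x1 convolution for the SpatialRescaler, as described in the question above. All function names are made up for illustration.

```python
def latent_shape(height, width, downsample_factor=4, z_channels=3):
    """(channels, h, w) of the image latent produced by the first stage."""
    return (z_channels, height // downsample_factor, width // downsample_factor)


def rescaled_condition_shape(height, width, n_stages=2, multiplier=0.5, out_channels=3):
    """(channels, h, w) of the segmentation after rescaling and 1x1 conv.

    The 1x1 convolution only changes the channel count (e.g. 182 -> 3),
    so the number of input channels does not affect the output shape.
    """
    for _ in range(n_stages):
        height = int(height * multiplier)
        width = int(width * multiplier)
    return (out_channels, height, width)


def unet_input_channels(latent, cond):
    """Channel count after concatenating latent and condition along channels."""
    assert latent[1:] == cond[1:], "spatial sizes must match to concatenate"
    return latent[0] + cond[0]


if __name__ == "__main__":
    z = latent_shape(256, 256)              # (3, 64, 64)
    c = rescaled_condition_shape(256, 256)  # (3, 64, 64)
    print(z, c, unet_input_channels(z, c))  # (3, 64, 64) (3, 64, 64) 6
```

The result, 6, matches in_channels: 6 in the unet_config of the semantic synthesis config, while out_channels stays at 3 because the UNet only predicts the noise on the image latent.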
Hi guys, I have made some attempt to train the layout to image model.
First, prepare the data. I chose to use the coco dataset, and the folder structure is like this. Then, the dataset class is like:
from taming.data.annotated_objects_coco import AnnotatedObjectsCoco


class COCOTrain(AnnotatedObjectsCoco):
    def __init__(self, size):
        super().__init__(data_path='YOUR DATA PATH/coco',
                         split='train',
                         keys=['image', 'objects_bbox'],
                         no_tokens=8192,
                         target_image_size=size,
                         min_object_area=0.00001,
                         min_objects_per_image=2,
                         max_objects_per_image=30,
                         crop_method='center',
                         random_flip=False,
                         use_group_parameter=True,
                         encode_crop=True)


class COCOValidation(AnnotatedObjectsCoco):
    def __init__(self, size):
        super().__init__(data_path='YOUR DATA PATH/coco',
                         split='validation',
                         keys=['image', 'objects_bbox'],
                         no_tokens=8192,
                         target_image_size=size,
                         min_object_area=0.00001,
                         min_objects_per_image=2,
                         max_objects_per_image=30,
                         crop_method='center',
                         random_flip=False,
                         use_group_parameter=True,
                         encode_crop=True)
And config file:
model:
  base_learning_rate: 2.0e-06
  target: ldm.models.diffusion.ddpm.Layout2ImgDiffusion
  params:
    linear_start: 0.0015
    linear_end: 0.0205
    log_every_t: 100
    timesteps: 1000
    loss_type: l1
    first_stage_key: image
    cond_stage_key: objects_bbox
    image_size: 64
    channels: 3
    conditioning_key: crossattn
    cond_stage_trainable: true
    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 10000 ]
        cycle_lengths: [ 10000000000000 ]
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 64
        in_channels: 3
        out_channels: 3
        model_channels: 128
        attention_resolutions:
        - 8
        - 4
        - 2
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 3
        - 4
        num_head_channels: 32
        use_spatial_transformer: true
        transformer_depth: 3
        context_dim: 512
    first_stage_config:
      target: ldm.models.autoencoder.VQModelInterface
      params:
        ckpt_path: models/first_stage_models/vq-f4/model.ckpt
        embed_dim: 3
        n_embed: 8192
        monitor: val/rec_loss
        ddconfig:
          double_z: false
          z_channels: 3
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.BERTEmbedder
      params:
        n_embed: 512
        n_layer: 16
        vocab_size: 8192
        max_seq_len: 92
        use_tokenizer: false
    monitor: val/loss_simple_ema
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 8
    wrap: false
    num_workers: 5
    train:
      target: ldm.data.coco.COCOTrain
      params:
        size: 256
    validation:
      target: ldm.data.coco.COCOValidation
      params:
        size: 256
lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 5000
        max_images: 8
        increase_log_steps: False
  trainer:
    benchmark: True
At last:
I don't know the difference between coordinates_bbox in ddpm.py and objects_bbox here, so I simply replaced every coordinates_bbox with objects_bbox in ddpm.py. (Can someone explain the difference?)
And make sure this font exists.
Then, use
python main.py --base <config_above>.yaml -t --gpus 0,
to train the model.
Note: my remote machine broke down before I could complete my training. I'm not sure whether I can reconnect to that machine, and it would take some days to train again elsewhere (I'm not sure whether I can find another machine either). So I decided to post this here first, hoping someone will train the model and report the results. In my last training run, the model did try to do the job, but the results were not great because of the short training time. The model structure described above is different from the one used for the COCO dataset in the original work (described in Table 15 of the ldm paper), but I don't think it matters.
Hi @otamic, I was doing a very similar approach but I train on the Cityscapes dataset and I only train the BertEmbedder, everything else I use the weights from stable diffusion 1.5. So far I do not have enough training yet (maybe 7 hours), but I get results which are already a little bit understandable and show resemblance of the intended layout.
@otamic So, is my understanding correct? The segmentation maps are not transformed into latents through a VQGAN. They are just downsampled in order to match the dimensions of the input image latents.
@otamic hello bro! Your work is great! I have trained a semantic image synthesis model on COCO, but the results are bad. How did you construct the Flickr dataset? Can you share the dataset with me, or describe the method to construct it? Thanks!!!
@otamic Thanks for your explanation; I can finally produce semantic image synthesis results. However, I would like to produce fixed results for the same random seed. I have tried different ways to fix the random seed and made sure that the random numbers in numpy and torch are exactly the same after assigning the same seed. I also followed the post below and its Colab example, which generates the same images in stable diffusion by fixing the random seed, and it works very well!
https://huggingface.co/CompVis/stable-diffusion-v1-4/discussions/15
However, the results I generate are still very different every time!
mask:
results:
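A note on the seeding question: in the inference script posted earlier in this thread, at least two places draw random numbers. The DataLoader is created with shuffle=True, so a different batch of segmentation maps can be picked on every run, and the sampler is called with eta=1., which makes the DDIM steps stochastic. Both need to happen after the seed is set, or shuffle can simply be turned off. The helper below is a sketch of seeding all the usual generators in one place; the name seed_all is made up, and I have not verified that this alone resolves the problem described above.

```python
import random


def seed_all(seed: int) -> None:
    """Seed Python, and numpy / torch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


if __name__ == "__main__":
    seed_all(1234)
    first = random.random()
    seed_all(1234)
    second = random.random()
    print(first == second)  # True
```

Calling seed_all(...) as the first line of the __main__ block, before the dataset and DataLoader are created, keeps both the batch selection and the sampling noise tied to the seed.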
Hello, I would like to ask about the difference between unconditional LDM and conditional LDM. After the model is trained, does unconditional sampling generate images randomly rather than based on a given image? So, if I want to generate a normal image from a flawed image (without any annotations in the inference phase), should I use conditional LDM? @shunk031 @AlexofNTU @mauerflitzer @mmash98 @Feanor007
Has anyone implemented the code for super-resolution? Thanks very much!
@otamic hi, the training results look really nice. How many epochs did you set in the training process? I trained for 200 epochs, but I found that the results are a little bad.
@mauerflitzer @otamic when you trained the Layout2Image I noticed the BertEmbedder was used for the condition stage. Can you explain this because it confused me. I thought the BertEmbedder was for caption conditioning only? how does it work in this case? thanks
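One possible reading, not confirmed by the authors in this thread: with use_tokenizer: false the BERTEmbedder does no text tokenization; it embeds a sequence of integer ids, and the bounding-box dataset already delivers integers below vocab_size (no_tokens=8192 in the dataset classes above). Under the assumption that each object is encoded as three tokens (class, top-left corner, bottom-right corner) and that encode_crop=True appends two more tokens for the crop, the numbers in the posted config line up. The function below is only that arithmetic, with a made-up name.

```python
def layout_sequence_length(max_objects_per_image, tokens_per_object=3, crop_tokens=2):
    """Length of the conditioning sequence under the stated assumptions.

    tokens_per_object=3 and crop_tokens=2 are assumptions about the
    bounding-box encoding, not values taken from this thread.
    """
    return max_objects_per_image * tokens_per_object + crop_tokens


if __name__ == "__main__":
    # max_objects_per_image=30 comes from the COCO dataset classes above.
    print(layout_sequence_length(30))  # 92
```

That gives 92, the same value as max_seq_len in the cond_stage_config, which is at least consistent with the embedder being used as a plain token-sequence encoder rather than a caption model.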
@mauerflitzer did you get any good results from your training? I have been trying to train from scratch for Layout to Image but the training is taking too long. I used ColabPro+ but keep getting disconnected. Is using a checkpoint for 1.5 feasible?
This is my result from 9 epochs:
Hi, did you solve this issue?
Can you provide inference scripts for semantic image synthesis and layout-to-image synthesis? I tried to use data loaders from the taming-transformers repo but got random noise outputs. The evaluation results are far from those reported in the paper. Thanks!