CompVis / taming-transformers

Taming Transformers for High-Resolution Image Synthesis
https://arxiv.org/abs/2012.09841
MIT License

How would I do RGB 2 RGB image2image translation with this repo? #51

Open adeptflax opened 3 years ago

adeptflax commented 3 years ago

I have 512x512 pixel images I would like to do image2image translation on.

adeptflax commented 3 years ago

I don't understand how the config works.

adeptflax commented 3 years ago

I think I figured out how to do this. I'll try training a model tomorrow.

Guthman commented 3 years ago

@adeptflax Can you share your code?

adeptflax commented 3 years ago

@Guthman I'm still working on it. I got it to train. I need to test the model.

adeptflax commented 3 years ago

I'll publish the code once I get it working.

1211sh commented 3 years ago

Can you share your intuition? I have no idea how to revise this to work on the I2I task.

adeptflax commented 3 years ago

The codebase is pretty much spaghetti code. I tried modifying drin, because it was doing something similar to image2image. The way I tried to modify it didn't work, but I think I know one of the problems.

adeptflax commented 3 years ago

I think I got it working. I only have the first epoch of my model trained, so I need to wait for it to finish to know for sure. I'll write a guide and then publish the code I used.

adeptflax commented 3 years ago

I had to fix something, but I do seem to have gotten it working. I'll post a guide tomorrow if it works well.

adeptflax commented 3 years ago

Sorry guys, I procrastinated for a couple of days. I have gotten code to work that can train and run an image2image model. I don't know how it compares to pix2pixHD. I slightly screwed up the input data on the dataset I was training on, though I should be able to recover from it without completely restarting training.

adeptflax commented 3 years ago

Here it is. Should work. https://github.com/adeptflax/image2image

adeptflax commented 3 years ago

@Guthman @1211sh I don't seem to get very good results by epoch 36 on around 11,000 training examples. Does it just need to be trained for longer, or does something need to be changed? Any guesses? My output is faces; the hair and eyebrows don't have detail.

Guthman commented 3 years ago

I don't remember where I read it (can't find it atm), but I think the authors trained theirs for five days on a V100 or something similar, so I think you have a bit to go. I'm training one on portrait paintings (~40k images), and although the reconstructions are starting to look okay (after 34 epochs, I think):

[reconstruction sample: reconstructions_gs-091070_e-000080_b-000750]

the validation examples weren't close to acceptable:

[validation sample: vq_val]

I basically copied the imagenet config but used a batch size of 8.
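
In case it's useful, here is roughly what that change amounts to, sketched with OmegaConf as used by the repo's main.py. The config filename and field paths are from memory, so treat them as assumptions rather than exact repo contents:

```python
# Minimal sketch (not verified against the repo): copy the ImageNet VQGAN
# config and drop the batch size to 8. Filenames and field paths are assumed.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/imagenet_vqgan.yaml")   # assumed name of the stock config
cfg.data.params.batch_size = 8                        # the only change I made
# You also need to point data.params.train / data.params.validation at your
# own dataset class instead of the ImageNet one before training.
OmegaConf.save(cfg, "configs/my_portraits_vqgan.yaml")
```

Training then goes through the usual entry point, something like `python main.py --base configs/my_portraits_vqgan.yaml -t True --gpus 0,`.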

I switched to StyleGAN2-ADA to finish my current project, but I'll come back to VQGAN.

adeptflax commented 3 years ago

@Guthman I saved the model output and just used pix2pixHD instead. Though pix2pixHD doesn't do as well as I need. Do you think random cropping would help?

adeptflax commented 3 years ago

Maybe using the transformer instead of just the VQGAN would work? Maybe it's possible to pretrain on a face dataset? I'm doing stuff with faces.

adeptflax commented 3 years ago

I've trained on 2 RTX 3090s for 2 days, I think. So I would have to train for another 6 days or so, because 512x512 is 4 times larger than 256x256?

adeptflax commented 3 years ago

@Guthman what's the resolution of your dataset?

adeptflax commented 3 years ago

Do the transformer models first pre-train a VQGAN and then do the transformer training on top of it?

adeptflax commented 3 years ago

I wonder what the problem is on https://github.com/CompVis/taming-transformers/issues/52.

adeptflax commented 3 years ago

Actually, it seems you need to first train a VQGAN model, and then you can train a transformer. Maybe that's the problem with #52. You would first train a model with faceshq_vqgan.yaml and then train a transformer with faceshq_transformer.yaml using that first VQGAN model.
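
To make the dependency concrete, my understanding (the field names are my reading of faceshq_transformer.yaml and may be slightly off) is that the transformer config has to point its first stage at the VQGAN checkpoint you trained in step one, e.g.:

```python
# Sketch of wiring stage two to stage one; the path is a placeholder and the
# exact field names are my assumption about faceshq_transformer.yaml.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/faceshq_transformer.yaml")
cfg.model.params.first_stage_config.params.ckpt_path = (
    "logs/<your_faceshq_vqgan_run>/checkpoints/last.ckpt"  # checkpoint from step one
)
OmegaConf.save(cfg, "configs/my_faceshq_transformer.yaml")
```

Then train the transformer with something like `python main.py --base configs/my_faceshq_transformer.yaml -t True --gpus 0,`.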

adeptflax commented 3 years ago

Does the transformer just modify the encodings?

adeptflax commented 3 years ago

OK, I seem to be correct. In drin they created a depth VQGAN and an ImageNet VQGAN model. So the whole drin pipeline goes depth VQGAN model -> transformer -> image VQGAN model. So basically drin_transformer.yaml trains a model that converts the depth embeddings into ImageNet embeddings.
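
Here's a sketch of how I read the sampling path. The method names, signatures, and the 16x16 / embed_dim=256 latent shape are my assumptions about taming/models/cond_transformer.py, not something I've checked against the repo, so double-check before relying on it:

```python
# Sketch of the depth VQGAN -> transformer -> image VQGAN path.
# All method names, signatures, and shapes below are assumptions.
import torch
from omegaconf import OmegaConf
from taming.models.cond_transformer import Net2NetTransformer

config = OmegaConf.load("configs/drin_transformer.yaml")
model = Net2NetTransformer(**config.model.params).eval()  # wraps both VQGANs + the transformer

@torch.no_grad()
def depth_to_image(depth):
    # depth: (B, C, 256, 256) tensor in whatever range the depth VQGAN expects
    _, c_indices = model.encode_to_c(depth)               # depth VQGAN -> conditioning code indices
    z_start = c_indices[:, :0]                            # empty start sequence for sampling
    index_sample = model.sample(z_start, c_indices,       # transformer predicts image code indices
                                steps=16 * 16, sample=True)
    zshape = (depth.shape[0], 256, 16, 16)                # (B, embed_dim, h, w); assumed for f=16 models
    return model.decode_to_img(index_sample, zshape)      # image VQGAN -> RGB
```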

adeptflax commented 3 years ago

In my repo I modified the reconstruction code to do x -> y instead of x -> x, which isn't correct.

adeptflax commented 3 years ago

@Guthman did you set n_embed to 16384 or not? "model.params.n_embed" should be 16384.
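
For reference, a quick way to check it (as far as I can tell the field lives under model.params in the VQGAN configs; the config path here is a placeholder):

```python
# Print the codebook settings of a config; path is a placeholder.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/my_vqgan.yaml")
print(cfg.model.params.n_embed)    # codebook size (number of discrete codes)
print(cfg.model.params.embed_dim)  # dimensionality of each codebook entry
```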

adeptflax commented 3 years ago

OK, I got an image2image transformer working. I will submit a pull request in the next few days.