adeptflax opened 3 years ago
I don't understand how the config works.
I think I figured out how to do this. I'll try training a model tomorrow.
@adeptflax Can you share your code?
@Guthman I'm still working on it. I got it to train. I need to test the model.
I'll publish the code once I get it working
Can you share your intuition? I have no idea how to revise this to work on an I2I task.
The codebase is pretty much spaghetti code. I tried modifying drin, because it was doing something similar to image2image. The way I tried to modify it didn't work, but I think I know one of the problems.
I think I got it working. I only have the first epoch of my model trained. I need to wait for it to finish to know for sure. I'll write a guide and then publish the code I used.
I had to fix something, but I do seem to have gotten it working. I'll post a guide tomorrow if it works well.
Sorry guys, I procrastinated for a couple of days. I have gotten code to work that can train and run an image2image model. I don't know how it compares to pix2pixHD. I slightly screwed up the input data on the dataset I was training on, though I should be able to recover from it without completely restarting training.
Here it is. Should work. https://github.com/adeptflax/image2image
@Guthman @1211sh I don't seem to get that good of results by epoch 36 on around 11,000 training examples. Does it just need to be trained for longer, or does something need to be changed? Any guesses? My outputs are faces; hair and eyebrows don't have detail.
I don't remember where I read it (can't find it atm), but I think the authors trained theirs for five days on a V100 or something similar, so I think you have a bit to go. I'm training one for a bit on portrait paintings (~40k images), and although the reconstructions are starting to look okay (after 34 epochs, I think), the validation examples weren't close to acceptable.
I basically copied the imagenet config but used a batch size of 8.
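For reference, this is roughly the only delta from the stock config. The key layout is my assumption based on how the taming-transformers YAML files are usually structured; double-check against the actual `configs/imagenet_vqgan.yaml`:

```yaml
# Sketch only: everything else copied from the imagenet VQGAN config,
# with just the batch size reduced to fit memory.
data:
  params:
    batch_size: 8
```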
I switched to StyleGAN2-ADA to finish my current project, but I'll come back to VQGAN.
@Guthman I saved the model output, and I just used pix2pixHD. Though pix2pixHD doesn't do as well as I need. Do you think random crop would help?
Maybe using transformers instead of just vqgan would work? Maybe it's possible to pretrain on a face dataset? I'm doing stuff with faces.
I trained on 2 RTX 3090s for 2 days, I think. So I would have to train for another 6 days, because 512x512 is 4 times larger than 256x256?
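Sanity-checking that arithmetic (this assumes training cost scales linearly with pixel count, which is my assumption, not something stated in the paper):

```python
# 512x512 has 4x the pixels of 256x256, so naively the same number of
# epochs should cost ~4x the wall-clock time.
pixels_256 = 256 * 256
pixels_512 = 512 * 512
factor = pixels_512 // pixels_256

days_so_far = 2
total_days = days_so_far * factor
remaining = total_days - days_so_far
print(factor, total_days, remaining)  # 4 8 6
```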
@Guthman what's the resolution of your dataset?
Do transformer models first pre-train with vqgan and then do training on transformers?
I wonder what the problem is on https://github.com/CompVis/taming-transformers/issues/52.
Actually, it seems you need to first train a vqgan model, then you can train a transformer. Maybe that's the problem with #52. You would first train a model with faceshq_vqgan.yaml and then train a transformer with faceshq_transformer.yaml using the first vqgan model.
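If it helps, the two-stage invocation would look something like this (assuming the stock `main.py` entry point from taming-transformers; the exact flags are from memory and may differ):

```shell
# Stage 1: train the VQGAN (codebook + encoder/decoder)
python main.py --base configs/faceshq_vqgan.yaml -t True --gpus 0,

# Stage 2: train the transformer over the frozen VQGAN's code indices
python main.py --base configs/faceshq_transformer.yaml -t True --gpus 0,
```

The transformer config would need to point at the stage-1 checkpoint so it trains on that model's codebook.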
Does the transformer just modify the encodings?
OK, I seem to be correct. In drin they created a depth vqgan and an imagenet vqgan model. So the whole drin pipeline goes depth vqgan model -> transformer -> image vqgan model. So basically drin_transformer.yaml trains a model that converts the depth embeddings into imagenet embeddings.
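To make that data flow concrete, here's a stub sketch of the pipeline. All three functions are hypothetical placeholders, not the real taming-transformers API; shapes assume the usual 16x downsampling of a 256x256 input to a 16x16 grid of codebook indices:

```python
import numpy as np

CODEBOOK_SIZE = 16384  # n_embed in the configs
GRID = 16              # 256 / 16 downsampling factor

def depth_vqgan_encode(depth_map):
    """Quantize a depth map into a grid of codebook indices (stub)."""
    assert depth_map.shape == (256, 256)
    rng = np.random.default_rng(0)
    return rng.integers(0, CODEBOOK_SIZE, size=(GRID, GRID))

def transformer_translate(depth_codes):
    """Predict image-codebook indices conditioned on the depth codes
    (stub: identity mapping, just to show the shapes)."""
    return depth_codes.copy()

def image_vqgan_decode(image_codes):
    """Decode a grid of image-codebook indices back to RGB (stub)."""
    assert image_codes.shape == (GRID, GRID)
    return np.zeros((256, 256, 3))

# depth vqgan -> transformer -> image vqgan
depth = np.zeros((256, 256))
depth_codes = depth_vqgan_encode(depth)
img_codes = transformer_translate(depth_codes)
out = image_vqgan_decode(img_codes)
print(out.shape)  # (256, 256, 3)
```

The point is that the transformer only ever operates on the discrete code grids; the two VQGANs handle pixels.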
I modified the reconstruction code in my repo to do x -> y instead of x -> x, which isn't the correct approach.
@Guthman did you set n_embed to 16384 or no? "model.params.n_embed" should be 16384.
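For reference, the relevant key sits here in the config (sketch only; the surrounding layout is my assumption from the faceshq_vqgan.yaml structure):

```yaml
model:
  params:
    n_embed: 16384   # codebook size; must match what the transformer expects
```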
OK, I got an image2image transformer working. I'll submit a pull request in the next few days.
I have 512x512 pixel images I would like to do image2image translation on.