Despite the names, these architectures couldn't be more different. As such, you would need to train a new model if you wanted this to work with GTA V, unfortunately.
That said, the VQGAN presented by CompVis is highly efficient and represents images with much better fidelity:
https://github.com/CompVis/taming-transformers
You can see their pretrained 1024-token model in action in DALLE-pytorch, to give just one example. It helped us significantly improve image quality compared to using a traditional VAE.
I'm also still learning how this repo specifically works, but with the VQGAN, each token generally represents a 16x16-pixel square patch of the image, as sketched below.
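Just to make the arithmetic concrete (this is not this repo's actual API, and all of the sizes here are my own assumptions): a 256x256 frame split into 16x16-pixel patches comes out to a 16x16 grid, i.e. 256 discrete codebook tokens per frame.

```python
import torch

# Dummy 256x256 RGB frame, batch of 1 (sizes are just illustrative).
frame = torch.randn(1, 3, 256, 256)

patch_size = 16                         # pixels covered by each VQGAN token (assumption)
grid_h = frame.shape[-2] // patch_size  # 16
grid_w = frame.shape[-1] // patch_size  # 16
num_tokens = grid_h * grid_w            # 256 tokens per frame

# A VQGAN encoder would map the frame to `num_tokens` integer codebook indices,
# e.g. values in [0, 1024) for the 1024-entry codebook mentioned above.
codebook_size = 1024
tokens = torch.randint(0, codebook_size, (1, num_tokens))
print(tokens.shape)  # torch.Size([1, 256])
```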
One way you might be able to get this to work would be to train a sort of multimodal transformer, encoding user input as one modality and the game's visuals as another. With the transformer approach you would also need to consider the time axis, since, as I understand it, previous frames are taken as input.
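Here's a very rough sketch of what I mean, just to make the idea concrete. Everything here (class name, layer sizes, the way actions and frame tokens are interleaved) is my own assumption, not anything taken from this repo: each timestep contributes one user-input token plus the frame's VQGAN tokens, and a causal mask handles the time axis.

```python
import torch
import torch.nn as nn

class FrameActionTransformer(nn.Module):
    """Toy multimodal transformer: per timestep, one user-input token plus
    the frame's VQGAN tokens, with a causal mask over the flattened sequence."""

    def __init__(self, codebook_size=1024, num_actions=16,
                 tokens_per_frame=256, dim=512, depth=6, heads=8, max_steps=8):
        super().__init__()
        self.frame_emb = nn.Embedding(codebook_size, dim)   # VQGAN codebook ids
        self.action_emb = nn.Embedding(num_actions, dim)    # discrete user inputs
        self.pos_emb = nn.Embedding(max_steps * (tokens_per_frame + 1), dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_logits = nn.Linear(dim, codebook_size)      # predict next frame tokens

    def forward(self, frame_tokens, actions):
        # frame_tokens: (batch, time, tokens_per_frame) codebook indices
        # actions:      (batch, time) user-input ids
        b, t, n = frame_tokens.shape
        x = torch.cat([self.action_emb(actions).unsqueeze(2),  # (b, t, 1, dim)
                       self.frame_emb(frame_tokens)], dim=2)    # (b, t, n, dim)
        x = x.reshape(b, t * (n + 1), -1)
        x = x + self.pos_emb(torch.arange(x.shape[1], device=x.device))
        # Causal mask so each position only attends to earlier tokens (the time axis).
        mask = torch.triu(torch.full((x.shape[1], x.shape[1]), float("-inf"),
                                     device=x.device), diagonal=1)
        x = self.transformer(x, mask=mask)
        return self.to_logits(x)

# Toy usage: 2 timesteps, each with one action id and 256 frame tokens.
model = FrameActionTransformer()
frames = torch.randint(0, 1024, (1, 2, 256))
actions = torch.randint(0, 16, (1, 2))
logits = model(frames, actions)
print(logits.shape)  # torch.Size([1, 514, 1024])
```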
Unfortunately I'm still learning about a lot of this myself, but I think this project is awesome for teaching purposes, so I'd love to see it continue to grow and improve!
There's a lot of new research in this area, including the newly released Alias-Free-GAN paper, which could also significantly improve results while maintaining a similar architecture.
Are there any plans to take advantage of these recent improvements? I assume the core team is still busy fixing things up and may not have had time to consider any of this yet, so "no" is certainly an acceptable answer :)