TakeruShiraishi closed this issue 1 year ago
Thanks for your attention.
The dimension of the latent space is adjustable. We chose 512 for a fair comparison with Diff-AE. In theory, a larger dimension gives better performance. In my experience, 32 is enough for MNIST. If the difference between your autoencoding reconstruction and the input image is visually imperceptible, the dimension is reasonable.
You mentioned that when you manipulate the latent code in a specific direction, the image changes drastically. Are the manipulated images still in domain? If not, you can try decreasing the manipulation strength, or not using the guidance in the last 30% of the sampling steps. If so, the learned direction may be inaccurate, possibly because of wrong annotations or a lack of data.
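As a rough illustration of switching the guidance off in the last 30% of the sampling steps, here is a minimal sketch of a per-step strength schedule. The function name `guidance_strength`, its parameters, and the convention that step 0 is the first denoising step are all assumptions made for illustration; they are not taken from the repository, and how the returned value is fed into the sampler depends on its actual code.

```python
def guidance_strength(step, num_steps, strength, guided_fraction=0.7):
    """Return the manipulation strength to use at one sampling step.

    Hypothetical helper, not part of the repository.
    step: 0-based index in sampling order (0 = first denoising step).
    num_steps: total number of sampling steps.
    strength: manipulation strength used while guidance is active.
    guided_fraction: fraction of early steps that keep the guidance;
        0.7 turns it off for the last 30% of the steps.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    if not 0.0 <= guided_fraction <= 1.0:
        raise ValueError("guided_fraction must be in [0, 1]")
    # Index of the first step that runs without guidance.
    cutoff = int(round(guided_fraction * num_steps))
    return strength if step < cutoff else 0.0


# Example schedule for a 10-step sampler with strength 2.0.
schedule = [guidance_strength(s, 10, 2.0) for s in range(10)]
```

Lowering `strength` and lowering `guided_fraction` are two independent knobs, so it may be worth trying each one separately to see which keeps the edited images in domain.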
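You can directly calculate $\mathrm{mean}(z_i^{+}) - \mathrm{mean}(z_j^{-})$, the difference between the mean latent code of the positive samples $z_i^{+}$ and that of the negative samples $z_j^{-}$, as an editing direction, without any training. Furthermore, you can try an unsupervised method, such as Principal Component Analysis (PCA).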
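Both suggestions can be sketched in a few lines of NumPy, assuming the latent codes have already been extracted into arrays of shape `(num_samples, latent_dim)`. The function names, the normalization of the direction, and the synthetic data are illustrative assumptions, not code from the repository.

```python
import numpy as np


def mean_difference_direction(z_pos, z_neg, normalize=True):
    """Editing direction: mean of positive latents minus mean of negative latents."""
    z_pos = np.asarray(z_pos, dtype=np.float64)
    z_neg = np.asarray(z_neg, dtype=np.float64)
    direction = z_pos.mean(axis=0) - z_neg.mean(axis=0)
    if normalize:
        norm = np.linalg.norm(direction)
        if norm > 0:
            direction = direction / norm
    return direction


def pca_directions(z, num_components=5):
    """Unsupervised candidate directions: top principal axes of the latents."""
    z = np.asarray(z, dtype=np.float64)
    centered = z - z.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes, sorted by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(z) - 1)
    return vt[:num_components], explained_variance[:num_components]


def edit_latent(z, direction, strength):
    """Move a latent code along a direction by the given strength."""
    return np.asarray(z, dtype=np.float64) + strength * direction


# Synthetic stand-in for real latents: the attribute lives on axis 0.
rng = np.random.default_rng(0)
latent_dim = 32
attribute_axis = np.zeros(latent_dim)
attribute_axis[0] = 1.0
z_pos = rng.normal(size=(200, latent_dim)) + 3.0 * attribute_axis
z_neg = rng.normal(size=(200, latent_dim)) - 3.0 * attribute_axis

direction = mean_difference_direction(z_pos, z_neg)
components, variances = pca_directions(np.concatenate([z_pos, z_neg]), num_components=3)
z_edited = edit_latent(z_neg[0], direction, strength=2.0)
```

The PCA axes are unlabeled, so each one still has to be inspected visually, for example by decoding `edit_latent(z, components[k], s)` for a few values of `s`, to see which semantic change it corresponds to, if any.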
Thank you very much! I will follow your advice and try other experiments.
Thank you for releasing your code! Since it is difficult to collect a large amount of data, I would like to reduce the dimension of the latent space. I tried this on my custom dataset, but when I manipulated the latent code in a specific direction, I found that the image changed drastically. I expected the latent space to have a linear structure, but this may not be the case in a lower-dimensional latent space. Is the default dimension of 512 required for representation learning? Alternatively, besides training a classifier, is there a more effortless strategy to discover semantic directions in the latent space?