The pretrained encoder gives us a very large latent space (dim=2048), which results in way too many parameters in the decoders and feature-classifier (even without resblocks). To solve this we can learn a projection at each decoder to reduce the dimensionality.
The pretrained encoder gives us a very large latent space (dim=2048), which results in way too many parameters in the decoders and feature-classifier (even without resblocks). To solve this we can learn a projection at each decoder to reduce the dimensionality.