
Understanding Encoder Update Mechanism in Structure VQ-VAE #38

Closed · JeremieDona closed this 1 month ago

JeremieDona commented 1 month ago

Hi ESM3 Team,

First of all, congratulations on your outstanding research work. I am particularly excited by the structure VQ-VAE proposed in your model.

Upon examining your code and the detailed appendices, I observed that you use a Euclidean codebook to compute the quantized version of your codes. You also extract the encoding indices, i.e. the token indices, which the decoder then consumes through an embedding layer.

My question pertains to the gradient flow in your model. Given that the argmin operation that extracts the token indices is non-differentiable, how is the encoder updated with respect to the reconstruction loss? As far as I know, in a vanilla VQ-VAE a straight-through estimator (STE) allows gradients to bypass the quantization step. In your implementation, however, it does not seem that the non-quantized output is used in this manner. Could you please explain how the encoder receives gradients and is updated in your setup?

thayes427 commented 1 month ago

Thank you for your question!

You are correct that the argmin operation is non-differentiable and that an STE allows gradients to bypass the quantization step and flow back to the encoder. We apply this same technique when training the VQ-VAE. Please see this line in particular, where gradients flow back to the encoder outputs, bypassing the quantization: https://github.com/evolutionaryscale/esm/blob/17d48878a9cfad388fdf5ff4d3fe4ea0f0d24839/esm/layers/codebook.py#L79
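
For readers unfamiliar with the trick, below is a minimal, self-contained PyTorch sketch of Euclidean vector quantization with a straight-through estimator. The class name, shapes, and hyperparameters are illustrative assumptions, not the ESM implementation, and auxiliary terms such as the commitment loss and codebook updates (e.g. EMA) are omitted for brevity:

```python
import torch
import torch.nn as nn

class EuclideanCodebookSketch(nn.Module):
    """Illustrative Euclidean vector quantizer with a straight-through
    estimator (STE). A sketch, not the ESM code; commitment and
    codebook losses are omitted."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, seq, dim) encoder outputs.
        flat = z.reshape(-1, z.size(-1))                   # (N, dim)
        # Euclidean distance from each output to every code vector.
        dists = torch.cdist(flat, self.embeddings.weight)  # (N, num_codes)
        indices = dists.argmin(dim=-1)                     # non-differentiable
        z_q = self.embeddings(indices).view_as(z)          # quantized codes

        # Straight-through estimator: the forward pass returns z_q, but
        # (z_q - z).detach() carries no gradient, so in the backward pass
        # quantization acts as the identity and the reconstruction-loss
        # gradient flows through z into the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, indices.view(z.shape[:-1])
```

A quick check that gradients do reach the encoder side despite the argmin:

```python
codebook = EuclideanCodebookSketch(num_codes=4096, dim=128)
z = torch.randn(2, 16, 128, requires_grad=True)
z_q, idx = codebook(z)
z_q.sum().backward()        # stands in for a reconstruction loss
assert z.grad is not None   # gradients flowed through the STE to z
```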

Please let me know if you have any further questions.