I'm wondering why you chose to map to CLIP embeddings of shape 257x768 rather than shape 1x768 from the very beginning. What extra information can the last hidden layer express?
In my view, the hidden layer doesn't encode a complete embedding that represents the image on its own.
For MindEye we mapped to both the CLIP ViT-L/14 final layer (shape 1x768) and the CLIP ViT-L/14 last hidden layer (shape 257x768). For the former, we found that using a pretrained starting point for fine-tuning the diffusion prior really benefited performance--we used a pretrained checkpoint for a text-to-image diffusion prior trained on LAION-Aesthetics. See this github repo for more info: https://github.com/lucidrains/DALLE2-pytorch/tree/main
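In case it helps clarify how the two targets relate, here is a rough sketch (assuming HuggingFace `transformers` and the `openai/clip-vit-large-patch14` checkpoint; this is not our actual training code) of extracting the 1x768 final embedding versus the 257x768 per-token embedding, the latter being the space that image-variation models like Versatile Diffusion condition on:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()

image = Image.open("example.jpg")  # hypothetical input image
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = encoder(pixel_values=pixels)

    # Final layer: CLS token -> post-layernorm -> visual projection
    final_embed = out.image_embeds  # shape (1, 768)

    # Last hidden layer: all 257 tokens (1 CLS + 16x16 patches),
    # each projected down to 768 dims
    tokens = encoder.vision_model.post_layernorm(out.last_hidden_state)  # (1, 257, 1024)
    hidden_embed = encoder.visual_projection(tokens)  # shape (1, 257, 768)
```

The point is that the 1x768 embedding is just the projected CLS token, whereas the 257x768 embedding keeps all the patch tokens and therefore retains spatial/low-level detail that the pooled vector throws away.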
We would have liked to likewise use a pretrained starting point for training our 257x768 diffusion prior, but no such pretrained checkpoint exists! If someone trains a 257x768 checkpoint that we can use as a starting point, it could really improve MindEye reconstructions!
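For reference, here is a rough sketch of training a standard 1x768-style prior on precomputed embeddings with lucidrains/DALLE2-pytorch; the argument names follow that repo's README and may differ across versions, and the embeddings below are random placeholders. A 257x768 prior would need the network modified to denoise all 257 tokens rather than a single pooled embedding:

```python
import torch
from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork

# Prior network sketch (hyperparameters are illustrative, not a recommendation)
prior_network = DiffusionPriorNetwork(
    dim = 768,   # CLIP ViT-L/14 embedding dimension
    depth = 6,
    dim_head = 64,
    heads = 12
)

diffusion_prior = DiffusionPrior(
    net = prior_network,
    image_embed_dim = 768,               # train directly on precomputed embeddings
    timesteps = 100,
    cond_drop_prob = 0.2,
    condition_on_text_encodings = False
)

# Hypothetical precomputed embeddings: conditioning embeds (text, or in MindEye's
# case voxel-derived) and target CLIP image embeds
text_embed  = torch.randn(4, 768)
image_embed = torch.randn(4, 768)

loss = diffusion_prior(text_embed = text_embed, image_embed = image_embed)
loss.backward()
```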
To tackle this issue you might benefit from joining the #dalle2-prior channel in the LAION Discord server.