I'm wondering why you chose to map to CLIP embeddings of shape 257x768 rather than shape 1x768 from the very beginning. What extra information can the last hidden layer express?
In my view, the hidden layer doesn't encode a complete embedding that represents the image on its own.
For MindEye we mapped to both the CLIP ViT-L/14 final layer (shape 1x768) and the CLIP ViT-L/14 last hidden layer (shape 257x768). For the former, we found that using a pretrained starting point for fine-tuning the diffusion prior really benefited performance--we used a pretrained checkpoint for a text-to-image diffusion prior trained on LAION-Aesthetics. See this github repo for more info: https://github.com/lucidrains/DALLE2-pytorch/tree/main
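In case it helps clarify how the two targets relate, here is a rough sketch (assuming HuggingFace `transformers` and the `openai/clip-vit-large-patch14` checkpoint; this is not our actual training code) of extracting the 1x768 final embedding versus the 257x768 per-token embedding, the latter being the space that image-variation models like Versatile Diffusion condition on:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()

image = Image.open("example.jpg")  # hypothetical input image
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = encoder(pixel_values=pixels)

    # Final layer: CLS token -> post-layernorm -> visual projection
    final_embed = out.image_embeds  # shape (1, 768)

    # Last hidden layer: all 257 tokens (1 CLS + 16x16 patches),
    # each projected down to 768 dims
    tokens = encoder.vision_model.post_layernorm(out.last_hidden_state)  # (1, 257, 1024)
    hidden_embed = encoder.visual_projection(tokens)  # shape (1, 257, 768)
```

The point is that the 1x768 embedding is just the projected CLS token, whereas the 257x768 embedding keeps all the patch tokens and therefore retains spatial/low-level detail that the pooled vector throws away.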
We would have liked to likewise use a pretrained starting point for training our 257x768 diffusion prior, but no such pretrained checkpoint exists! If someone trains a 257x768 checkpoint that we can use as a starting point, it could really improve MindEye reconstructions!
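For reference, here is a rough sketch of training a standard 1x768-style prior on precomputed embeddings with lucidrains/DALLE2-pytorch; the argument names follow that repo's README and may differ across versions, and the embeddings below are random placeholders. A 257x768 prior would need the network modified to denoise all 257 tokens rather than a single pooled embedding:

```python
import torch
from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork

# Prior network sketch (hyperparameters are illustrative, not a recommendation)
prior_network = DiffusionPriorNetwork(
    dim = 768,   # CLIP ViT-L/14 embedding dimension
    depth = 6,
    dim_head = 64,
    heads = 12
)

diffusion_prior = DiffusionPrior(
    net = prior_network,
    image_embed_dim = 768,               # train directly on precomputed embeddings
    timesteps = 100,
    cond_drop_prob = 0.2,
    condition_on_text_encodings = False
)

# Hypothetical precomputed embeddings: conditioning embeds (text, or in MindEye's
# case voxel-derived) and target CLIP image embeds
text_embed  = torch.randn(4, 768)
image_embed = torch.randn(4, 768)

loss = diffusion_prior(text_embed = text_embed, image_embed = image_embed)
loss.backward()
```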
To tackle this issue you might benefit from joining the #dalle2-prior channel in the LAION Discord server.