After further thinking, we only have ~10^5 matching spectra and galaxies. We could obtain more by merging other datasets, but that immediately raises the question of how to handle diverse data sources. To try to answer that question in the self-supervised setting first, I'm going to play with DINOv2 for a couple of days: https://arxiv.org/abs/2304.07193
And I'm opening a separate repo for that work here: https://github.com/FoundationModelsForScience/AstroDino
This issue is to document all the action items needed to reach a full paper draft for this project, along with specific todos.
Dataset
Currently, about 10 TB of images from the DESI Legacy Survey (taken from this paper/repo: https://github.com/georgestein/ssl-legacysurvey) are on Rusty, along with a file cross-matching all DESI EDR galaxy spectra to the images that share the same unique id.
Compiling the data involved downloading the images through GLOBUS, then using this script to find and download the matching spectra: https://github.com/FoundationModelsForScience/AstroCLIP/blob/main/scripts/cross_match_data.py
The data is currently hosted here:
/mnt/home/flanusse/ceph
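For reference, a minimal sketch of the kind of cross-match the script above performs, assuming both catalogs carry the survey's unique identifier in a `TARGETID` column (the file names and column name here are illustrative, not the script's actual values):

```python
# Hypothetical sketch of the cross-match step; the actual logic lives in
# scripts/cross_match_data.py. File names and column name are assumptions.
from astropy.table import Table, join

# Catalog of image cutouts and catalog of DESI EDR spectra,
# both carrying the survey's unique object identifier.
images = Table.read("legacysurvey_cutouts.fits")   # hypothetical path
spectra = Table.read("desi_edr_spectra.fits")      # hypothetical path

# Inner join keeps only the objects present in both catalogs.
matched = join(images, spectra, keys="TARGETID", join_type="inner")
print(f"{len(matched)} matched image/spectrum pairs")
```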
Most of the pre-processing is already done, but a few points remain to make the data easily accessible for training.
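As one sketch of what "easily accessible for training" could look like, here is a minimal paired-modality loader, assuming the matched pairs get written to a single HDF5 file with `images` and `spectra` datasets (the file name and layout are assumptions, not a format the repo defines):

```python
# Minimal sketch of a paired-modality loader. The HDF5 layout below is an
# assumption for illustration, not the repo's actual data format.
import h5py
import torch
from torch.utils.data import Dataset

class PairedDataset(Dataset):
    def __init__(self, path="matched_pairs.h5"):  # hypothetical file
        self.path = path
        self.file = None
        with h5py.File(path, "r") as f:
            self.length = len(f["images"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open lazily so each DataLoader worker gets its own file handle.
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        image = torch.from_numpy(self.file["images"][idx]).float()
        spectrum = torch.from_numpy(self.file["spectra"][idx]).float()
        return image, spectrum
```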
Embedding Architecture
With the data in hand, we need to find good architectures to embed these two different data modalities. A convolutional embedder is fairly straightforward for both modalities, but it would be nice to also have transformer-based models and compare their efficiency.
Note: as far as I know, a transformer for galaxy spectra has never been done in astro, so there are a few non-trivial questions about what the right design choices would be.
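To make the design space concrete, here is one possible shape for a spectrum transformer: slice the 1D spectrum into fixed-length wavelength chunks and treat them as tokens, ViT-style. All sizes below are placeholder choices, not settled decisions:

```python
# Sketch of one possible transformer embedder for 1D spectra, treating
# fixed-length wavelength chunks as tokens. Sizes are placeholders.
import torch
import torch.nn as nn

class SpectrumTransformer(nn.Module):
    def __init__(self, spec_len=7680, patch=64, dim=256, depth=6, heads=8):
        super().__init__()
        assert spec_len % patch == 0
        n_tokens = spec_len // patch
        self.patch = patch
        self.embed = nn.Linear(patch, dim)                 # chunk -> token
        self.pos = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, spec):                               # (B, spec_len)
        tokens = spec.unfold(1, self.patch, self.patch)    # (B, N, patch)
        x = self.embed(tokens) + self.pos
        x = self.encoder(x)
        return x.mean(dim=1)                               # (B, dim) embedding
```

Open questions include the chunk size, whether to use a CLS token instead of mean pooling, and how to handle the spectra's noise and masking, none of which this sketch resolves.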
Contrastive Training
With embedding architectures in hand, we move on to training. There are several steps and interesting questions there.
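The baseline objective would presumably be the standard CLIP-style symmetric InfoNCE loss between the two embeddings. A sketch for reference, not necessarily the exact loss we will settle on:

```python
# Standard CLIP-style symmetric InfoNCE between image and spectrum
# embeddings of the same batch of objects.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, spec_emb, temperature=0.07):
    # Normalize so the dot product is a cosine similarity.
    img = F.normalize(img_emb, dim=-1)
    spec = F.normalize(spec_emb, dim=-1)
    logits = img @ spec.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(img), device=img.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```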
Evaluation/Application
Once models are up and running, we can think about what we want to demonstrate with this model.
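One natural demonstration is cross-modal retrieval: given a spectrum embedding, rank all image embeddings by cosine similarity and check whether the true counterpart lands in the top-k. A sketch, with the metric choice being an illustrative assumption:

```python
# Spectrum -> image retrieval: fraction of objects whose true image
# counterpart appears in the top-k most similar embeddings.
import torch
import torch.nn.functional as F

def retrieval_topk(img_emb, spec_emb, k=10):
    img = F.normalize(img_emb, dim=-1)
    spec = F.normalize(spec_emb, dim=-1)
    sims = spec @ img.t()                          # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices            # (N, k) ranked candidates
    truth = torch.arange(len(spec)).unsqueeze(-1)  # correct match per row
    return (topk == truth).any(dim=-1).float().mean().item()
```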
(Optional) Beyond CLIP
Once embeddings are trained, we can use them for downstream tasks such as conditioning LLMs on them. The steps included here are not needed for a first paper, but are worthwhile to think about.
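For a rough idea of what "conditioning an LLM" could mean here, one option is to project a galaxy embedding into the LLM's hidden size and prepend it as a single soft-prompt token. Everything in this sketch (dimensions, interface) is an assumption for illustration:

```python
# Highly speculative sketch: adapt a galaxy embedding into one soft-prompt
# token for an LLM. Dimensions and interface are assumptions.
import torch
import torch.nn as nn

class EmbeddingPrefix(nn.Module):
    def __init__(self, emb_dim=256, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(emb_dim, llm_dim)

    def forward(self, galaxy_emb, token_embeds):
        # galaxy_emb: (B, emb_dim); token_embeds: (B, T, llm_dim)
        prefix = self.proj(galaxy_emb).unsqueeze(1)        # (B, 1, llm_dim)
        return torch.cat([prefix, token_embeds], dim=1)    # fed to the LLM
```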