PolymathicAI / AstroCLIP

Multimodal contrastive pretraining for astronomical data
MIT License

Game Plan for AstroCLIP PoC paper #1

Closed EiffL closed 1 year ago

EiffL commented 1 year ago

This issue is to document all the action items needed to reach a full paper draft for this project, along with specific todos.

Dataset

Currently, about 10 TB of images from the DESI Legacy Survey (taken from this paper/repo: https://github.com/georgestein/ssl-legacysurvey) are stored on Rusty, along with a file cross-matching all galaxy spectra from the DESI EDR that share the same unique id.

Compiling the data involved downloading the images through GLOBUS, then using this script to find and download the matching spectra: https://github.com/FoundationModelsForScience/AstroCLIP/blob/main/scripts/cross_match_data.py

The data is currently hosted here: /mnt/home/flanusse/ceph
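The core of the cross-matching step is an intersection of the two catalogs on the shared unique id. A minimal sketch with numpy (the catalogs and id values here are illustrative, not the actual DESI tables):

```python
import numpy as np

# Hypothetical toy catalogs: one id array per modality. In practice
# these would come from the Legacy Survey image files and the DESI EDR
# spectra table; the values below are illustrative only.
image_ids = np.array([101, 205, 307, 412, 555])
spectrum_ids = np.array([307, 101, 999, 412])

# np.intersect1d returns the sorted common ids plus the positions of
# each match in the two input arrays, which lets us pair up the rows.
common, img_idx, spec_idx = np.intersect1d(
    image_ids, spectrum_ids, return_indices=True
)

print(common)                      # ids present in both catalogs
print(image_ids[img_idx])          # matched image rows, same order
print(spectrum_ids[spec_idx])      # matched spectrum rows, same order
```

The `return_indices=True` output is what makes this useful for training: the two index arrays align matched rows across the image and spectrum files.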

Most of the pre-processing is already done, but a few points remain to make the data easily accessible for training.

Embedding Architecture

With the data in hand, we need to find a good architecture to embed these two data modalities. A convolutional embedder is straightforward for both, but it would also be nice to have transformer-based models and compare their efficiency.

Note: To my knowledge, a transformer for galaxy spectra has not been tried in astronomy yet, so there are a few non-trivial questions about what good design choices would be there.
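One of those design choices is how to turn a 1D flux array into a token sequence at all. A minimal sketch of one option, patchifying the spectrum the way ViT patchifies images (the grid length, patch size, and model dimension below are illustrative assumptions, not the values used in this project):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spectrum: flux sampled on a fixed wavelength grid. Real DESI
# spectra are longer; 7680 samples here is purely illustrative.
spectrum = rng.normal(size=7680)

# Split the 1D flux array into fixed-length "patches" and linearly
# project each patch to the model dimension, yielding one token per
# patch, ready for a standard transformer encoder.
patch_size, d_model = 128, 64
n_patches = spectrum.shape[0] // patch_size
patches = spectrum[: n_patches * patch_size].reshape(n_patches, patch_size)

W = rng.normal(scale=patch_size ** -0.5, size=(patch_size, d_model))
tokens = patches @ W  # (n_patches, d_model) token sequence

print(tokens.shape)
```

Open questions this sketch glosses over include the patch length relative to spectral line widths, how to handle the variable noise level per pixel, and whether to encode wavelength via positional embeddings.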

Contrastive Training

With embedding architectures in hand, we move on to training. There are several steps and interesting questions there.
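The training objective itself is the standard symmetric InfoNCE loss from CLIP: matched image/spectrum pairs in a batch are positives, all other pairings are negatives. A self-contained numpy sketch (batch size, dimension, and temperature are illustrative):

```python
import numpy as np

def clip_loss(img_emb, spec_emb, temperature=0.07):
    """Symmetric InfoNCE loss on a batch of paired embeddings.

    img_emb, spec_emb: (batch, dim) arrays where row i of each array
    comes from the same galaxy.
    """
    # L2-normalise so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    spec = spec_emb / np.linalg.norm(spec_emb, axis=1, keepdims=True)

    logits = img @ spec.T / temperature  # (batch, batch) similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal (the true pair) as the target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the image->spectrum and spectrum->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
aligned = clip_loss(a, a)                         # perfectly matched pairs
random = clip_loss(a, rng.normal(size=(8, 16)))   # unrelated pairs
print(aligned, random)
```

Aligned pairs should give a much lower loss than random pairings, which is a useful sanity check before wiring the loss into a full training loop.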

Evaluation/Application

Once models are up and running, we can think about what we want to demonstrate with this model.

(Optional) Beyond CLIP

Once embeddings are trained, we can use them for downstream tasks such as conditioning LLMs on them. The steps included here are not needed for a first paper, but are worthwhile to think about.

EiffL commented 1 year ago

After further thought: we only have ~10^5 matched spectrum/image pairs. We can obtain more by merging other datasets, but that immediately raises the question of how to handle diverse data sources. To try to answer that question in the self-supervised setting first, I'm going to experiment with DINOv2 for a couple of days: https://arxiv.org/abs/2304.07193

And I'm opening a separate repo for that work here: https://github.com/FoundationModelsForScience/AstroDino