After further thinking, we only have ~10^5 matching spectra and galaxies. We could obtain more by merging other datasets, but that immediately raises the question of how to handle diverse data sources. To try to answer that question in the self-supervised setting first, I'm going to play with DINOv2 for a couple of days: https://arxiv.org/abs/2304.07193
And I'm opening a separate repo for that work here: https://github.com/FoundationModelsForScience/AstroDino
This issue is to document all the action items needed to reach a full paper draft for this project, along with specific todos.
Dataset
Currently, about 10 TB of images from the DESI Legacy Survey (taken from this paper/repo: https://github.com/georgestein/ssl-legacysurvey) are on Rusty, along with a file cross-matching all DESI EDR galaxy spectra to the images that share the same unique id.
Compiling the data involved downloading the images through GLOBUS, then using this script to find and download the matching spectra: https://github.com/FoundationModelsForScience/AstroCLIP/blob/main/scripts/cross_match_data.py
The data is currently hosted here:
/mnt/home/flanusse/ceph
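For reference, a minimal sketch of the kind of cross-match the script above performs, assuming both catalogs carry the survey's unique identifier in a `TARGETID` column (the file names and column name here are illustrative, not the script's actual values):

```python
# Hypothetical sketch of the cross-match step; the actual logic lives in
# scripts/cross_match_data.py. File names and column name are assumptions.
from astropy.table import Table, join

# Catalog of image cutouts and catalog of DESI EDR spectra,
# both carrying the survey's unique object identifier.
images = Table.read("legacysurvey_cutouts.fits")   # hypothetical path
spectra = Table.read("desi_edr_spectra.fits")      # hypothetical path

# Inner join keeps only the objects present in both catalogs.
matched = join(images, spectra, keys="TARGETID", join_type="inner")
print(f"{len(matched)} matched image/spectrum pairs")
```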
Most of the pre-processing is already done, but a few points remain to make the data easily accessible for training.
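As one sketch of what "easily accessible for training" could look like, here is a minimal paired-modality loader, assuming the matched pairs get written to a single HDF5 file with `images` and `spectra` datasets (the file name and layout are assumptions, not a format the repo defines):

```python
# Minimal sketch of a paired-modality loader. The HDF5 layout below is an
# assumption for illustration, not the repo's actual data format.
import h5py
import torch
from torch.utils.data import Dataset

class PairedDataset(Dataset):
    def __init__(self, path="matched_pairs.h5"):  # hypothetical file
        self.path = path
        self.file = None
        with h5py.File(path, "r") as f:
            self.length = len(f["images"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open lazily so each DataLoader worker gets its own file handle.
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        image = torch.from_numpy(self.file["images"][idx]).float()
        spectrum = torch.from_numpy(self.file["spectra"][idx]).float()
        return image, spectrum
```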
Embedding Architecture
With the data in hand, we need to find good architectures to embed these two different data modalities. A convolutional embedder is fairly straightforward for both modalities, but it would be nice to also have transformer-based models and compare their efficiency.
Note: as far as I know, a transformer for galaxy spectra has never been done in astro, so there are a few non-trivial questions about what the right design choices would be.
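To make the design space concrete, here is one possible shape for a spectrum transformer: slice the 1D spectrum into fixed-length wavelength chunks and treat them as tokens, ViT-style. All sizes below are placeholder choices, not settled decisions:

```python
# Sketch of one possible transformer embedder for 1D spectra, treating
# fixed-length wavelength chunks as tokens. Sizes are placeholders.
import torch
import torch.nn as nn

class SpectrumTransformer(nn.Module):
    def __init__(self, spec_len=7680, patch=64, dim=256, depth=6, heads=8):
        super().__init__()
        assert spec_len % patch == 0
        n_tokens = spec_len // patch
        self.patch = patch
        self.embed = nn.Linear(patch, dim)                 # chunk -> token
        self.pos = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, spec):                               # (B, spec_len)
        tokens = spec.unfold(1, self.patch, self.patch)    # (B, N, patch)
        x = self.embed(tokens) + self.pos
        x = self.encoder(x)
        return x.mean(dim=1)                               # (B, dim) embedding
```

Open questions include the chunk size, whether to use a CLS token instead of mean pooling, and how to handle the spectra's noise and masking, none of which this sketch resolves.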
Contrastive Training
With embedding architectures in hand, we move on to training. There are several steps and interesting questions there.
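The baseline objective would presumably be the standard CLIP-style symmetric InfoNCE loss between the two embeddings. A sketch for reference, not necessarily the exact loss we will settle on:

```python
# Standard CLIP-style symmetric InfoNCE between image and spectrum
# embeddings of the same batch of objects.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, spec_emb, temperature=0.07):
    # Normalize so the dot product is a cosine similarity.
    img = F.normalize(img_emb, dim=-1)
    spec = F.normalize(spec_emb, dim=-1)
    logits = img @ spec.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(img), device=img.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```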
Evaluation/Application
Once models are up and running, we can think about what we want to demonstrate with this model.
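One natural demonstration is cross-modal retrieval: given a spectrum embedding, rank all image embeddings by cosine similarity and check whether the true counterpart lands in the top-k. A sketch, with the metric choice being an illustrative assumption:

```python
# Spectrum -> image retrieval: fraction of objects whose true image
# counterpart appears in the top-k most similar embeddings.
import torch
import torch.nn.functional as F

def retrieval_topk(img_emb, spec_emb, k=10):
    img = F.normalize(img_emb, dim=-1)
    spec = F.normalize(spec_emb, dim=-1)
    sims = spec @ img.t()                          # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices            # (N, k) ranked candidates
    truth = torch.arange(len(spec)).unsqueeze(-1)  # correct match per row
    return (topk == truth).any(dim=-1).float().mean().item()
```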
(Optional) Beyond CLIP
Once embeddings are trained, we can use them for downstream tasks such as conditioning LLMs on them. The steps included here are not needed for a first paper, but are worthwhile to think about.
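For a rough idea of what "conditioning an LLM" could mean here, one option is to project a galaxy embedding into the LLM's hidden size and prepend it as a single soft-prompt token. Everything in this sketch (dimensions, interface) is an assumption for illustration:

```python
# Highly speculative sketch: adapt a galaxy embedding into one soft-prompt
# token for an LLM. Dimensions and interface are assumptions.
import torch
import torch.nn as nn

class EmbeddingPrefix(nn.Module):
    def __init__(self, emb_dim=256, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(emb_dim, llm_dim)

    def forward(self, galaxy_emb, token_embeds):
        # galaxy_emb: (B, emb_dim); token_embeds: (B, T, llm_dim)
        prefix = self.proj(galaxy_emb).unsqueeze(1)        # (B, 1, llm_dim)
        return torch.cat([prefix, token_embeds], dim=1)    # fed to the LLM
```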