AstroPile / FlatironMeeting2024

AstroPile meet-up at the Flatiron Institute
https://astropile.github.io/FlatironMeeting2024/
MIT License

[Baselines] [Metrics] Benchmarking embedding alignment with GPTs 🤖 and CLIPs 📎 #5

Open Smith42 opened 6 months ago

Smith42 commented 6 months ago

Benchmarking embedding alignment with GPTs 🤖 and CLIPs 📎

Contacts: Mike (S), Marc
Participants:

Goals and deliverable

  1. Develop AstroPT to the point where it can ingest and generate embeddings from Francois' & Liam's aligned galaxy image/spectrum dataset, which was used in AstroCLIP and is now in the *Pile.
    • We can likely generate embeddings autoregressively via "astro-sentences", in a similar way to Bai+2023. The galaxy/spectrum pairs will be aligned in the embedding space by being in proximity to each other at pretraining time.
  2. Once we have the embeddings, we want to define and test a good metric for how well each method does. This could be a simple downstream task like classification, or maybe some linear probe dark magick? This task could then be used as a standard benchmark in the AstroPile project.
  3. 🚀 stretch goal 🚀 Do we see a scale (in terms of trainable parameters) where AstroPT outperforms AstroCLIP?
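
For goal 2, here is a minimal sketch of what a linear probe benchmark could look like, assuming the embeddings have already been extracted as NumPy arrays. The data below is synthetic, `linear_probe_accuracy` is a hypothetical helper rather than anything in AstroPT or AstroCLIP, and the probe is a closed-form ridge fit onto one-hot labels (swapped in for the more usual logistic regression to keep dependencies down to NumPy):

```python
import numpy as np


def linear_probe_accuracy(train_x, train_y, test_x, test_y, ridge=1e-3):
    """Fit a ridge-regularised linear map from frozen embeddings to one-hot
    labels, then return top-1 accuracy on held-out embeddings."""

    def add_bias(x):
        # Append a constant column so the probe can learn an offset.
        return np.hstack([x, np.ones((len(x), 1))])

    classes = np.unique(train_y)
    # One-hot targets, shape (n_train, n_classes).
    targets = (train_y[:, None] == classes[None, :]).astype(float)
    a = add_bias(train_x)
    # Closed-form ridge solution: (A^T A + ridge * I)^-1 A^T T.
    weights = np.linalg.solve(a.T @ a + ridge * np.eye(a.shape[1]), a.T @ targets)
    predictions = classes[np.argmax(add_bias(test_x) @ weights, axis=1)]
    return float(np.mean(predictions == test_y))


# Synthetic stand-in for real embeddings: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
embeddings = rng.normal(size=(400, 16)) + 3.0 * labels[:, None]

accuracy = linear_probe_accuracy(
    embeddings[:300], labels[:300], embeddings[300:], labels[300:]
)
print(f"probe accuracy: {accuracy:.2f}")
```

The embeddings stay frozen and only the linear map is fitted, so running the same probe on AstroPT and AstroCLIP embeddings of the same galaxies would compare the representations rather than the classifier.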

Resources needed

We'd probably need a fair number of GPUs for pretraining 😄, plus some enthusiastic people to help get this working.

Rough checklist

Smith42 commented 6 months ago

I have some code that can extract embeddings from AstroPT now, and equivalent code is available in AstroCLIP. The next step is to get the code ready to ingest galaxy image/spectrum pairs from here, ETA ~a few hours 😄

Some thoughts about embedding space metrics:

[Figure: embeddings_z_64t]
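
As one illustration of an embedding-space metric (not necessarily one of the ideas meant above), cross-modal retrieval asks how often a galaxy's image embedding has its own spectrum embedding as a nearest neighbour. The sketch below assumes that row i of both arrays refers to the same galaxy. The function name and the toy data are hypothetical:

```python
import numpy as np


def top_k_retrieval_accuracy(image_emb, spectrum_emb, k=1):
    """Fraction of images whose own spectrum is among the k most
    cosine-similar spectra. Row i of each array is the same galaxy."""

    def normalise(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Cosine similarity between every image and every spectrum.
    similarity = normalise(image_emb) @ normalise(spectrum_emb).T
    # Indices of the k most similar spectra for each image.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == np.arange(len(image_emb))[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data: both "modalities" are noisy views of the same latent vector.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 32))
image_emb = shared + 0.1 * rng.normal(size=shared.shape)
spectrum_emb = shared + 0.1 * rng.normal(size=shared.shape)

aligned_acc = top_k_retrieval_accuracy(image_emb, spectrum_emb)
# Shuffling the spectra breaks the pairing and gives a chance-level baseline.
shuffled_acc = top_k_retrieval_accuracy(image_emb, spectrum_emb[rng.permutation(200)])
print("aligned top-1:", aligned_acc)
print("shuffled top-1:", shuffled_acc)
```

The shuffled baseline is worth reporting alongside the real score, since chance level depends on the number of galaxies in the test set.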

Smith42 commented 6 months ago

FYI to download Francois' AstroCLIP dataset I use the following code:

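from huggingface_hub import snapshot_download

# Download every file in the AstroCLIP dataset repo into the current directory.
snapshot_download(
    repo_id="EiffL/AstroCLIP",
    repo_type="dataset",  # a dataset repo, not a model repo
    local_dir="./",  # write the files here
    local_dir_use_symlinks=False,  # real copies rather than links into the cache
    cache_dir="/raid/data/cache",  # machine-specific path, change for your system
)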

The downloaded parquet files can then be loaded with pandas.
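
A minimal sketch of that loading step, assuming the snapshot leaves `*.parquet` files somewhere under the download directory and that a parquet engine such as pyarrow is installed. `load_parquets` is a hypothetical helper, and the dataset's column layout is not stated in this thread, so it's best to inspect `df.columns` before relying on any names:

```python
from pathlib import Path

import pandas as pd


def load_parquets(root, reader=pd.read_parquet):
    """Read every *.parquet file found under `root` into one DataFrame."""
    paths = sorted(Path(root).rglob("*.parquet"))
    if not paths:
        raise FileNotFoundError(f"no parquet files found under {root}")
    # Concatenating everything at once can need a lot of RAM for big datasets.
    return pd.concat((reader(p) for p in paths), ignore_index=True)


# Example usage, assuming the files were downloaded to the current directory:
# df = load_parquets("./")
# print(df.columns)
```

If the full dataset does not fit in memory, reading one file at a time with `pd.read_parquet` and processing it before moving on is the simpler route.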

Smith42 commented 6 months ago

Some foundation model benchmark papers that are knocking about in Earth Observation for inspo: