PolymathicAI / AstroCLIP

Multimodal contrastive pretraining for astronomical data
MIT License

Adds Implementation of Joint Image+Spectra Dataset #3

Closed EiffL closed 1 year ago

EiffL commented 1 year ago

This PR adds a Hugging Face dataset of matched images and spectra.

EiffL commented 1 year ago

It's currently working, but there is a small problem with how the training and testing sets are ordered. By default, images are sorted from brightest to faintest, and I was selecting the last objects for testing. The test set therefore contains only the faintest objects, which immediately translates to a big distribution shift between the training and testing samples.

This can be fixed by randomizing the order of the objects within each file and holding out a fraction of each file for testing.
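
A minimal sketch of that fix, not the code in this PR: shuffle the object indices of each file with a fixed seed, then hold out a fraction of them for testing. The helper `split_indices`, the file names, the file sizes, and the 20% test fraction are all hypothetical and chosen only for illustration.

```python
import random


def split_indices(n_objects, test_fraction=0.2, seed=0):
    """Shuffle the indices of one file and split them into train/test.

    Hypothetical helper: shuffling first means the held-out objects are
    drawn from the whole brightness range, not only from the faint end.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    indices = list(range(n_objects))
    rng.shuffle(indices)
    n_test = int(round(n_objects * test_fraction))
    return indices[n_test:], indices[:n_test]


# Hypothetical file sizes; the real files and their sizes are not given here.
file_sizes = {"file_a": 10, "file_b": 5, "file_c": 20}

# Apply the same procedure to every file, with one seed per file.
splits = {
    name: split_indices(n, test_fraction=0.2, seed=i)
    for i, (name, n) in enumerate(sorted(file_sizes.items()))
}

for name, (train_idx, test_idx) in splits.items():
    print(name, len(train_idx), "train /", len(test_idx), "test")
```

Because every file contributes the same fraction to the test set, the train and test splits should follow roughly the same brightness distribution.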

EiffL commented 1 year ago

The dataset can be loaded as follows:

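from datasets import load_dataset

# Load the dataset through the loading script added in this PR
dset = load_dataset('astroclip/datasets/legacy_survey.py')

# Fetch a single example from the training split
example = dset['train'][6]

EiffL commented 1 year ago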

ok, this works!