PolymathicAI / AstroCLIP

Multimodal contrastive pretraining for astronomical data
MIT License

Format galaxy images, spectra, and associated pairs as Hugging Face datasets #2

Closed EiffL closed 1 year ago

EiffL commented 1 year ago

The goal here is to make the data easily shareable and usable to train models.

The raw data is already stored in HDF format on Ceph, but I want to build a proper Hugging Face dataset for it.

The documentation for Hugging Face dataset scripts is here: https://huggingface.co/docs/datasets/dataset_script

A template is here: https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py
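
For readers unfamiliar with the template linked above: its `_generate_examples` method is a generator that yields `(key, example)` pairs, where each example is a dict of fields. Below is a minimal sketch of that generator logic for image/spectrum pairs. It is not the script from this repository; the function name `generate_examples` and the field names `image`, `spectrum`, and `object_id` are assumptions made for illustration, and the layout of the real HDF files is not described in this thread. It uses only the standard library so it runs without the data.

```python
from typing import Any, Dict, Iterator, Sequence, Tuple


def generate_examples(
    images: Sequence[Any],
    spectra: Sequence[Any],
    object_ids: Sequence[int],
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (key, example) pairs, one per galaxy image/spectrum pair.

    The inputs only need to support len() and integer indexing, so plain
    lists work here. Field names are hypothetical, not the real schema.
    """
    if not (len(images) == len(spectra) == len(object_ids)):
        raise ValueError("images, spectra and object_ids must have equal length")

    for index in range(len(object_ids)):
        # The key must be unique per example; the row index is the simplest choice.
        yield index, {
            "image": images[index],
            "spectrum": spectra[index],
            "object_id": int(object_ids[index]),
        }


# Tiny in-memory stand-in for the real data (made-up values).
examples = list(
    generate_examples(
        images=[[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]],
        spectra=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        object_ids=[101, 102],
    )
)
```

In an actual dataset script, a loop like this would presumably sit inside `_generate_examples` and read from the opened HDF file instead of in-memory lists, with the matching field types declared in `_info`.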

EiffL commented 1 year ago

I currently only have an empty file: https://github.com/FoundationModelsForScience/AstroCLIP/blob/main/astroclip/datasets/legacy_survey.py

It's VERY empty ^^' but locally I'm working on a version derived from the template above.