astronomy-commons / lsdb

Large Survey DataBase
https://lsdb.io
BSD 3-Clause "New" or "Revised" License

Hold and use nested column metadata #466

Open hombit opened 3 weeks ago

hombit commented 3 weeks ago

We are now getting initial nested-column support in the HATS/LSDB ecosystem. We already have a couple of catalogs (ZTF alerts, SDSS DR7 spectra) whose list columns represent nested data, and we could pack them into a single nested column after reading the data.

Today we can nest these list columns with code like this:

from lsdb import read_hats

raw_catalog = read_hats('https://data.lsdb.io/hats/alerce/')
catalog_with_lc = raw_catalog.nest_lists(
    base_columns=[col for col in raw_catalog.columns if not col.startswith('lc_')],
    name='lc',
)
catalog_with_nondet = catalog_with_lc.nest_lists(
    base_columns=[col for col in catalog_with_lc.columns if not col.startswith('nondet_')],
    name='nondet',
)
catalog = catalog_with_nondet.nest_lists(
    base_columns=[col for col in catalog_with_nondet.columns if not col.startswith('ref_')],
    name='ref',
)

This works, but the user experience is far from ideal. How would a user know which columns can be packed? Here it is done by name prefix, which is ugly and does not scale. And how does a user save the catalog back to its initial format when calling to_hats?
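To make the prefix convention concrete, here is a minimal sketch of the bookkeeping a user has to do by hand today. It is plain Python and does not touch LSDB. The helper name `group_columns_by_prefix` and the example column names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def group_columns_by_prefix(columns, prefixes):
    """Split column names into base columns and per-prefix groups.

    A column belongs to a group when its name starts with "<prefix>_".
    Columns matching no prefix are treated as base columns.
    """
    groups = defaultdict(list)
    base = []
    for col in columns:
        for prefix in prefixes:
            if col.startswith(prefix + "_"):
                groups[prefix].append(col)
                break
        else:
            # No prefix matched: this is a regular (non-list) column.
            base.append(col)
    return base, dict(groups)


# Hypothetical column names, loosely modeled on the example above.
columns = ["oid", "ra", "dec", "lc_mjd", "lc_mag", "nondet_mjd", "ref_rfid"]
base, groups = group_columns_by_prefix(columns, ["lc", "nondet", "ref"])
```

Each entry in `groups` lists the columns that one nest_lists call would pack together. The user still has to know the prefixes in advance, which is exactly the information that catalog metadata could carry instead.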

We can solve these issues with better nested-column support across the ecosystem: