astronomy-commons / lsdb

Large Survey DataBase
https://lsdb.io
BSD 3-Clause "New" or "Revised" License

`From_flat` catalog creation #434

Open nevencaplar opened 1 month ago

nevencaplar commented 1 month ago

Provide a function, provisionally called from_flat, that creates a catalog with nested structure from a single source table, i.e., a list of observations of astronomical objects, where the same object can appear in multiple rows.
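To make the request concrete, here is a minimal pure-Python sketch of the flat-to-nested transformation. It is not the LSDB or Nested-Dask API: `nest_flat_rows`, the column names, and the sample values are all hypothetical, and the base columns are simply taken from the first row seen for each object.

```python
def nest_flat_rows(rows, on, base_columns, name="lc"):
    """Group flat observation rows into one record per object.

    Hypothetical helper for illustration only. Each output record keeps
    the base columns (from the first row seen for that object) and
    collects every remaining column into a list stored under `name`.
    """
    nested = {}
    for row in rows:
        key = row[on]
        if key not in nested:
            record = {col: row[col] for col in base_columns}
            record[name] = []
            nested[key] = record
        # Everything that is neither the key nor a base column is per-observation data.
        observation = {
            col: value
            for col, value in row.items()
            if col != on and col not in base_columns
        }
        nested[key][name].append(observation)
    return nested


# A flat source table: object 1 is observed twice, object 2 once.
rows = [
    {"object_id": 1, "ra": 10.0, "dec": -5.0, "mjd": 59000.1, "flux": 1.2},
    {"object_id": 2, "ra": 11.5, "dec": -4.2, "mjd": 59000.2, "flux": 0.7},
    {"object_id": 1, "ra": 10.0, "dec": -5.0, "mjd": 59001.1, "flux": 1.3},
]

catalog = nest_flat_rows(rows, on="object_id", base_columns=["ra", "dec"])
```

Here `catalog[1]["lc"]` holds both observations of object 1, while `ra` and `dec` appear once per object.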

Connected with https://github.com/astronomy-commons/lsdb/issues/421.

@dougbrn Can you elaborate further or explain what I might have gotten wrong?

dougbrn commented 1 month ago

Looks good, would just add the (maybe obvious) rider that this functionality is already available within Nested-Dask: https://nested-dask.readthedocs.io/en/latest/autoapi/nested_dask/core/index.html#nested_dask.core.NestedFrame.from_flat

So this ticket would just be creating a catalog function that directly wraps/uses this.

dougbrn commented 1 month ago

@hombit do you think it would be good to follow #421 here and provide a nest_flat function within the catalog class? Or do you think this should diverge from #421 and be a catalog constructor class? Or just do both?

hombit commented 1 month ago

I believe it should be consistent with nest_lists, because these two are very close to each other.

hombit commented 3 weeks ago

Implementing from_flat poses a challenge: generating a new _healpix_29 index. In general, the original _healpix_29 differs between observations of a single object (e.g. for Zubercal, LSST DRs, etc.). This is the pipeline I'd propose (for each pixel):

  1. Concatenate the catalog and margin partitions and call NestedFrame.from_flat(df.reset_index(), on='column_name', name='lc')
  2. Now we have a nested column lc with the _healpix_29 subcolumn. First, we should split the df into "catalog" and "margin" dfs. "margin" would contain the objects whose lc._healpix_29 list-values all lie outside the partition pixel. "catalog" would include all other objects.
  3. Then, we aggregate lc._healpix_29 into a new index value. There could be different strategies, but basically, we want to select one of the _healpix_29 values to be the new index, and for the "catalog" df, it should be a _healpix_29 value within the partition. a. A possible strategy is just selecting the smallest _healpix_29 value, b. or the order-29 tile closest to the average coordinates (which we can get by converting healpix-29 to RA & Dec)
  4. Reindex both the "catalog" and "margin" dfs with this new index and construct a new Catalog object
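
Steps 2 to 4 can be sketched in plain Python, under stated assumptions: _healpix_29 is treated as a nested-scheme order-29 index, so a partition pixel at a lower order covers one contiguous range of order-29 values, and strategy (a) is used to pick the new index. The function names and the dict layout are hypothetical, not LSDB code, and the nesting from step 1 is assumed to have been done already.

```python
def pixel_range_29(order, pixel):
    """Half-open range of order-29 nested indices covered by (order, pixel)."""
    shift = 2 * (29 - order)  # each extra order splits a pixel into 4
    return pixel << shift, (pixel + 1) << shift


def split_and_reindex(objects, order, pixel):
    """Split nested objects into catalog/margin and choose a new index.

    Hypothetical sketch: `objects` is a list of dicts, each with a
    "healpix_29" list holding the per-observation index values.
    Returns two lists of (new_index, object) pairs sorted by index.
    """
    lo, hi = pixel_range_29(order, pixel)
    catalog, margin = [], []
    for obj in objects:
        values = obj["healpix_29"]
        inside = [v for v in values if lo <= v < hi]
        if inside:
            # Catalog object: smallest value that lies within the partition.
            catalog.append((min(inside), obj))
        else:
            # Margin object: every observation falls outside the partition.
            margin.append((min(values), obj))
    by_index = lambda pair: pair[0]
    return sorted(catalog, key=by_index), sorted(margin, key=by_index)


lo, hi = pixel_range_29(order=1, pixel=5)
objects = [
    {"id": "A", "healpix_29": [lo + 10, lo + 3]},  # fully inside
    {"id": "B", "healpix_29": [lo - 1, lo + 7]},   # straddles the edge
    {"id": "C", "healpix_29": [hi + 2, hi + 9]},   # fully outside
]
catalog, margin = split_and_reindex(objects, order=1, pixel=5)
```

Object B shows why the "catalog" rule differs from a plain minimum: its smallest value lies outside the partition, so the index is taken from the values inside it. Strategy (b) would replace the `min` calls with a lookup of the order-29 tile nearest the mean position.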