brendancol closed this issue 4 years ago.
Self-assigned the corresponding JIRA ticket: https://jira.lsstcorp.org/browse/DM-21335?filter=-1
As I understand it, the tables I need to precalculate in this task are the following:
1) Reorganized coadd tables (forced/unforced) will contain:
   - `metadata.yaml`
2) Reorganized visit table will contain:
Yes, those tables look right.
Will we still have separate forced/unforced tables? Will only desired columns be in the table (i.e. can we do away with metadata.yaml)?
For now, let's keep both separate forced/unforced tables. Good point about `metadata.yaml` possibly being obsolete, since all the required info could be inferred from the tables themselves. The way I see it, the pipeline task is where one defines which columns are desired, and only those columns should make it into the tables the dashboard ingests. We should keep in mind, though, that we will want to be able to add arbitrary extra columns, or computations on columns, from the original data; that will necessarily involve some reshaping of those columns, but only on demand.
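The column-selection idea above can be sketched minimally. This is an illustration only: the column names and the `DESIRED_COLUMNS` list are hypothetical stand-ins, not the actual pipeline task configuration.

```python
import pandas as pd

# Hypothetical per-table column selection; stands in for whatever the
# pipeline task config would declare (and for what metadata.yaml encoded).
DESIRED_COLUMNS = ["ra", "dec", "psfMag", "extendedness"]

raw = pd.DataFrame({
    "ra": [10.0, 10.1],
    "dec": [-5.0, -5.1],
    "psfMag": [21.2, 20.7],
    "extendedness": [0.0, 1.0],
    "internalFlag": [True, False],  # not desired; dropped below
})

# Only the desired columns make it into the table the dashboard ingests.
dashboard_table = raw[DESIRED_COLUMNS]
```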
We want to be able to do something like the following:

```python
coadd_dashboard_table = butler.get('coadd_dashboard_table', **kwargs)
ddf = coadd_dashboard_table.toDataFrame(dask=True)
```

and have `ddf` be a dask dataframe with the structure mentioned above. Correct?
That is one option. I was thinking more that `dask=True` would be one of the kwargs we would pass to `butler.get`; that way we wouldn't need the `toDataFrame` call. We would probably also want to be able to pass in a `partitions` parameter, so we could partition the table appropriately for the size of the dask cluster we have.
While I'm sure that's possible, it's a bit of a headache, as far as I understand, for `butler.get()` to take arbitrary keywords like that; it's much easier for me to edit the `ParquetTable` object. It should also be relatively easy to allow the `fastparquet` engine, to simplify the parquet-to-dask step, rather than waiting for pyarrow updates.
Got it. In that case I think `dask=True` and `partitions=n` on the `toDataFrame` call would be the best way to go.
OK, I just verified that if we use fastparquet I can retrieve the source filename, which means I could reopen the Parquet file directly with dask.