Quansight / lsst_dashboard

LSST Dashboard https://quansight.github.io/lsst_dashboard/

Add data transforms to LSST pipeline #54

Closed brendancol closed 4 years ago

dharhas commented 5 years ago
timothydmorton commented 5 years ago

Self-assigned the corresponding JIRA ticket here: https://jira.lsstcorp.org/browse/DM-21335?filter=-1

timothydmorton commented 5 years ago

As I understand it, the tables I need to precalculate in this task are the following:

1) Reorganized coadd tables (forced/unforced) will contain:

2) Reorganized visit table will contain:

dharhas commented 5 years ago

Yes, those tables look right.

Will we still have separate forced/unforced tables? Will only the desired columns be in the table (i.e. can we do away with metadata.yaml)?

timothydmorton commented 5 years ago

For now, let's keep the forced/unforced tables separate. And good point about metadata.yaml possibly being obsolete; all the required info could be inferred from the tables themselves. The way I see it, the pipeline task will be where one defines which columns are desired, and only those should make it into the tables the dashboard ingests. We should keep in mind, though, that we will want to be able to add additional arbitrary columns, or computations on columns, from the original data; this will necessarily involve some reshaping of those columns, but only on demand.
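
As a rough illustration of those on-demand column computations, assuming the reorganized table is already loadable as a dask dataframe (the file path, column names, and derived quantity below are purely made up):

import dask.dataframe as dd

# Illustrative on-demand computed column: nothing is evaluated until .compute()
ddf = dd.read_parquet('coadd_table.parq', engine='fastparquet')  # hypothetical path
ddf = ddf.assign(psf_minus_cmodel=ddf['mag_psf'] - ddf['mag_cmodel'])  # hypothetical columns
result = ddf['psf_minus_cmodel'].compute()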

timothydmorton commented 5 years ago

We want to be able to do something like the following:

coadd_dashboard_table = butler.get('coadd_dashboard_table', **kwargs)
ddf = coadd_dashboard_table.toDataFrame(dask=True)

and have ddf be a dask dataframe with the structure mentioned above. Correct?

dharhas commented 5 years ago

That is one option. I was thinking more along the lines of dask=True being one of the kwargs we pass to butler.get; that way we wouldn't need the toDataFrame call. We'd probably also want to be able to pass in a partitions parameter so we can partition the data appropriately for the size of the dask cluster we have.
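
For example, something like this, where dask and partitions are hypothetical keywords rather than arguments butler.get currently accepts:

# Hypothetical: the getter itself returns a dask dataframe, partitioned for the cluster
ddf = butler.get('coadd_dashboard_table', dask=True, partitions=16, **kwargs)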

timothydmorton commented 5 years ago

While I'm sure that's possible, it's a bit of a headache, as far as I understand, for butler.get() to take arbitrary keywords like that; it's much easier for me to edit the ParquetTable object. It should also be relatively easy to allow the fastparquet engine to simplify the parquet->dask step, rather than waiting for pyarrow updates.
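
Roughly, the change to ParquetTable could look like the following sketch; the class is heavily simplified and the _path attribute is assumed for illustration:

import dask.dataframe as dd
import pandas as pd

class ParquetTable:
    """Simplified stand-in for the LSST ParquetTable wrapper (illustrative only)."""

    def __init__(self, path):
        self._path = path  # assumed: the wrapper knows its on-disk parquet path

    def toDataFrame(self, columns=None, dask=False):
        if dask:
            # fastparquet engine covers the parquet -> dask step without waiting on pyarrow
            return dd.read_parquet(self._path, columns=columns, engine='fastparquet')
        return pd.read_parquet(self._path, columns=columns)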

dharhas commented 5 years ago

Got it. In that case I think dask=True and partitions=n on the toDataFrame call would be the best way to go.
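
That is, something like the call below, where partitions=n is a hypothetical keyword that would just repartition the resulting dask dataframe:

# Hypothetical signature: dask=True returns a dask dataframe, partitions=n repartitions it
ddf = coadd_dashboard_table.toDataFrame(dask=True, partitions=8)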

dharhas commented 5 years ago

OK, I just verified that if we use fastparquet, I can retrieve the source filename, which means I could reopen the parquet file directly with dask.
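
A minimal sketch of what that would look like, assuming the butler-returned table wraps a fastparquet.ParquetFile and exposes it (the _pf attribute and the .fn filename attribute are assumptions here):

import dask.dataframe as dd

# Recover the source filename from the underlying fastparquet object (attribute names assumed)
parquet_file = coadd_dashboard_table._pf
ddf = dd.read_parquet(parquet_file.fn, engine='fastparquet')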

dharhas commented 4 years ago

completed