brendancol closed this issue 4 years ago.
Self-assigned the corresponding JIRA ticket: https://jira.lsstcorp.org/browse/DM-21335?filter=-1
As I understand it, the tables I need to precalculate in this task are the following:
1) Reorganized coadd tables (forced/unforced) will contain:
   - `metadata.yaml`
2) Reorganized visit table will contain:
Yes, those tables look right.
Will we still have separate forced/unforced tables? Will only desired columns be in the table (i.e. can we do away with metadata.yaml)?
For now, let's keep both separate forced/unforced tables. Good point about `metadata.yaml` possibly being obsolete, since all the required info could be inferred from the tables themselves. The way I see it, the pipeline task is where one defines which columns are desired, and only those columns should make it into the tables the dashboard ingests. We should keep in mind, though, that we will want to be able to add arbitrary extra columns, or computations on columns, from the original data; that will necessarily involve some reshaping of those columns, but only on demand.
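The column-selection idea above can be sketched minimally. This is an illustration only: the column names and the `DESIRED_COLUMNS` list are hypothetical stand-ins, not the actual pipeline task configuration.

```python
import pandas as pd

# Hypothetical per-table column selection; stands in for whatever the
# pipeline task config would declare (and for what metadata.yaml encoded).
DESIRED_COLUMNS = ["ra", "dec", "psfMag", "extendedness"]

raw = pd.DataFrame({
    "ra": [10.0, 10.1],
    "dec": [-5.0, -5.1],
    "psfMag": [21.2, 20.7],
    "extendedness": [0.0, 1.0],
    "internalFlag": [True, False],  # not desired; dropped below
})

# Only the desired columns make it into the table the dashboard ingests.
dashboard_table = raw[DESIRED_COLUMNS]
```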
We want to be able to do something like the following:

```python
coadd_dashboard_table = butler.get('coadd_dashboard_table', **kwargs)
ddf = coadd_dashboard_table.toDataFrame(dask=True)
```

and have `ddf` be a dask dataframe with the structure mentioned above. Correct?
That is one option. I was thinking more that `dask=True` would be one of the kwargs we would pass to `butler.get`; that way we wouldn't need the `toDataFrame` call. We would probably also want to be able to pass in a `partitions` parameter, so we could partition the table appropriately for the size of the dask cluster we have.
While I'm sure that's possible, it's a bit of a headache, as far as I understand, for `butler.get()` to take arbitrary keywords like that; it's much easier for me to edit the `ParquetTable` object. It should also be relatively easy to allow the `fastparquet` engine, to simplify the parquet-to-dask step, rather than waiting for pyarrow updates.
Got it. In that case I think `dask=True` and `partitions=n` on the `toDataFrame` call would be the best way to go.
OK, I just verified that if we use fastparquet I can retrieve the source filename, which means I could reopen the Parquet file directly with dask.