nsevilla opened this issue 2 years ago
I want to start by laying out some context for how I think about the different aspects. GCRCatalogs has several different aspects:

GCRCatalogs.translate: translate column names, redefine column quantities (e.g., rescaling for units), and mix quantities (re-defining ellipticity conventions).
GCRCatalogs.read: read data files: HDF, Parquet, CSV, SQLite, live DBs.
GCRCatalogs.registry: the data registry. Where do files live? What is the connection info for the database?

These are not actually the names of submodules of GCRCatalogs, but they are conceptually what I want to bring out here. translate and read are often found together in a given GCRCatalogs module. E.g.,
https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/dc2_object.py#L224
defines the DC2ObjectCatalog class, which defines the metadata of the catalog and the file structure in self.__init__, defines how to translate the data (self._generate_modifiers), and defines how to read it (self._generate_datasets).
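To make the translate layer concrete, a quantity-modifier mapping in a GCR-style reader schematically looks like the following. The column names and the magnitude zero point here are illustrative, not the actual DC2 schema:

```python
import numpy as np

# Schematic quantity modifiers for a GCR-style reader (illustrative names):
# a string renames a native column, a (callable, native_col, ...) tuple
# defines a derived/re-scaled quantity, and None is a pure pass-through.
quantity_modifiers = {
    "ra": "coord_ra",                                   # simple rename
    "mag_g": (
        lambda flux: -2.5 * np.log10(flux) + 27.0,      # assumed zero point
        "g_base_PsfFlux_instFlux",                      # assumed native column
    ),                                                  # derived quantity
    "extendedness": None,                               # column already has the right name
}
```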
The registry part, meanwhile, is contained in a YAML file that specifies the dataset:
https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/catalog_configs/dc2_object_run2.2i_dr6_v2.yaml
together with a site configuration to define a root directory:
https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/site_config/site_rootdir.yaml
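For reference, the registry layer is what resolves a catalog name into that config plus the site-specific root directory. Roughly, as a sketch using APIs from recent GCRCatalogs releases (older versions may differ):

```python
import GCRCatalogs

# Root directory resolved from site_config/site_rootdir.yaml (or an override).
print(GCRCatalogs.get_root_dir())

# The parsed catalog config: reader subclass name, relative data paths, etc.
config = GCRCatalogs.get_catalog_config("dc2_object_run2.2i_dr6_v2")
print(config)
```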
One challenge for the translation layer is that in standardizing column names, one often invents a new "standard" that people have to learn. However, for Rubin/LSST this is largely taken care of for us because (a) it's not under our control, so we should just accept whatever the Project does; and (b) the schema is supposed to be standardized by the Data Products Definition Document: https://ls.st/dpdd.
(a) i. This really is a simplification of our work. No need to discuss and figure out prefix vs. suffix for "mag" and "flux" decoration. No discussion of units. We (DESC) just have to accept it. ii. Accepting this means our tests should be easily portable[*] for the Project to run. This is the point that @cwwalter and @bechtol have been discussing: making sure that the tests we develop are things that the Rubin Project can easily adopt.
(b) i. Now, not everything is in the DPDD yet, and there are cases where the exact structure (individual rows vs. an array of floats for multi-band quantities) isn't meant to be specified by the DPDD. I expect some updates to the DPDD to come along with DP0.2, or at least some de-facto updates in the sense that the DP0.2 table schema will become the new reference. ii. This aligns us with the Project, who will be just as annoyed as DESC about any inconsistencies or undocumented data structures. So the DPDD will be updated, and supporting technical documentation will exist through Project effort (likely aided by some nudging questions of clarification on our part as we use the data).
The is_dpdd option in dc2_object.py essentially defines things as an almost complete pass-through:
https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/dc2_object.py#L285
if kwargs.get('is_dpdd'):
    self._quantity_modifiers = {col: None for col in self._schema}  # every column is a pass-through
    bands = [col[-1] for col in self._schema if len(col) == 5 and col.startswith('mag_')]  # e.g. 'mag_g' -> 'g'
So the first line in the block sets every quantity modifier to None, i.e., a pure pass-through. The second line just gets a bit of metadata about which bands are available, because various internal processing might result in not having all of "ugrizy" for every data set.
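In other words, with is_dpdd the quantity names GCR exposes are just the native DPDD column names, straight from the files. Schematically (catalog name as in the YAML config above):

```python
import GCRCatalogs

# With is_dpdd set, every quantity modifier is None, so the listed quantities
# are simply the native DPDD column names (a sketch).
cat = GCRCatalogs.load_catalog("dc2_object_run2.2i_dr6_v2")
print(sorted(cat.list_all_quantities())[:10])
```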
So what's all the rest of the stuff in dc2_object.py? It's translation code from before the Project standardized its internal products to line up with the DPDD. Since dc2_object.py was written, that data standardization on the Project side has been implemented. For gory details and fun with functors, this post-processing is run by a pipe_tasks routine, and the translations are defined in YAML configs:
https://github.com/lsst/pipe_tasks/blob/main/schemas/Object.yaml
(with some survey-specific custom overrides in obs_ packages).
The YAML configs refer to functors defined in:
https://github.com/lsst/pipe_tasks/blob/main/python/lsst/pipe/tasks/functors.py#L80
So tests written in DESCQA on data sets that have been processed all the way through (including the post-processing) will already be using column names and a schema that follow the DPDD.
Tests on other datasets (most notably simulations, internal DESC products, and value-added catalogs) involve data that don't have the standardized schema Rubin is using, so some clear definition of the schema and a translation layer will be necessary for such things.
This was one of the motivating development cases for GCRCatalogs: standardize access patterns across different datasets with different column names, schemas, and definitions.
It's also the case that if we write tests that compare different datasets (DES, HSC SSP, LSST), we will have the same translation problem. We are LSST DESC, so I suggest that we standardize on the LSST DPDD and adapt the reading of other data to that.
So I maintain that the availability of a translation layer will always be a part of DESC SRV efforts. We also already have an example of a no-op translation layer for Object Table data in the LSST DPDD format.
Is a translation layer required? Or what does "easily portable" mean?
Exec Summary[*]: With some care, the "easily" part should be achievable. In parallel with the efforts to run DESCQA through GCRCatalogs on both the current DC2 Run 2.2i data and the imminent DP0.2 data, a separate effort can be made to make tests easily portable.
[*] Because Comment 8 in a thread is definitely where you should be looking for an Executive Summary.
For data sets that are trivial to read from disk/DB and that fit in memory, either of the following approaches to reading the data:
A.
import GCRCatalogs

catalog_instance = GCRCatalogs.load_catalog("dc2_object_run2.2i_dr6")
df = catalog_instance.get_quantities(["ra", "dec", "mag_g", "mag_r"])
B.
import pandas as pd

data_pathname = "/global/cfs/cdirs/lsst/shared/DC2-prod/Run2.2i/dpdd/Run2.2i-dr6-v2/object_table_summary"
df = pd.read_parquet(data_pathname)
allows the following code to work unmodified:
import matplotlib.pyplot as plt

def my_test(df):
    print(df["mag_g"].mean())
    plt.hist(df["mag_g"])
    plt.scatter(df["ra"], df["dec"])
    plt.scatter(df["mag_g"], df["mag_r"])
So the API for accessing the data object df to make plots and calculations is the same. If we can write our tests such that they expect such an object, then the same test can work with data read through either A or B.
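(In detail, get_quantities in A returns a dict of arrays rather than a true DataFrame, but both support the same df["col"] access, which is all my_test uses. If actual DataFrame methods are needed, the conversion is a one-liner; a sketch, assuming the selected columns fit in memory:)

```python
import pandas as pd

# Wrap the dict of arrays from option A into a real DataFrame when needed.
data = catalog_instance.get_quantities(["ra", "dec", "mag_g", "mag_r"])
df = pd.DataFrame(data)   # assumes the selected columns fit in memory
my_test(df)
```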
But we have data that don't fit in memory, and large computations that we would like to parallelize. So can we build this abstraction successfully while still allowing for acceptable performance?
E.g., if I knew my data didn't fit in memory, I might choose to get an iterator from GCRCatalogs:
df = catalog_instance.get_quantities(["ra", "dec", "mag_g", "mag_r"], return_iterator=True)
my_test(df)
or use Dask Distributed:
import dask.dataframe as dd

# [...] Start your Dask cluster.
columns = ["ra", "dec", "mag_g", "mag_r"]
df = dd.read_parquet(data_pathname, columns=columns, engine='pyarrow')
my_test(df)
Something like this should work, but we need to test and see when it works and when it doesn't.
There's also a structural issue when processing out-of-memory datasets. You really want to iterate through the data only once, so you want to do all the relevant processing of a given column of data while it's there in memory. E.g., my_test itself does three different, not clearly related things. You would probably write these as different tests. But if that means you read the data three times, then some level of deviation from pure test separation is likely worth it for performance.
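As an illustration of that single-pass structure, here is a sketch of a chunked version of my_test that accumulates all three results in one iteration. Each chunk is assumed to support chunk["col"] access (a dict of arrays or a DataFrame), and the magnitude binning is made up:

```python
import numpy as np
import matplotlib.pyplot as plt

def my_test_chunked(chunks):
    # All three calculations/plots from my_test, done in a single pass over an
    # iterator of chunks.
    total, count = 0.0, 0
    bins = np.linspace(14, 28, 57)               # assumed magnitude range/binning
    hist = np.zeros(len(bins) - 1)
    fig, (ax_sky, ax_colmag) = plt.subplots(1, 2, figsize=(10, 4))
    for chunk in chunks:
        mag_g = np.asarray(chunk["mag_g"])
        total += np.nansum(mag_g)
        count += int(np.isfinite(mag_g).sum())
        hist += np.histogram(mag_g, bins=bins)[0]
        ax_sky.scatter(chunk["ra"], chunk["dec"], s=1)
        ax_colmag.scatter(mag_g, chunk["mag_r"], s=1)
    print(total / count)                         # mean mag_g over all chunks
    plt.figure()
    plt.step(bins[:-1], hist, where="post")      # accumulated mag_g histogram

# e.g. with the GCRCatalogs iterator from above:
# my_test_chunked(catalog_instance.get_quantities(columns, return_iterator=True))
```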
Right now many of the DESCQA tests have catalog_instance.get_quantities reading logic within a loop that does a bunch of different things, likely in part for this reason. But I am also optimistic that many of the tests can be minimally restructured so that the key functions accept either the dataframe-ish output of GCRCatalogs get_quantities or Pandas/Dask-generated dataframe objects.
[*] Because Comment 8 in a thread is definitely where you should be looking for an Executive Summary.
😂
Thanks @wmwv for the detailed analysis. This is very helpful for guiding our choices. I generally agree with your analysis and assessment, maybe with one exception. Near the end you said:
But I am also optimistic that many of the tests can be minimally restructured so that the key functions accept either the dataframe-ish output of GCRCatalogs get_quantities or Pandas/Dask-generated dataframe objects.
I am not quite sure how we can minimally change the function so that it accepts both a GCRCatalogs iterator and a pandas/dask dataframe. The test would presumably need very different code for these two cases, and I think that's the main issue we are facing. Do you have an example of what you have in mind for what the test would look like? For example, in your earlier comment, my_test can somehow take in both a GCRCatalogs iterator and a pandas/dask dataframe. Does the code in my_test just have an if statement that switches to different ways of processing, or do you have some other ideas in mind?
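For concreteness, the kind of branching I am asking about might look something like this (purely hypothetical, just to illustrate the question):

```python
import pandas as pd

# Hypothetical sketch of the "if statement" option: normalize the input once,
# so the body of the test only ever sees chunks supporting chunk["col"] access.
def iter_chunks(data):
    if isinstance(data, (pd.DataFrame, dict)):   # in-memory table or dict of arrays
        yield data
    elif hasattr(data, "to_delayed"):            # dask DataFrame: one partition at a time
        for part in data.to_delayed():
            yield part.compute()
    else:                                        # assume a GCRCatalogs chunk iterator
        yield from data

def my_test(data):
    for chunk in iter_chunks(data):
        ...  # per-chunk calculations and plots
```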
Thanks @wmwv. Regarding the translation aspect: I have in the past appreciated the simplification that the DPDD imposes on us in case we had to build an alternative to GCRCatalogs for whatever reason. But it is true that if we wanted to include new data sets in SRV efforts, the extra work would be immense, either duplicating GCRCatalogs or forcing us to fall back to it again. Also, we may have to deal with 2+ DPDD versions if we want to compare releases.
A summary of viewpoints from @cwwalter:
I have no opinion on your use of GCR for the srv tests, but my request going forward is that you don't actually build our basic variable definitions etc. into the abstraction interface, so that analyses happening outside of this group can use standard tools to work directly with the underlying files. I put a more concrete idea on one approach in the Slack thread in desc-srv.
If you would like the capability of tests to be swapped in and out and used with those that are also used by the commissioning groups, then relying on DESC-specific dependencies may be an issue, as all members of commissioning commit to only using project tools for their software. But, as Nacho notes above, these tests might be of a different scope, in which case it wouldn't be an issue at all. On the other hand, if you find that you are interested in some of the tests we develop during commissioning to look at data quality, or (I suppose it is possible) for early srv tests to make it to later commissioning (not positive about the timing given the data preview release schedule), then considering using the same project framework might be worthwhile.
From @joezuntz:
One thing that I've really learned from the TXPipe experience is the importance of being able to process data in chunks (and ideally in parallel) rather than all at once. If you've tried to play with SkySim 5000 data then you'll have seen this - it just isn't feasible to load more than a column or two!
Some options:
While it's initially attractive, and lets you write algorithms in the most numpy-like way possible, for me Dask ultimately requires as much specialized code (and coding effort) as using a translation layer. It also has the added challenge of being much more opaque and difficult to optimize.
So I would advocate for making a decision now to retain the GCR translation layer when building tests, and require them to use that. (As Michael noted in the Slack conversation and above, this is orthogonal to the question of the underlying file format.) There are, though, a few things that can be done to make it easier for newcomers to adapt tests to this form, and write new ones:
In the May 19th SRV telecon we started a discussion, prompted by @yymao, on settling what the abstract connection layer between the coded tests and the data should be. Currently, as the test framework is being developed with DESCQA as a baseline, this is done through GCRCatalogs, but other options may be desirable, and this was reflected in the desc-srv Slack channel.
What should be the baseline abstract layer to connect to data for tests being developed in the Test Framework?
(I will add some of the comments below later if nobody else does; feel free to add them.)