LSSTDESC / SRV-planning

Repository to plan and coordinate some of the Science Release and Validation Working Group tasks

[RFC] Abstraction layer for the tests implemented in the SRV test framework #14

Open nsevilla opened 2 years ago

nsevilla commented 2 years ago

In the May 19th SRV telecon, we started a discussion, prompted by @yymao, on settling what the abstraction layer between the coded tests and the data should be. Currently, as the test framework is being developed with DESCQA as a baseline, this is done through GCRCatalogs, but other options may be desirable, as was reflected in the desc-srv Slack channel.

What should be the baseline abstraction layer connecting the data to the tests being developed in the Test Framework?

(I will add some of the comments below later if nobody does, feel free to add them)

wmwv commented 2 years ago

I want to start by laying out some context for how I think about the different aspects:

GCRCatalogs has several different aspects:

  1. GCRCatalogs.translate: translate column names, redefine column quantities (e.g., scaling for units), and combine quantities (e.g., re-defining ellipticity conventions)
  2. GCRCatalogs.read: read data files: HDF, Parquet, CSV, SQLite, live DBs
  3. GCRCatalogs.registry: the data registry. Where do the files live? What is the connection info for the database?

These are not actually the names of submodules of GCRCatalogs, but conceptually that is what I want to bring out here. translate and read are often found together in a given GCRCatalogs module. E.g., https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/dc2_object.py#L224 defines the DC2ObjectCatalog class, which defines the metadata of the catalog and the file structure in self.__init__, how to translate data (self._generate_modifiers), and how to read it (self._generate_datasets).

The registry part is contained in a YAML file that specifies the dataset, https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/catalog_configs/dc2_object_run2.2i_dr6_v2.yaml, together with a site configuration that defines a root directory: https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/site_config/site_rootdir.yaml
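
To make the three-way split concrete, here is a rough, purely illustrative sketch of how the translate/read/registry responsibilities could be separated (these function names and the whole layout are invented for illustration; they are not the actual GCRCatalogs internals):

# Illustrative only: GCRCatalogs does not expose translate/read/registry
# as separate submodules; this just mirrors the conceptual split above.
import yaml
import pandas as pd

def load_registry_entry(config_path):
    """Registry: where the files live and how to reach them (YAML config)."""
    with open(config_path) as f:
        return yaml.safe_load(f)

def read_native(parquet_path, columns=None):
    """Read: pull native-schema data off disk."""
    return pd.read_parquet(parquet_path, columns=columns)

def translate(native_df, quantity_modifiers):
    """Translate: rename or re-derive columns; None means pass-through."""
    out = {}
    for name, modifier in quantity_modifiers.items():
        if modifier is None:
            out[name] = native_df[name]
        elif callable(modifier):
            out[name] = modifier(native_df)
        else:  # simple rename: modifier is the native column name
            out[name] = native_df[modifier]
    return pd.DataFrame(out)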

wmwv commented 2 years ago

Section I. Translation

One challenge for the translation layer is that, in standardizing column names, one often invents a new "standard" that people then have to learn. However, for Rubin/LSST this is largely taken care of for us because (a) it's not under our control, so we should just accept whatever the Project does; and (b) the schema is supposed to be standardized by the Data Products Definition Document: https://ls.st/dpdd.

(a) i. This really is a simplification in our work. No need to discuss and figure out prefix vs. suffix for "mag" and "flux" decoration. No discussion of units. We (DESC) just have to accept it. ii. Accepting this means our tests should be easily portable[*] for the Project to run. This is the point that @cwwalter and @bechtol have been making: we should make sure that the tests we develop are things that the Rubin Project can easily adopt.

(b) i. Not everything is in the DPDD yet, and there are cases where the exact structure (individual rows vs. an array of floats for multi-band quantities) isn't meant to be specified by the DPDD. I expect some updates to the DPDD to come along with DP0.2, or at least some de-facto updates in the sense that the DP0.2 table schema will become the new reference. ii. This aligns us with the Project, who will be just as annoyed as DESC about any inconsistencies or undocumented data structures. So the DPDD will be updated, and supporting technical documentation will exist through Project effort (likely aided by some nudging questions of clarification on our part as we use the data).

wmwv commented 2 years ago
  1. For Rubin/LSST data products, we should follow the DPDD naming conventions in the exposed column names of any translation layer. We should write tests assuming the DPDD-ish names. The is_dpdd option in dc2_object.py essentially defines things as an almost complete pass-through:

https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/dc2_object.py#L285

        if kwargs.get('is_dpdd'):
            self._quantity_modifiers = {col: None for col in self._schema}
            bands = [col[-1] for col in self._schema if len(col) == 5 and col.startswith('mag_')]

So the first line in the block sets all the quantity modifiers to None, i.e., pass-through. The second line just gathers a bit of metadata about which bands are available, because various internal processing steps might result in not having all of "ugrizy" for every data set.
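
A toy illustration of what this block produces (the schema here is made up; the real one comes from the data files):

# Made-up schema, for illustration only.
schema = ["ra", "dec", "mag_g", "mag_r", "magerr_g", "magerr_r"]

quantity_modifiers = {col: None for col in schema}  # None => pass through unchanged
bands = [col[-1] for col in schema if len(col) == 5 and col.startswith("mag_")]

print(quantity_modifiers["mag_g"])  # None, i.e. no translation applied
print(bands)                        # ['g', 'r']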

wmwv commented 2 years ago

So what's all the rest of the stuff in dc2_object.py? It's translation code from before the time when the Project standardized the internal products to line up with the DPDD. Since dc2_object.py was written, the data standardization has been implemented on the Project side. For gory details and fun with functors, this is the pipe_tasks routine that runs this post-processing:

https://github.com/lsst/pipe_tasks/blob/78081a9382b1dbd9e56d0b42df177eb38194fb43/python/lsst/pipe/tasks/postprocess.py

and the translations are defined in YAML configs:

https://github.com/lsst/pipe_tasks/blob/main/schemas/Object.yaml

(with some survey-specific custom overrides in obs_ packages.)

The YAML configs refer to functors defined in:

https://github.com/lsst/pipe_tasks/blob/main/python/lsst/pipe/tasks/functors.py#L80
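
For intuition only, the functor pattern boils down to something like the sketch below. This is a hand-rolled toy, not the real lsst.pipe.tasks.functors code, and the native column name g_psfFlux and the nJy flux assumption are illustrative:

import numpy as np
import pandas as pd

class MagFunctor:
    """Toy functor: declares which native columns it needs and how to turn
    them into one DPDD-style column."""
    def __init__(self, flux_col):
        self.flux_col = flux_col
        self.columns = [flux_col]  # native columns this functor reads

    def __call__(self, df):
        # Assumes fluxes in nJy; 31.4 is the corresponding AB zero point.
        return -2.5 * np.log10(df[self.flux_col]) + 31.4

# Conceptually, a YAML entry mapping "mag_g" to a Mag functor becomes:
functors = {"mag_g": MagFunctor("g_psfFlux")}
native = pd.DataFrame({"g_psfFlux": [1.0e3, 1.0e4]})
dpdd_like = pd.DataFrame({name: f(native) for name, f in functors.items()})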

wmwv commented 2 years ago

So tests written in DESCQA on data sets that have been processed all the way through (including the post-processing) will already be using column names and a schema that match the DPDD.

wmwv commented 2 years ago

Tests on other datasets, most notably simulations, internal DESC products, and value-added catalogs, don't have a standardized schema maintained by Rubin. So a clear definition of the schema and a translation layer will be necessary for such things.

This was one of the motivating development cases for GCRCatalogs: standardize access patterns across different datasets with different column names, schemas, and definitions.

It's also the case that if we write tests that compare different datasets (DES, HSC SSP, LSST), we will have the same translation problem. We are LSST DESC, so I suggest that we standardize on the LSST DPDD and adapt the reading of other data to that.
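
As a concrete (and heavily simplified) example of what "adapt the reading of other data" could mean in practice; the DES-style column names here are only illustrative:

# Illustrative mapping from survey-specific column names to DPDD-style names,
# so the same test code can run on either catalog.
DES_TO_DPDD = {
    "RA": "ra",
    "DEC": "dec",
    "MAG_AUTO_G": "mag_g",
    "MAG_AUTO_R": "mag_r",
}

def to_dpdd(df, mapping=DES_TO_DPDD):
    """Return df with DPDD-style column names, keeping only mapped columns."""
    return df.rename(columns=mapping)[list(mapping.values())]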

wmwv commented 2 years ago

So I maintain that the availability of a translation layer will always be a part of DESC SRV efforts. We also already have an example of a no-op translation layer for Object Table data in the LSST DPDD format.

wmwv commented 2 years ago

Section II. Accessing Data (API)

Is a translation layer required? Or what does "easily portable" mean?

Exec Summary[*]: With some care, the "easily" part should be achievable. In parallel with the efforts to run DESCQA through GCRCatalogs on both current DC2 Run 2.2i data and on imminent DP0.2 data, a separate effort can be made to make Tests easily portable.

[*] Because Comment 8 in a thread is definitely where you should be looking for an Executive Summary.

wmwv commented 2 years ago

For data sets that are trivial to read from disk/DB and that fit in memory, either of the following approaches to reading the data:

A.

catalog_instance = GCRCatalogs.load_catalog("dc2_object_run2.2i_dr6")
df = catalog_instance.get_quantities(["ra", "dec", "mag_g", "mag_r"])

B.

data_pathname = "/global/cfs/cdirs/lsst/shared/DC2-prod/Run2.2i/dpdd/Run2.2i-dr6-v2/object_table_summary"
df = pd.read_parquet(data_pathname)

allows the following code to work unmodified:

def my_test(df):
    print(df["mag_g"].mean())
    plt.hist(df["mag_g"])
    plt.scatter(df["ra"], df["dec"])
    plt.scatter(df["mag_g"], df["mag"r"])

So the API for accessing the data object df to make plots and calculations is the same. If we can write our tests such that they expect such an object, then the same test can work with data read through either A or B.
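
One wrinkle: if I have the API right, get_quantities in approach A returns a dict of arrays rather than a pandas DataFrame. That dict already supports the df["col"] access pattern used in my_test, and if a genuine DataFrame is wanted, the wrap is one line:

import pandas as pd
import GCRCatalogs

catalog_instance = GCRCatalogs.load_catalog("dc2_object_run2.2i_dr6")
data = catalog_instance.get_quantities(["ra", "dec", "mag_g", "mag_r"])
df = pd.DataFrame(data)  # now identical in interface to approach B
my_test(df)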

wmwv commented 2 years ago

Performance

But we have data that don't fit in memory, and large computations that we would like to parallelize. So can we build this abstraction successfully while still allowing for acceptable performance?

E.g., if I knew my data didn't fit in memory, I might choose to get an iterator from GCRCatalogs:

df = catalog_instance.get_quantities(["ra", "dec", "mag_g", "mag_r"], return_iterator=True)
my_test(df)

or use Dask Distributed

import dask.dataframe as dd
# [...] Start your Dask cluster.

df = dd.read_parquet(data_path, columns=columns, engine='pyarrow')
my_test(df)

Something like this should work, but we need to test it and see when it works and when it doesn't.
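
For the iterator case, the test body has to be written so it can accumulate its result chunk by chunk. A minimal sketch of that pattern (assuming, as above, that return_iterator=True yields dict-of-array chunks):

import numpy as np

def my_test_chunked(chunks, nbins=50, mag_range=(15.0, 28.0)):
    """Accumulate a g-band magnitude histogram over an iterator of chunks."""
    edges = np.linspace(mag_range[0], mag_range[1], nbins + 1)
    hist = np.zeros(nbins)
    total = 0
    for chunk in chunks:  # each chunk fits in memory
        mag_g = np.asarray(chunk["mag_g"])
        hist += np.histogram(mag_g, bins=edges)[0]
        total += len(mag_g)
    print(f"{total} objects; mean per-bin count {hist.mean():.1f}")
    return hist, edges

# chunks = catalog_instance.get_quantities(["mag_g"], return_iterator=True)
# my_test_chunked(chunks)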

wmwv commented 2 years ago

There's also a structural issue when processing out-of-memory datasets. You really want to iterate through the data only once, so you want to do all the relevant processing of a given column while it's in memory. E.g., my_test itself does three different, not clearly related things. You would probably write these as separate tests. But if that means you read the data three times, then some deviation from pure test separation is likely worth it for performance.
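
One way to keep the tests logically separate while only reading the data once is to express each test as an accumulator and drive them all from a single loop over chunks; a rough sketch of that pattern:

# Sketch: each "test" only defines how to consume one in-memory chunk and
# how to finalize; a single loop over the data feeds all of them.
class MeanMagG:
    def __init__(self):
        self.total, self.n = 0.0, 0
    def update(self, chunk):
        self.total += chunk["mag_g"].sum()
        self.n += len(chunk["mag_g"])
    def finalize(self):
        print("mean mag_g:", self.total / self.n)

class ObjectCount:
    def __init__(self):
        self.n = 0
    def update(self, chunk):
        self.n += len(chunk["ra"])
    def finalize(self):
        print("n objects:", self.n)

def run_tests(chunks, tests):
    for chunk in chunks:  # single pass over the data
        for test in tests:
            test.update(chunk)
    for test in tests:
        test.finalize()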

wmwv commented 2 years ago

Right now many of the DESCQA tests have some catalog_instance.get_quantities reading logic inside a loop that does a bunch of different things, likely in part for this reason. But I am also optimistic that many of the tests can be minimally restructured to accept either the dict-of-arrays objects returned by GCRCatalogs get_quantities or Pandas/Dask-generated dataframe objects to pass to the key functions.
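
For the in-memory case, the kind of minimal restructuring I have in mind is just to normalize whatever gets passed in before the science code runs (a sketch only; the iterator case is a separate question):

import pandas as pd

def as_dataframe(data):
    """Accept a GCR get_quantities dict or a pandas/Dask dataframe and hand
    back something the test body can treat uniformly."""
    if isinstance(data, dict):
        return pd.DataFrame(data)  # GCR dict-of-arrays case
    return data  # already a pandas (or Dask) dataframe

def my_test(data):
    df = as_dataframe(data)
    # Note: with a Dask dataframe this mean is lazy and still needs .compute().
    print(df["mag_g"].mean())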

cwwalter commented 2 years ago

[*] Because Comment 8 in a thread is definitely where you should be looking for an Executive Summary.

😂

yymao commented 2 years ago

Thanks @wmwv for the detailed analysis. This is very helpful for guiding our choices. I generally agree with your analysis and assessment, maybe with one exception. Near the end you said:

But I am also optimistic that many of the tests can be minimally restructured to accept either the dict-of-arrays objects returned by GCRCatalogs get_quantities or Pandas/Dask-generated dataframe objects to pass to the key functions.

I am not quite sure how we can minimally change the function so that it accepts both a GCRCatalogs iterator and a pandas/dask dataframe. The test would presumably need very different code for these two cases, and I think that's the main issue we are facing. Do you have an example in mind of what such a test would look like? For example, in your earlier comment, my_test can somehow take in both a GCRCatalogs iterator and a pandas/dask dataframe. Does the code in my_test just have an if statement that switches to a different way of processing, or do you have some other ideas in mind?
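
To make the question concrete, a naive explicit-branching version might look like the sketch below (the mag_g column name is just carried over from your example); presumably that is exactly the pattern we would not want to repeat inside every test.

import numpy as np

def my_test(data):
    if hasattr(data, "__next__"):  # GCRCatalogs iterator of chunks
        mag_g = np.concatenate([np.asarray(chunk["mag_g"]) for chunk in data])
    else:                          # pandas (or lazy Dask) dataframe
        mag_g = data["mag_g"]
    print(np.mean(mag_g))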

nsevilla commented 2 years ago

Thanks @wmwv. Regarding the translation aspect: I have in the past appreciated the simplification that the DPDD imposes on us, in case we had to build an alternative to GCRCatalogs for whatever reason. But it is true that if we wanted to include new data sets in SRV efforts, the extra work would be immense, duplicating GCRCatalogs or having to fall back to it again. Also, we may have to deal with two or more DPDD versions if we want to compare releases.

nsevilla commented 2 years ago

A summary of viewpoints from @cwwalter:

nsevilla commented 2 years ago

From @joezuntz:

One thing that I've really learned from the TXPipe experience is the importance of being able to process data in chunks (and ideally in parallel) rather than all at once. If you've tried to play with SkySim 5000 data then you'll have seen this - it just isn't feasible to load more than a column or two!

Some options:

While it's initially attractive, and lets you write algorithms in the most numpy-like way possible, for me Dask ultimately requires as much specialized code (and coding effort) as using a translation layer. It also has the added challenge of being much more opaque and difficult to optimize.

So I would advocate for making a decision now to retain the GCR translation layer when building tests, and require them to use that. (As Michael noted in the Slack conversation and above, this is orthogonal to the question of the underlying file format.) There are, though, a few things that can be done to make it easier for newcomers to adapt tests to this form, and write new ones:

  1. make it really simple to turn a set of flat arrays into a GCR Catalog on the fly. So if I have already loaded (e.g., from a precursor data set) or simulated my ra, dec, and fluxes, then it should be easy to collect them into an in-memory catalog that I can pass to my test to make sure it works (see the sketch after this list).
  2. write clear documentation specifically tailored to writing tests like these, explaining why we need GCR and how to use it in testing.
  3. possibly, provide common (parallel?) functionality for frequent patterns like "taking the mean of a catalog column", "making a histogram of one quantity" or "making a scatter plot of two quantities".
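
On point 1, a sketch of what an in-memory catalog could look like (purely illustrative; it just mimics the get_quantities interface rather than using the real GCR base class):

# Illustrative in-memory "catalog": wraps flat arrays and mimics the
# get_quantities interface, without the real GCR machinery.
class InMemoryCatalog:
    def __init__(self, **columns):
        self._columns = columns

    def list_all_quantities(self):
        return list(self._columns)

    def get_quantities(self, quantities, return_iterator=False):
        data = {q: self._columns[q] for q in quantities}
        return iter([data]) if return_iterator else data

# e.g. with simulated ra/dec/mags already in memory:
# cat = InMemoryCatalog(ra=ra, dec=dec, mag_g=mag_g)
# my_test(cat.get_quantities(["ra", "dec", "mag_g"]))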