catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

design revamp of harvesting process #746

Closed: cmgosnell closed this issue 3 years ago

cmgosnell commented 3 years ago
ezwelty commented 3 years ago

@cmgosnell I leave tomorrow morning through September 20, so I guess I'll draft out a plan once I'm back. In addition to illustrative example(s) of desired input and output tables, here are questions that came to mind during my early pondering:

cmgosnell commented 3 years ago

hi @ezwelty!! sorry I didn't get this to you before you skedaddled but I finally generated examples. I tried to give a few concrete, standard examples, but I could certainly add more if that would be helpful.

I'll respond to these questions shortly. Have fun out there wherever you are going!

cmgosnell commented 3 years ago

great questions!

What level of error reporting is desired when a field constraint fails? Is it sufficient to report the field name and constraint that failed, or is it also desirable to report the unique field values that failed (when applicable)?

Do you mean a specific field/value failing (e.g. the plant_name for plant_id_eia == 3 isn't consistent enough) or a whole column having too many non-harvestable values (e.g. the plant_name column has more than 10% nulls after the harvesting process)? I am going to assume it is the latter because I don't think we want/need to report the individual values when they fail.

On a normal ETL run, it should be sufficient to report that a column has failed its constraints. Nonetheless, it is important for debugging to have a way to jump into the fray to determine how things failed.
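
As a sketch of what that debugging hook might look like (the function, names, and threshold here are all hypothetical):

```python
import pandas as pd

# Hypothetical sketch: for each entity id, measure what fraction of
# records agree with the most common value of a column, then report
# the column (and the worst offenders) if overall consistency is low.
def check_column_consistency(
    df: pd.DataFrame, id_col: str, value_col: str, threshold: float = 0.9
) -> pd.Series:
    def most_common_share(values: pd.Series) -> float:
        values = values.dropna()
        if values.empty:
            return 0.0
        return values.value_counts(normalize=True).iloc[0]

    shares = df.groupby(id_col)[value_col].apply(most_common_share)
    if shares.mean() < threshold:
        # Keep the per-entity detail around so a debugging session can
        # jump straight to the values that failed.
        worst = shares.sort_values().head()
        raise ValueError(
            f"{value_col!r} failed its consistency constraint "
            f"({shares.mean():.1%} < {threshold:.0%}); worst ids:\n{worst}"
        )
    return shares
```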

Do multi-column constraints need to be supported? Can we recast two-column locations (e.g. lat, lon) to single geometry columns?

We don't currently have multi-column constraints. Nonetheless, if we can do it without too many back-flips it would be nice to be able to do this for lat/long. I don't think there are any other columns which would fall into this bucket. So squishing lat/long together as a geometry and checking for spatial closeness instead of consistency seems like the right way to go.

Does an entity identifier need to be created, or is an existing field always used as such? In other words, is it [ *a | b ] + [ *a | c ] = [ *a ] entities + [ *a | b | c ] data, or [ *a | b ] + [ *a | c ] = [ *id | a ] entities + [ *id | b | c ] data? (where * denotes primary key columns in the output tables)

All entity identifiers are grabbed from existing fields. Sometimes those identifiers are composite primary key columns.

Can the normalization groupings be determined from the output tables and their primary keys? For example:

[ a | b | c ] -> [ *a | b ] entities + [ *a | c ] data (c must be normalized by a)
[ a | b | c ] -> [ *a | b ] entities + [ *a | *b | c ] data (c must be normalized by a, b)

The normalization grouping should be able to be determined based on the PKs of the final tables. Both the primary keys (to find the entity identifier columns) and which data columns need to be harvested should be determinable from the metadata for the output tables. If these two examples are options 1 and 2, it looks like the second example is the way to go.
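
As a sketch of that metadata-driven approach (the metadata shape and prime_mover_code are hypothetical; the other names come from the examples in this thread):

```python
# Hypothetical metadata for two output tables; only the shape matters.
output_tables = {
    "generators_entity_eia": {
        "primary_key": ["plant_id_eia", "generator_id"],
        "columns": ["plant_id_eia", "generator_id", "prime_mover_code"],
    },
    "generators_eia860": {
        "primary_key": ["plant_id_eia", "generator_id", "report_date"],
        "columns": ["plant_id_eia", "generator_id", "report_date", "capacity_mw"],
    },
}

# Each non-PK column is normalized (harvested) by its table's PK,
# so the groupings fall straight out of the metadata.
for name, meta in output_tables.items():
    data_cols = [c for c in meta["columns"] if c not in meta["primary_key"]]
    print(f"{name}: harvest {data_cols} grouped by {meta['primary_key']}")
```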

Are some primary keys spread across multiple input tables? For example, can there be: [ (a1, b1) ] + [ (a2, c1) ] = [ (a1, ), (a2, ) ] entities + [ (a1, b1, null), (a2, null, c1) ] data where '(...)' denotes a table row.

No. A composite primary key should always come from an input table in which all of its pk columns exist. There are tables which have plant_id_eia and others which have both plant_id_eia and generator_id. If a plant id doesn't show up associated with a generator, then it shouldn't end up in the generator entity table.

ezwelty commented 3 years ago

@cmgosnell I'm back!

Do multi-column constraints need to be supported? Can we recast two-column locations (e.g. lat, lon) to single geometry columns?

We don't currently have multi-column constraints. Nonetheless, if we can do it without too many back-flips it would be nice to be able to do this for lat/long. I don't think there are any other columns which would fall into this bucket. So squishing lat/long together as a geometry and checking for spatial closeness instead of consistency seems like the right way to go.

In that case, I would recommend converting two-column geometries to single-column geometries (e.g. geopandas.GeoSeries to support all types of geometries, or a simple lon, lat tuple if all we want to worry about for now are points) in an initial step. That way we avoid developing a syntax for multi-column constraints just for coordinate pairs.
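
A minimal sketch of that initial step, assuming point locations and geopandas (the column names and values are illustrative):

```python
import pandas as pd
import geopandas as gpd

df = pd.DataFrame({
    "plant_id_eia": [3, 4],
    "latitude": [33.5, 41.2],
    "longitude": [-87.2, -96.1],
})

# Collapse the two coordinate columns into one geometry column up
# front, so the harvesting logic only ever sees single-column fields.
df["geometry"] = gpd.points_from_xy(df["longitude"], df["latitude"])
df = df.drop(columns=["latitude", "longitude"])
```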

Are some primary keys spread across multiple input tables? For example, can there be: [ (a1, b1) ] + [ (a2, c1) ] = [ (a1, ), (a2, ) ] entities + [ (a1, b1, null), (a2, null, c1) ] data where '(...)' denotes a table row.

No. A composite primary key should always come from an input table in which all of its pk columns exist.

So more generically, I think you are saying that output table primary keys (whether one or multiple fields) cannot contain null values?

In my example, I was trying to get at whether an entity can be spread across multiple tables. For example, would the input tables:

| a  | b  |
|----|----|
| a1 | b1 |

and

| a  | c  |
|----|----|
| a2 | c1 |

where a is the primary key of both output tables, result in the output tables:

entities

| a  |
|----|
| a1 |
| a2 |

and

data

| a  | b    | c    |
|----|------|------|
| a1 | b1   | null |
| a2 | null | c1   |

cmgosnell commented 3 years ago

@ezwelty welcome back! I hope your time away was nice.

So more generically, I think you are saying that output table primary keys (whether one or multiple fields) cannot contain null values?

I believe that is the case, although I'm a little fuzzy on how this would work for the association tables. I think in those instances it might be good to allow nulls in the PKs, because in those cases the PKs are all of the columns. If we are collecting info from tables about which utilities operate in which states and counties... it would probably be good to be able to collect state-only info for utilities.

In my example, I was trying to get at whether an entity can be spread across multiple tables.

Ah, now I see what you are asking. Yes. Definitely. The ids for a single entity almost always come from multiple tables. And because the reporting isn't airtight (and reporting requirements don't generally match up), we'll end up with missing data for some entities because they didn't show up in all of the input tables.
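
A minimal pandas sketch of that behavior, using the a1/a2 example above:

```python
import pandas as pd

t1 = pd.DataFrame({"a": ["a1"], "b": ["b1"]})
t2 = pd.DataFrame({"a": ["a2"], "c": ["c1"]})

# Entity ids are pooled from every input table that carries them...
entities = pd.concat([t1[["a"]], t2[["a"]]]).drop_duplicates().reset_index(drop=True)

# ...and data columns are left-joined back on, so entities missing
# from some inputs simply end up with nulls in those columns.
data = entities.merge(t1, on="a", how="left").merge(t2, on="a", how="left")
```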

ezwelty commented 3 years ago

@cmgosnell I've started drafting a proposal and fleshing out your example, but it seems to be incomplete? For example, the resulting 3 output tables discard the linkage between plant_id_eia & generator_id and utility_id_eia & utility_name_eia. There are also fields that don't get carried over, like sales, net_generation_mwh, and capacity_mw. This was likely on purpose to make the example more lightweight, but it will probably serve us better to make it a complete example relative to the input tables. Could you fill out the missing columns?

cmgosnell commented 3 years ago

that is fair! I was just focused on the entity-type output tables, not the "data" output tables. I'll add those tables and columns.

cmgosnell commented 3 years ago

okay @ezwelty I believe I've fleshed out the output tables in the notebook. I checked it into the sprint24 branch! lmk if that works for ya or if you have more questions.

zaneselvans commented 3 years ago

Hey @ezwelty, @cmgosnell and I chatted about the time columns and different frequencies, and also discovered the very helpful pd.merge_asof() function, and searched for all of the inter-frequency merges that are happening in the codebase now, which is not that many (and they're in the output objects / analysis layer). So we're thinking of renaming the time columns to reflect their frequency, like report_year, report_month, report_quarter, etc., but keeping them all datetime types.
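
For reference, a minimal sketch of the kind of inter-frequency merge pd.merge_asof() enables (the table contents here are illustrative):

```python
import pandas as pd

# Monthly observations pick up the most recent annually-reported
# attribute for the same plant via a backward as-of merge.
monthly = pd.DataFrame({
    "plant_id_eia": [3, 3, 3],
    "report_date": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01"]),
    "net_generation_mwh": [100.0, 110.0, 95.0],
})
annual = pd.DataFrame({
    "plant_id_eia": [3],
    "report_date": pd.to_datetime(["2018-01-01"]),
    "capacity_mw": [153.1],
})

# Both frames must be sorted on the `on` key; `by` restricts matches
# to rows for the same plant.
merged = pd.merge_asof(
    monthly.sort_values("report_date"),
    annual.sort_values("report_date"),
    on="report_date",
    by="plant_id_eia",
)
```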

Given that we do want to combine attribute values from tables that are reported with different frequencies, but which pertain to entities (utilities, plants, etc) that are consistent on some longer timescale, I'm not sure I see a way to avoid having some infrastructure that's specific to the time columns.

ezwelty commented 3 years ago

@zaneselvans Thanks for the update.

... we do want to combine attribute values from tables that are reported with different frequencies, but which pertain to entities (utilities, plants, etc) that are consistent on some longer timescale...

Sure. Could you provide a minimal input-output example here to illustrate what you have in mind? I need to understand the operation in terms of rows, column names, primary keys ... to be able to implement it.

ezwelty commented 3 years ago

@zaneselvans @cmgosnell How would you like me to share work (documentation, code, tests) and gather feedback on it? Temporary standalone catalyst repo, pudl branch, pudl pull request?

A couple questions about the example:

  1. With the harvest logic we hashed out on the call, I can reproduce the example, except for generators_eia860. Expected:
   plant_id_eia generator_id report_date  capacity_mw
0             3            1  2018-01-01        153.1
1             3            1  2017-01-01        153.1
2             3            2  2018-01-01         50.0
3             3            2  2017-01-01         50.0
4             4            a  2017-01-01          NaN
5             4            b  2017-01-01          NaN

Q: Should the last two rows be there? If not, then it is settled. If yes, this would require knowing to harvest only from inputs generators_eia860 and boiler_generator_assn_eia860, which is neither my understanding of data table behavior nor entity/association table behavior. This was previously taken care of by a distinction between report_date and report_year, avoiding pulling in keys from the monthly generation_eia923. It could serve as an example as we further discuss the question of datetime field handling.

With harvest: False (data table), it extracts all fields from the input table with the same name:

   plant_id_eia generator_id report_date  capacity_mw
0             3            1  2018-01-01        153.1
1             3            1  2017-01-01        153.1
2             3            2  2017-01-01         50.0
3             3            2  2018-01-01         50.0

i.e. the 4 rows of input generators_eia860

With harvest: True, it harvests all the instances of the primary key fields (plant_id_eia, generator_id, report_date) from the other input tables. If rows where all data fields are null are removed, the result is equivalent to the above, since only input generators_eia860 has both the primary key fields and capacity_mw (see the sketch after this list).

    plant_id_eia generator_id report_date  capacity_mw
0              3            1  2017-01-01        153.1
1              3            1  2018-01-01        153.1
2              3            1  2018-02-01          NaN
3              3            1  2018-03-01          NaN
4              3            1  2018-04-01          NaN
5              3            1  2018-05-01          NaN
6              3            1  2018-06-01          NaN
7              3            1  2018-07-01          NaN
8              3            1  2018-08-01          NaN
9              3            1  2018-09-01          NaN
10             3            1  2018-10-01          NaN
11             3            1  2018-11-01          NaN
12             3            1  2018-12-01          NaN
13             3            2  2017-01-01         50.0
14             3            2  2018-01-01         50.0
15             4            a  2017-01-01          NaN
16             4            b  2017-01-01          NaN

  2. I am curious why input table generators_eia860 links plant_id_eia and generator_id with utility_id_eia, but this relation is not retained anywhere in the outputs. If an unintentional omission, I presume a simple association table would be used to preserve that relationship?
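
For concreteness, a runnable sketch of the harvest: True path described in item 1, with tiny stand-ins for the inputs (values copied from the printouts above; the final null-dropping step is the assumption in question):

```python
import pandas as pd

pk = ["plant_id_eia", "generator_id", "report_date"]

# Tiny stand-ins for the real input tables.
generators_eia860 = pd.DataFrame({
    "plant_id_eia": [3, 3, 3, 3],
    "generator_id": ["1", "1", "2", "2"],
    "report_date": pd.to_datetime(
        ["2017-01-01", "2018-01-01", "2017-01-01", "2018-01-01"]
    ),
    "capacity_mw": [153.1, 153.1, 50.0, 50.0],
})
boiler_generator_assn_eia860 = pd.DataFrame({
    "plant_id_eia": [4, 4],
    "generator_id": ["a", "b"],
    "report_date": pd.to_datetime(["2017-01-01", "2017-01-01"]),
})
input_tables = [generators_eia860, boiler_generator_assn_eia860]

# Pool every instance of the primary key fields from any input table
# that has all of them.
keys = pd.concat(
    [t[pk] for t in input_tables if set(pk).issubset(t.columns)]
).drop_duplicates()

# Left-join the data fields, then drop rows where every data field is
# null; with capacity_mw as the only data field this recovers the
# 4-row version of generators_eia860 shown above.
out = keys.merge(generators_eia860, on=pk, how="left")
out = out.dropna(subset=["capacity_mw"]).reset_index(drop=True)
```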

p.s. Not sure if this is helpful, but in regards to time, here is how I would imagine organizing the tables. The example assumes 1-many utilities-plants, and that attributes are reported either annually, quarterly, or monthly.

zaneselvans commented 3 years ago

We've been organizing feedback via branch-and-PR, with each of us making a branch off of the current sprint branch (sprint25 right now, sprint26 starting next week, but it doesn't exist yet), and then doing a PR against whatever the current sprint branch is at the time of the PR's creation. After each sprint we merge anything in the sprint branch back into dev and create a new sprint branch.

Let me think some more about the other stuff in here.