catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Need metadata documentation on entity_types (specifically type "N" and "blank") #1841

Closed jacobalder closed 2 years ago

jacobalder commented 2 years ago

Entity type Documentation Error in utilities_eia860

Hi! Thanks for PUDL, what an awesome resource.

Here's the issue:

The utilities_eia860 table contains the following entity types (with their count in the table to the right)

entity_type count
[blank] 0
N 15812
C 4514
COM 3531
F 497
I 3575
IND 3978
M 13730
O 68
P 631
Q 19446
S 976

Each of which is documented in the extensive PUDL documentation ... EXCEPT FOR entity types "N" and "[blank]".

What are these types? I have also not found the definition for "N" or "blank" on the EIA 860 form documentation. Where do these come from? Also, how can "blank" be an entity type??

Bug Severity

How badly is this bug affecting you?

To Reproduce

In the PUDL SQLite you input the query: SELECT DISTINCT entity_type FROM utilities_eia860 ORDER BY entity_type

OR click on this link to PUDL

OR click this link to PUDL

Expected behavior

What is entity_type = "N"? What is entity_type = "[blank]"? I need a definition for both of these in order to correctly group entities for my project

Software Environment?

Suggestion

I would suggest adding an entity_types_eia table to the PUDL database and adding documentation for the entity_type "N". Even though the entity type "[blank]" appears to have zero observations in the table, it's still there and should be eliminated, defined, or subsumed into "O" for "Other".

[^1]: note: if there is a primary key, please point me towards it!! I have not found it in the documentation and was not necessarily expecting it because it does not appear to exist in the EIA 860 source tables)

zaneselvans commented 2 years ago

Hi there, thanks for using PUDL!

There is a primary key in the utilities_eia860 table, which you can see if you look at the SQL at the bottom of the main page for that table, but maybe that's not the table you're thinking of? This table creation statement should exist and explicitly name the PK for every table though.

CREATE TABLE utilities_eia860 (
    utility_id_eia INTEGER NOT NULL CHECK (TYPEOF("utility_id_eia") = 'integer'), 
    report_date DATE NOT NULL CHECK ("report_date" IS DATE("report_date")), 
    street_address TEXT CHECK ("street_address" IS NULL OR TYPEOF("street_address") = 'text'), 
    city TEXT CHECK ("city" IS NULL OR TYPEOF("city") = 'text'), 
    state TEXT CHECK ("state" IS NULL OR TYPEOF("state") = 'text'), 
    zip_code TEXT CHECK ("zip_code" REGEXP '^\d{5}$') CHECK ("zip_code" IS NULL OR TYPEOF("zip_code") = 'text'), 
    plants_reported_owner BOOLEAN CHECK ("plants_reported_owner" IS NULL OR (TYPEOF("plants_reported_owner") = 'integer' AND "plants_reported_owner" IN (0, 1))), 
    plants_reported_operator BOOLEAN CHECK ("plants_reported_operator" IS NULL OR (TYPEOF("plants_reported_operator") = 'integer' AND "plants_reported_operator" IN (0, 1))), 
    plants_reported_asset_manager BOOLEAN CHECK ("plants_reported_asset_manager" IS NULL OR (TYPEOF("plants_reported_asset_manager") = 'integer' AND "plants_reported_asset_manager" IN (0, 1))), 
    plants_reported_other_relationship BOOLEAN CHECK ("plants_reported_other_relationship" IS NULL OR (TYPEOF("plants_reported_other_relationship") = 'integer' AND "plants_reported_other_relationship" IN (0, 1))), 
    entity_type TEXT CHECK ("entity_type" IS NULL OR TYPEOF("entity_type") = 'text'), 
    attention_line TEXT CHECK ("attention_line" IS NULL OR TYPEOF("attention_line") = 'text'), 
    address_2 TEXT CHECK ("address_2" IS NULL OR TYPEOF("address_2") = 'text'), 
    zip_code_4 TEXT CHECK ("zip_code_4" REGEXP '^\d{4}$') CHECK ("zip_code_4" IS NULL OR TYPEOF("zip_code_4") = 'text'), 
    contact_firstname TEXT CHECK ("contact_firstname" IS NULL OR TYPEOF("contact_firstname") = 'text'), 
    contact_lastname TEXT CHECK ("contact_lastname" IS NULL OR TYPEOF("contact_lastname") = 'text'), 
    contact_title TEXT CHECK ("contact_title" IS NULL OR TYPEOF("contact_title") = 'text'), 
    phone_number TEXT CHECK ("phone_number" IS NULL OR TYPEOF("phone_number") = 'text'), 
    phone_extension TEXT CHECK ("phone_extension" IS NULL OR TYPEOF("phone_extension") = 'text'), 
    contact_firstname_2 TEXT CHECK ("contact_firstname_2" IS NULL OR TYPEOF("contact_firstname_2") = 'text'), 
    contact_lastname_2 TEXT CHECK ("contact_lastname_2" IS NULL OR TYPEOF("contact_lastname_2") = 'text'), 
    contact_title_2 TEXT CHECK ("contact_title_2" IS NULL OR TYPEOF("contact_title_2") = 'text'), 
    phone_number_2 TEXT CHECK ("phone_number_2" IS NULL OR TYPEOF("phone_number_2") = 'text'), 
    phone_extension_2 TEXT CHECK ("phone_extension_2" IS NULL OR TYPEOF("phone_extension_2") = 'text'), 
    PRIMARY KEY (utility_id_eia, report_date), 
    FOREIGN KEY(utility_id_eia) REFERENCES utilities_entity_eia (utility_id_eia)
);

Looking at the entity_type values by year, I see that all of the N values are from 2001-2006, and the data quality control definitely gets worse the further back you go. In the key to the codes for the Utility table in 2006, it doesn't define the meaning for entity_type=N, so we don't know what it means... If I had to make a wild ass guess, I would guess that it means non-regulated utility? Since in those years they seem to only be categorizing the ownership of regulated utilities.

I'll also note that the coding table you pointed to in the documentation is currently based on the entity_type values reported in the EIA-861 data, and we haven't actually reconciled the codes from EIA-860 with EIA-861 yet (as EIA-861 isn't yet pulled into the core database) so there might be discrepancies between those two sets of codes (though they do seem to refer to the same categorization of ownership entities)

entity_type = blank will probably end up being an NA values -- they just didn't fill in that field. Or we may be able to backfill with the entity_type reported in later years if it's stable over time. Not sure though.

I think the main confusion here is that this coding table is in our documentation but not in the database, and that's intentional, since it's currently referring to data that's also not in the database, but it does have some overlap with the EIA-860 data, and will eventually be reconciled with it, but at the moment they're still distinct and messy categorizations / works-in-progress.

jacobalder commented 2 years ago

Thank you for your detailed comment--including your response to my footnote. Indeed, this is a helpful verification.

I'm a bit perplexed about the differences in entity_type and am going to use the assumption that the 861 documentation and 860 documentation refer to the same common dictionary. If they don't, it's going to be a much harder lift to reconcile.

zaneselvans commented 2 years ago

I think they're probably supposed to refer to the same set of codes, but sometimes they don't coordinate well between the different forms, and sometimes the codes change from year to year. I imagine (🤞🏼) there's a way to reconcile them but it'll take some digging.