catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Create simple test case for failure to load PUDL into Postgresql #430

Closed zaneselvans closed 4 years ago

zaneselvans commented 4 years ago

As @briannacote noted in #413, there's some issue which prevents the loading of even an already flattened / simple PUDL data package into a postgres database. @roll thinks this is something which should actually work, but needs a test case to debug it and fix it in the data package libraries.

We need to come up with a simple test case that demonstrates the "non-unique foreign key" issue.

The error which gets thrown:

psycopg2.errors.InvalidForeignKey: there is no unique constraint matching given keys for referenced table "generators_entity_eia"

Resulting from this SQL:

CREATE TABLE ownership_eia860 (
    id SERIAL NOT NULL, 
    report_date DATE, 
    utility_id_eia INTEGER, 
    plant_id_eia INTEGER, 
    generator_id TEXT, 
    operational_status_code TEXT, 
    owner_utility_id_eia INTEGER, 
    owner_name TEXT, 
    owner_state TEXT, 
    owner_city TEXT, 
    owner_street_address TEXT, 
    owner_zip_code TEXT, 
    fraction_owned NUMERIC, 
    PRIMARY KEY (id), 
    FOREIGN KEY(utility_id_eia) REFERENCES utilities_entity_eia (utility_id_eia), 
    FOREIGN KEY(plant_id_eia) REFERENCES generators_entity_eia (plant_id_eia), 
    FOREIGN KEY(generator_id) REFERENCES generators_entity_eia (generator_id)
)

]
(Background on this error at: http://sqlalche.me/e/f405)
briannacote commented 4 years ago

Hey there,

I ran into this with a few tables with the Foreign Keys. I can provide further details if needed. Anything you need from my attempts at this just let me know.

I have also starting building code around the SQLite database as well. I've not finished but made some headway that way too.

Best, Bri

roll commented 4 years ago

@briannacote Hi, yes, it would be really great if you provide some details

briannacote commented 4 years ago

Cool. I had to update the datapackage.json file and remove the following foreign keys for the EIA data I was wanting to use (860 and 923). For give the basic-ness of this diff. The left hand is the original datapackage.json file created from the pudl process. The right hand side I had to remove certain foreign keys. Not all of them. I had to do a handful for the ferc .json file as well. I can send the actually files if you like?

(base) admins-MacBook-Pro:eia-example briannacote$ diff datapackage.json datapackage_noForeignKeys.json 320,342d319 < "foreignKeys": [ < { < "fields": "utility_id_eia", < "reference": { < "resource": "utilities_entity_eia", < "fields": "utility_id_eia" < } < }, < { < "fields": "plant_id_eia", < "reference": { < "resource": "generators_entity_eia", < "fields": "plant_id_eia" < } < }, < { < "fields": "generator_id", < "reference": { < "resource": "generators_entity_eia", < "fields": "generator_id" < } < } < ], 702,717d678 < "foreignKeys": [ < { < "fields": "plant_id_eia", < "reference": { < "resource": "generators_entity_eia", < "fields": "plant_id_eia" < } < }, < { < "fields": "generator_id", < "reference": { < "resource": "generators_entity_eia", < "fields": "generator_id" < } < } < ], 989,1004d949 < "foreignKeys": [ < { < "fields": "plant_id_eia", < "reference": { < "resource": "generators_entity_eia", < "fields": "plant_id_eia" < } < }, < { < "fields": "generator_id", < "reference": { < "resource": "generators_entity_eia", < "fields": "generator_id" < } < } < ], 1433,1448d1377 < "foreignKeys": [ < { < "fields": "plant_id_eia", < "reference": { < "resource": "generators_entity_eia", < "fields": "plant_id_eia" < } < }, < { < "fields": "generator_id", < "reference": { < "resource": "generators_entity_eia", < "fields": "generator_id" < } < } < ], 2938c2867 < } \ No newline at end of file

}

briannacote commented 4 years ago

Here's the files. I just realized these were made with my setup in particular. I hope this helps show what relationships were issues.

This is just from the EIA side (923 and 860)

https://drive.google.com/drive/folders/1DSZ8WWfQXFlXwyB9UIUORwZ1jNXVQmoZ?usp=sharing

I can send the ferc difference I saw as well. I didn't look at the CEMS data or the EPA IPM data.

Let me know if there's other items I can provide.

roll commented 4 years ago

Thanks @briannacote I'll use it

roll commented 4 years ago

@cmgosnell @zaneselvans The problems that these two fields from the generators_entity_eia resource miss unique constraint. It has a composite PK but it's not enough for PostgreSQL. Added this constraint and it works for me now with Postgres:

                    {
                        "name": "plant_id_eia",
                        "type": "integer",
                        "description": "EIA Plant Identification number. One to five digit numeric.",
                        "format": "default",
                        "constraints": {
                            "unique": true
                        }
                    },
                    {
                        "name": "generator_id",
                        "type": "string",
                        "description": "Generator identification number",
                        "format": "default",
                        "constraints": {
                            "unique": true
                        }
                    },

I have added a better error message to our code but I think we shouldn't alter incoming data packages adding additional unique/primary keys because it's tricky not to mess with initial metadata.

Should be an easy fix on the PUDL level.

roll commented 4 years ago

Update

                "foreignKeys": [
                    {
                        "fields": "plant_id_eia",
                        "reference": {
                            "resource": "generators_entity_eia",
                            "fields": "plant_id_eia"
                        }
                    },
                    {
                        "fields": "generator_id",
                        "reference": {
                            "resource": "generators_entity_eia",
                            "fields": "generator_id"
                        }
                    }
                ],

probably should be a composite FK

cmgosnell commented 4 years ago

Thanks for catching this @roll! These should have definitely been composite FKs - and so should have several of our other FKs so I went through and cleaned them up. I've just checked in changes to the stored resource metadata that should fix this problem for you @briannacote. It worked fine for me, but please let me know if you run into problems. Sorry it took so long for us to get to the bottom of this.

zaneselvans commented 4 years ago

It seems to work for me! I generated a data package with 2015-2017 data for EIA 860/923 + FERC Form 1, and was able to load it into both SQLite and PostgreSQL using datapackage.save() without any mishaps (once I remembered to drop my old PUDL database...)

I think I'll go ahead and close this issue, but @briannacote please do let us know if something still isn't working on this...

briannacote commented 4 years ago

This is awesome! I'll give it a go soon. I've also connected to the SQLite database as well. So we have options now over here. I'll let you know if I run into any issues again trying it with Postgres.

Also, forgive my delay in getting back. I was out of the office the last two weeks.