catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Nightly build failure 2024-01-24 #3284

Closed zaneselvans closed 5 months ago

zaneselvans commented 5 months ago

Data validation failure

The utility and balancing authority service territories derived from EIA-861 have significantly more records than expected.

test/validate/service_territory_test.py::test_minmax_rows[compiled_geometry_balancing_authority_eia861-108436] 
-------------------------------- live log call ---------------------------------
2023-08-15 14:53:49 [    INFO] catalystcoop.pudl.validate:194 compiled_geometry_balancing_authority_eia861: found 112507 rows, expected 108436. Off by 3.754%, allowed margin of 0.000%
FAILED                                                                   [ 99%]
test/validate/service_territory_test.py::test_minmax_rows[compiled_geometry_utility_eia861-237872] 
-------------------------------- live log call ---------------------------------
2023-08-15 14:53:52 [    INFO] catalystcoop.pudl.validate:194 compiled_geometry_utility_eia861: found 247705 rows, expected 237872. Off by 4.134%, allowed margin of 0.000%
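
The "Off by" percentages in the log lines above are consistent with a plain relative difference between the found and expected row counts. Here is a minimal sketch of that arithmetic, assuming the check is `abs(found - expected) / expected` compared against the allowed margin. The function names are hypothetical and this is not the actual `pudl.validate` implementation.

```python
def row_count_deviation(found: int, expected: int) -> float:
    """Relative difference between the found and expected row counts."""
    return abs(found - expected) / expected


def within_margin(found: int, expected: int, margin: float = 0.0) -> bool:
    """True if the deviation does not exceed the allowed margin (assumed rule)."""
    return row_count_deviation(found, expected) <= margin


# Row counts copied from the two log lines above.
observations = [
    ("compiled_geometry_balancing_authority_eia861", 112507, 108436),
    ("compiled_geometry_utility_eia861", 247705, 237872),
]

for table, found, expected in observations:
    off_by = row_count_deviation(found, expected)
    status = "ok" if within_margin(found, expected) else "FAILED"
    print(f"{table}: found {found}, expected {expected}, off by {off_by:.3%} ({status})")
```

With a margin of 0.000%, any difference at all between the two counts counts as a failure under this rule.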

Build script success criteria:

Datasette doesn't match?

Curious whether the Datasette deployment had taken place given the above combination of failure and perceived success, I counted the rows in the two affected tables. To my surprise, the counts matched neither the expected nor the observed row counts above:

SELECT COUNT(*) FROM out_eia861__compiled_geometry_utilities;
-- 248987
SELECT COUNT(*) FROM out_eia861__compiled_geometry_balancing_authorities;
-- 112853

ETL Logs:

pudl-etl.log

jdangerx commented 5 months ago

I'll note here that the pudl_codes_datasources table does include a pudl_version column which appears to point at the right git hash:

[image]

jdangerx commented 5 months ago

I'm surprised that pytest ... && touch $PUDL_OUTPUT/success touched the success file even though pytest failed... I'm going to check whether the successful build from 2024-01-23 also had this behavior.
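
For context on why that would be surprising: in a POSIX shell, `a && b` only runs `b` when `a` exits with status 0. The sketch below checks that behavior from Python, using `false` and `true` as stand-ins for a failing and a passing pytest run. The marker paths and the helper name are hypothetical, and this is not the actual build script.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path


def run_then_touch(test_command: str, marker: Path) -> int:
    """Run `<test_command> && touch <marker>` in a shell and return its exit code."""
    command = f"{test_command} && touch {shlex.quote(str(marker))}"
    return subprocess.run(command, shell=True).returncode


with tempfile.TemporaryDirectory() as tmp:
    failed_marker = Path(tmp) / "success_after_failure"
    passed_marker = Path(tmp) / "success_after_pass"

    # `false` always exits non-zero, standing in for a failing test run.
    failed_code = run_then_touch("false", failed_marker)
    # `true` always exits zero, standing in for a passing test run.
    passed_code = run_then_touch("true", passed_marker)

    failed_marker_exists = failed_marker.exists()
    passed_marker_exists = passed_marker.exists()

print(f"failing run: exit code {failed_code}, marker created: {failed_marker_exists}")
print(f"passing run: exit code {passed_code}, marker created: {passed_marker_exists}")
```

If the success file exists, the command on the left of `&&` most likely exited with 0, which is a hint that the tests did not fail in that run.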

jdangerx commented 5 months ago

I think the logs you're looking at are for the wrong day, actually! 😌 See the timestamp:

2023-08-15 14:53:52 [    INFO] catalystcoop.pudl.validate:194 compiled_geometry_utility_eia861: found 247705 rows, expected 237872. Off by 4.134%, allowed margin of 0.000%
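
A quick way to catch this kind of mix-up is to compare the date at the start of each log line with the date of the build being investigated. This is a rough sketch, assuming every relevant line begins with a `YYYY-MM-DD HH:MM:SS` timestamp as in the line above. The helper name is hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional

# Matches a leading timestamp like "2023-08-15 14:53:52".
LOG_TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s")


def log_line_date(line: str) -> Optional[date]:
    """Return the date at the start of a log line, or None if there is none."""
    match = LOG_TIMESTAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S").date()


build_date = date(2024, 1, 24)  # the nightly build named in the issue title
line = (
    "2023-08-15 14:53:52 [    INFO] catalystcoop.pudl.validate:194 "
    "compiled_geometry_utility_eia861: found 247705 rows, expected 237872. "
    "Off by 4.134%, allowed margin of 0.000%"
)

line_date = log_line_date(line)
is_from_build = line_date == build_date
print(f"log line date: {line_date}, matches build date {build_date}: {is_from_build}")
```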

I can't find any test failures in the 2024-01-24 logs - instead I see test passes (albeit weirdly intertwined with some other tests bc of our multi-worker situation):

test/validate/service_territory_test.py::test_minmax_rows[compiled_geometry_balancing_authority_eia861-112853] 
[gw3] [ 93%] PASSED test/validate/service_territory_test.py::test_minmax_rows[compiled_geometry_balancing_authority_eia861-112853] 
test/validate/service_territory_test.py::test_minmax_rows[compiled_geometry_utility_eia861-248987] 
[gw2] [ 93%] PASSED test/validate/plant_parts_eia_test.py::test_run_aggregations[eia_annual] 
test/validate/bf_eia923_test.py::test_vs_bounds[eia_monthly-coal_heat_content] 
[gw2] [ 93%] SKIPPED test/validate/bf_eia923_test.py::test_vs_bounds[eia_monthly-coal_heat_content] 
test/validate/eia_test.py::test_unique_rows_eia[eia_monthly-bga_eia860-unique_subset1] 
[gw3] [ 93%] PASSED test/validate/service_territory_test.py::test_minmax_rows[compiled_geometry_utility_eia861-248987] 

In addition, it seems like 112853 is the new expected row count, which matches up with Datasette. So that's good too!

zaneselvans commented 5 months ago

🤦🏼

zaneselvans commented 5 months ago

Okay, so this was just a network hiccup in attempting to publish the data release to the Zenodo Sandbox, and actually everything is materially fine.