catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Stress test the new MCOE outputs #1107

Closed zaneselvans closed 3 years ago

zaneselvans commented 3 years ago

Fixing the merge_on_date_year issues has the potential to affect many of the MCOE and EIA 923 outputs. Do sanity checks and visualizations to make sure nothing is amiss, and if possible integrate them into the data validation or integration tests.

Weirdness to look into

zaneselvans commented 3 years ago

Some Visualizations

pudl_out_monthly = pudl.output.pudltabl.PudlTabl(pudl_engine=pudl_engine, freq="MS")
mcoe = pudl_out_monthly.mcoe()[
    [
        "report_date",
        "plant_id_eia",
        "unit_id_pudl",
        "generator_id",
        "capacity_mw",
        "capacity_factor",
        "fuel_cost_per_mmbtu",
        "fuel_cost_per_mwh",
        "fuel_type_code_pudl",
        "heat_rate_mmbtu_mwh",
        "net_generation_mwh",
        "prime_mover_code",
    ]
]

mcoe_cc = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.prime_mover_code.isin(["CC", "CT", "CA", "CS"]))
]

mcoe_st = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.prime_mover_code == "ST")
]

mcoe_gt = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.prime_mover_code == "GT")
]

kwargs = {
    "bins": 100,
    "range": (6, 16),
    "lw": 0,
    "alpha": 0.5,
}
plt.hist(
    mcoe_cc.heat_rate_mmbtu_mwh,
    weights=mcoe_cc.net_generation_mwh,
    label="Combined Cycle",
    **kwargs,
)
plt.hist(
    mcoe_st.heat_rate_mmbtu_mwh,
    weights=mcoe_st.net_generation_mwh,
    label="Steam Turbine",
    **kwargs,
)
plt.hist(
    mcoe_gt.heat_rate_mmbtu_mwh,
    weights=mcoe_gt.net_generation_mwh,
    label="Combustion Turbine",
    **kwargs,
)

plt.title("Monthly Average Heat Rate by Prime Mover (Weighted by MWh)")
plt.xlabel("Heat Rate (MMBTU/MWh)")
plt.ylabel("MWh")
plt.legend()
plt.show();

mcoe_gas = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.fuel_type_code_pudl == "gas")
]

mcoe_coal = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.fuel_type_code_pudl == "coal")
]

mcoe_oil = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.fuel_type_code_pudl == "oil")
]

plt.hist(
    mcoe_gas.heat_rate_mmbtu_mwh,
    weights=mcoe_gas.net_generation_mwh,
    label="Gas",
    **kwargs,
)
plt.hist(
    mcoe_coal.heat_rate_mmbtu_mwh,
    weights=mcoe_coal.net_generation_mwh,
    label="Coal",
    **kwargs,
)
plt.hist(
    mcoe_oil.heat_rate_mmbtu_mwh,
    weights=mcoe_oil.net_generation_mwh,
    label="Oil",
    **kwargs,
)
plt.title("Monthly Average Heat Rate by Primary Fuel (Weighted by MWh)")
plt.xlabel("Heat Rate (MMBTU/MWh)")
plt.ylabel("MWh")
plt.legend()
plt.show();

image image

kwargs = {
    "bins": 150,
    "range": (0.01, 1.51),
    "lw": 0,
    "alpha": 0.5,
    "density": True,
}

plt.hist(
    mcoe_cc.capacity_factor,
    weights=mcoe_cc.capacity_mw,
    label="Combined Cycle",
    **kwargs,
)
plt.hist(
    mcoe_st.capacity_factor,
    weights=mcoe_st.capacity_mw,
    label="Steam Turbine",
    **kwargs,
)
plt.hist(
    mcoe_gt.capacity_factor,
    weights=mcoe_gt.capacity_mw,
    label="Combustion Turbine",
    **kwargs,
)
plt.title("Normalized Capacity Factor by Prime Mover (Weighted by MW)")
plt.xlabel("Capacity Factor")
plt.legend();

image

zaneselvans commented 3 years ago

Missing date-plant-gen combinations

It seems like the problem here was the outer merge happening in the mcoe() function, which brought together EIA-860 generator data and fuel cost / capacity factor data. Somehow, there are combinations of ("report_date", "plant_id_eia", "generator_id") which appear in the fuel cost data, that never show up in the generators_eia860 table, so they end up with all their columns being null, except for those columns which come from the fuel_cost table.

How can this happen? Our entity resolution process should have collected every possible combination of year, plant, and generator, so it shouldn't be possible that a combo shows up in the data somewhere, but not in the entity table. It seems like the most likely scenario is that the use of merge_asof() to connect annual and monthly tables has broadcast some plant_id_eia and generator_id to dates that they were never reported in. How would I test this?

unit_id_pudl needs to be forward / back filled

Looking at the time distribution of NA rows in the MCOE outputs, about half of them are in 2008, with the rest fairly evenly spread throughout the other years. It looks like another thing that's going on here is there are some generators which are associated with unit_id_pudl values in some years, but not in others, and the forward/back filling in the new unit_id_pudl generation process hasn't been turned on. This means that some generators will have valid heat rates in some years, and not others. Not sure if this would play into the above error and the almost all-null rows, but it's definitely reducing overall coverage. To see how this change affects the output, I could either do just the forward/back filling of unit IDs, or I could turn on the larger unit_id assignment process (but that will probably break things since it changes the nature of the foreign key relations in the BGA / generators tables.

zaneselvans commented 3 years ago

The problem here turned out to be not in a mismatch between the MCOE derived values and generators_eia860 but between the fuel_cost and capacity_factor tables, which didn't cover exactly the same set of generator-years and which were in effect being outer-merged, leading to the mostly null rows... because one of them had already been merged with the gens_eia860 before being brought together with the other one...

fuel_cost merged with gens_eia860

# Cut-and-pasted from pudl.analysis.mcoe.mcoe()
merge_cols = ['plant_id_eia', 'generator_id', 'report_date']
drop_cols = [x for x in pudl_out.gens_eia860().columns if x in pudl_out.fuel_cost().columns and x not in merge_cols]
# start with the generators table so we have all of the generators
mcoe_out = pudl.helpers.clean_merge_asof(
    left=pudl_out.fuel_cost().drop(drop_cols, axis=1),
    right=pudl_out.gens_eia860(),
    by={
        "plant_id_eia": "eia",
        "generator_id": "eia",
    }
)
plot_null_row_hist(mcoe_out)

image

Then merged with capacity_factor

mcoe_out = pd.merge(
    mcoe_out,
    pudl_out.capacity_factor()[[
        'report_date',
        'plant_id_eia',
        'generator_id',
        'capacity_factor',
        'net_generation_mwh'
    ]],
    on=['report_date', 'plant_id_eia', 'generator_id'],
    how='outer'
)
plot_null_row_hist(mcoe_out)

The mostly NULL rows create the little spike on the right near 1.0.

image

Outer merge fuel_cost and capacity_factor first

Instead of the above, now it outer-merges the derived MCOE values together first, and only then merges that with the gens_eia860 table. Whether it's a left or right merge depends on if you want all the generator-years (even those that don't show up in the MCOE outputs) or just the attributes of the generator-years which do show up in the MCOE outputs. By default it gets all of them, for backward compatibility. But if you set all_gens=False in the args, you get just the relevant generators.

In the all_gens case, the distribution of Null values is a mix of the nice bell curve from the MCOE outputs, and a sharper bell-ish curve from the gens_eia860 table:

image

I've added integration and data validation tests which check for excessively Null rows in these outputs, which would have caught this problem earlier. I'm also using the same function pudl.validate.no_null_rows() in the mcoe() function as a defensive check.

zaneselvans commented 3 years ago

The new MCOE process works fine in my development notebook, which is running against a version of the PUDL DB which I generated on July 18th from the release-v0.4.0 branch. However, the tests fail due to a small number of rows in the generators_eia860 table which are almost entirely null. They don't look like they're null due to bad merges though. They look more like generator-years that were harvested from tables other than the generator table, meaning that they lack most generator attributes, but will have a few columns filled in like county or prime_mover_code.

However, the ETL process and the pudl.output.eia860 module make no use of the new clean_merge_asof() function, so it seems unlikely that it's the source of the difference. What else has changed in the last month that could be responsible for this cluster of mostly-null rows?

Nullness of MCOE with all generator records included:

image

Nullness of MCOE with only MCOE-involved generator records:

image

Nullness of the generators_eia860 table itself

The cluster of records that are more than 70% is the problem:

image

Investigate:

Fix:

zaneselvans commented 3 years ago

After re-running the ETL it looks like this is mostly just the real null-ness of the generator data, and that it's got some extreme outliers when only a single year is run. Somehow having the full suite of years results in fewer outliers. Maybe because there are more permanent values from the generator entity table that fill in missing blanks.

I thought 2020 (the eia860m) data would have the most nulls, but it turns out this honor goes to 2007, by like a factor of 10x.

I've set the nullness threshold to 0.8 in the data validation tests and 0.9 in the integration tests (which only run with a year's worth of data)