Stress test the new MCOE outputs

zaneselvans commented 3 years ago

Fixing the merge_on_date_year issues has the potential to affect many of the MCOE and EIA 923 outputs. Do sanity checks and visualizations to make sure nothing is amiss, and if possible integrate them into the data validation or integration tests.

Weirdness to look into

[x] The only values of prime_mover_code which show up in the MCOE outputs are [ST, CT, CA, CS, <NA>] and about 9% of all records are NA. There are no gas turbines or internal combustion engines included in the MCOE outputs.
- The fact that we only have heat rates for the CC and ST units makes sense, given that this analysis currently depends on the boiler-generator-association table, and gas turbines have never had boilers associated with them, so they wouldn't get unit IDs.
[x] About 8% of all records have NA values for unit_id_pudl -- is this expected?
- All records with NA values for unit_id_pudl also have NA values for heat_rate_mmbtu_mwh which makes sense. These records are probably in here only because the generators_eia860 table is being used as the backbone for the MCOE output.
[ ] About 8% of all records have NA values for fuel_type_code_pudl.
- A large proportion of these generator records also have valid non-null fuel_type_code_pudl values in the generators_eia860 table, so this information is getting lost somewhere along the way as well.
[ ] There are generators that have values for capacity_factor and net_generation_mwh but where capacity_mw is NA
- All of the records with NA values for capacity_mw but non-null values for capacity_factor correspond to generators which do, in fact, have capacity factors in the generators_eia860 table, so we are messing up a merge here somehow, and probably losing other generator attributes. Need to chase this down.

zaneselvans commented 3 years ago

Some Visualizations

pudl_out_monthly = pudl.output.pudltabl.PudlTabl(pudl_engine=pudl_engine, freq="MS")
mcoe = pudl_out_monthly.mcoe()[
    [
        "report_date",
        "plant_id_eia",
        "unit_id_pudl",
        "generator_id",
        "capacity_mw",
        "capacity_factor",
        "fuel_cost_per_mmbtu",
        "fuel_cost_per_mwh",
        "fuel_type_code_pudl",
        "heat_rate_mmbtu_mwh",
        "net_generation_mwh",
        "prime_mover_code",
    ]
]

mcoe_cc = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.prime_mover_code.isin(["CC", "CT", "CA", "CS"]))
]

mcoe_st = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.prime_mover_code == "ST")
]

mcoe_gt = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.prime_mover_code == "GT")
]

kwargs = {
    "bins": 100,
    "range": (6, 16),
    "lw": 0,
    "alpha": 0.5,
}
plt.hist(
    mcoe_cc.heat_rate_mmbtu_mwh,
    weights=mcoe_cc.net_generation_mwh,
    label="Combined Cycle",
    **kwargs,
)
plt.hist(
    mcoe_st.heat_rate_mmbtu_mwh,
    weights=mcoe_st.net_generation_mwh,
    label="Steam Turbine",
    **kwargs,
)
plt.hist(
    mcoe_gt.heat_rate_mmbtu_mwh,
    weights=mcoe_gt.net_generation_mwh,
    label="Combustion Turbine",
    **kwargs,
)

plt.title("Monthly Average Heat Rate by Prime Mover (Weighted by MWh)")
plt.xlabel("Heat Rate (MMBTU/MWh)")
plt.ylabel("MWh")
plt.legend()
plt.show();

mcoe_gas = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.fuel_type_code_pudl == "gas")
]

mcoe_coal = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.fuel_type_code_pudl == "coal")
]

mcoe_oil = mcoe[
    (mcoe.net_generation_mwh.notna())
    & (mcoe.fuel_type_code_pudl == "oil")
]

plt.hist(
    mcoe_gas.heat_rate_mmbtu_mwh,
    weights=mcoe_gas.net_generation_mwh,
    label="Gas",
    **kwargs,
)
plt.hist(
    mcoe_coal.heat_rate_mmbtu_mwh,
    weights=mcoe_coal.net_generation_mwh,
    label="Coal",
    **kwargs,
)
plt.hist(
    mcoe_oil.heat_rate_mmbtu_mwh,
    weights=mcoe_oil.net_generation_mwh,
    label="Oil",
    **kwargs,
)
plt.title("Monthly Average Heat Rate by Primary Fuel (Weighted by MWh)")
plt.xlabel("Heat Rate (MMBTU/MWh)")
plt.ylabel("MWh")
plt.legend()
plt.show();

kwargs = {
    "bins": 150,
    "range": (0.01, 1.51),
    "lw": 0,
    "alpha": 0.5,
    "density": True,
}

plt.hist(
    mcoe_cc.capacity_factor,
    weights=mcoe_cc.capacity_mw,
    label="Combined Cycle",
    **kwargs,
)
plt.hist(
    mcoe_st.capacity_factor,
    weights=mcoe_st.capacity_mw,
    label="Steam Turbine",
    **kwargs,
)
plt.hist(
    mcoe_gt.capacity_factor,
    weights=mcoe_gt.capacity_mw,
    label="Combustion Turbine",
    **kwargs,
)
plt.title("Normalized Capacity Factor by Prime Mover (Weighted by MW)")
plt.xlabel("Capacity Factor")
plt.legend();

zaneselvans commented 3 years ago

Missing date-plant-gen combinations

It seems like the problem here was the outer merge happening in the mcoe() function, which brought together EIA-860 generator data and fuel cost / capacity factor data. Somehow, there are combinations of ("report_date", "plant_id_eia", "generator_id") which appear in the fuel cost data, that never show up in the generators_eia860 table, so they end up with all their columns being null, except for those columns which come from the fuel_cost table.

How can this happen? Our entity resolution process should have collected every possible combination of year, plant, and generator, so it shouldn't be possible that a combo shows up in the data somewhere, but not in the entity table. It seems like the most likely scenario is that the use of merge_asof() to connect annual and monthly tables has broadcast some plant_id_eia and generator_id to dates that they were never reported in. How would I test this?

[x] Look at the capacity_factor output directly, and extract the list of generator-dates that appear in it. Compare them to the equivalent generator-years that show up in the generators_eia860 output.

`unit_id_pudl` needs to be forward / back filled

Looking at the time distribution of NA rows in the MCOE outputs, about half of them are in 2008, with the rest fairly evenly spread throughout the other years. It looks like another thing that's going on here is there are some generators which are associated with unit_id_pudl values in some years, but not in others, and the forward/back filling in the new unit_id_pudl generation process hasn't been turned on. This means that some generators will have valid heat rates in some years, and not others. Not sure if this would play into the above error and the almost all-null rows, but it's definitely reducing overall coverage. To see how this change affects the output, I could either do just the forward/back filling of unit IDs, or I could turn on the larger unit_id assignment process (but that will probably break things since it changes the nature of the foreign key relations in the BGA / generators tables.

[ ] Forward & back fill unit_id_pudl values in the generators_eia860 table and see how it impacts the MCOE outputs.

zaneselvans commented 3 years ago

The problem here turned out to be not in a mismatch between the MCOE derived values and generators_eia860 but between the fuel_cost and capacity_factor tables, which didn't cover exactly the same set of generator-years and which were in effect being outer-merged, leading to the mostly null rows... because one of them had already been merged with the gens_eia860 before being brought together with the other one...

`fuel_cost` merged with `gens_eia860`

# Cut-and-pasted from pudl.analysis.mcoe.mcoe()
merge_cols = ['plant_id_eia', 'generator_id', 'report_date']
drop_cols = [x for x in pudl_out.gens_eia860().columns if x in pudl_out.fuel_cost().columns and x not in merge_cols]
# start with the generators table so we have all of the generators
mcoe_out = pudl.helpers.clean_merge_asof(
    left=pudl_out.fuel_cost().drop(drop_cols, axis=1),
    right=pudl_out.gens_eia860(),
    by={
        "plant_id_eia": "eia",
        "generator_id": "eia",
    }
)
plot_null_row_hist(mcoe_out)

Then merged with `capacity_factor`

mcoe_out = pd.merge(
    mcoe_out,
    pudl_out.capacity_factor()[[
        'report_date',
        'plant_id_eia',
        'generator_id',
        'capacity_factor',
        'net_generation_mwh'
    ]],
    on=['report_date', 'plant_id_eia', 'generator_id'],
    how='outer'
)
plot_null_row_hist(mcoe_out)

The mostly NULL rows create the little spike on the right near 1.0.

Outer merge `fuel_cost` and `capacity_factor` first

Instead of the above, now it outer-merges the derived MCOE values together first, and only then merges that with the gens_eia860 table. Whether it's a left or right merge depends on if you want all the generator-years (even those that don't show up in the MCOE outputs) or just the attributes of the generator-years which do show up in the MCOE outputs. By default it gets all of them, for backward compatibility. But if you set all_gens=False in the args, you get just the relevant generators.

In the all_gens case, the distribution of Null values is a mix of the nice bell curve from the MCOE outputs, and a sharper bell-ish curve from the gens_eia860 table:

I've added integration and data validation tests which check for excessively Null rows in these outputs, which would have caught this problem earlier. I'm also using the same function pudl.validate.no_null_rows() in the mcoe() function as a defensive check.

zaneselvans commented 3 years ago

The new MCOE process works fine in my development notebook, which is running against a version of the PUDL DB which I generated on July 18th from the release-v0.4.0 branch. However, the tests fail due to a small number of rows in the generators_eia860 table which are almost entirely null. They don't look like they're null due to bad merges though. They look more like generator-years that were harvested from tables other than the generator table, meaning that they lack most generator attributes, but will have a few columns filled in like county or prime_mover_code.

However, the ETL process and the pudl.output.eia860 module make no use of the new clean_merge_asof() function, so it seems unlikely that it's the source of the difference. What else has changed in the last month that could be responsible for this cluster of mostly-null rows?

Nullness of MCOE with all generator records included:

Nullness of MCOE with only MCOE-involved generator records:

Nullness of the `generators_eia860` table itself

The cluster of records that are more than 70% is the problem:

Investigate:

[x] Re-run the full ETL and see what the null-ness looks like across years beyond 2019-2020.
[x] Did something change about how the YTD eia860 records are integrated? Both the old and new DBs include them.
[x] Characterize the mostly-null generators_eia860 records. What's similar about them, if anything?
[x] Check in with @cmgosnell about what other changes might have taken place in the last month that would have touched this.

Fix:

[x] This is a new and different problem. The MCOE issue has been fixed, and if we use thresh=0.9 instead of thresh=0.8 in the tests, it would still have caught the MCOE problem, without being tripped up by the extra-nulls-in-gens_eia860 issue. So maybe just up the threshold and create an issue to investigate the gens_eia860 nullness later.

zaneselvans commented 3 years ago

After re-running the ETL it looks like this is mostly just the real null-ness of the generator data, and that it's got some extreme outliers when only a single year is run. Somehow having the full suite of years results in fewer outliers. Maybe because there are more permanent values from the generator entity table that fill in missing blanks.

I thought 2020 (the eia860m) data would have the most nulls, but it turns out this honor goes to 2007, by like a factor of 10x.

I've set the nullness threshold to 0.8 in the data validation tests and 0.9 in the integration tests (which only run with a year's worth of data)

catalyst-cooperative / pudl