Who's putting NaNs in my exchange tables?

dt-woods commented 3 months ago

Processes are failing to load in openLCA due, in part, to a large number of NaNs found as amounts in process exchanges.

dt-woods commented 3 months ago

Going line by line through the generation process inventory generation, everything seems okay except for HYDRO. The problem comes from a blank line in the source CSV file for facility 1784, Twin Falls (MI). https://github.com/USEPA/ElectricityLCI/blob/2232c41f2cb4fd333ad59c8710aa55906e6a7ed3/electricitylci/data/hydropower_plant.csv#L314

UPDATE:

From G. Zaimes,

If I remember correctly, we trained a predictive model using several variables such as the area of the hydroelectric impoundment, age, latitude, longitude, and other factors. It may be that some of these factors were not defined or available for the plant you noted, although it is difficult for me to comment without looking back in the data/code. Another reason could be that the Hydropower model is calibrated to estimate emissions from conventional Hydropower facilities and not Run-Of-River facilities. It may be that the Twin Falls plant is a RoR Hydropower plant and thus we could not provide an accurate estimate of its GHG emissions intensity using our existing model (this might be the more likely culprit). I believe there was some logic in the Hydropower model to check whether a facility is Conventional vs RoR, it may be through double checking to see if this is causing the NaN entries you observe.

dt-woods commented 3 months ago

Found a second culprit in hydro upstream where "water" was misspelled. Now there are only three NaNs caused by the blank line above. https://github.com/USEPA/ElectricityLCI/blob/2232c41f2cb4fd333ad59c8710aa55906e6a7ed3/electricitylci/hydro_upstream.py#L59

dt-woods commented 3 months ago

Everything's good with generation processes, but when combined with upstream processes through __init__.write_gen_fuel_database_to_dict --> generation.olcaschema_genprocesses, loads of NaNs appear.

dt-woods commented 3 months ago

In main.py, the data frame returned by get_generation_process_df converts the column "FlowAmount" from float to object. A clue!

dt-woods commented 3 months ago

In main.py, the data frame returned by get_generation_process_df converts the column "FlowAmount" from float to object. A clue!

It looks like the culprit is generation.py's create_generation_process_df, which returns object (i.e., str) flow amounts.

dt-woods commented 3 months ago

FOUND! Setting the NOx (lbs) column to NoneType makes it string. Remove it!

https://github.com/KeyLogicLCA/ElectricityLCI/blob/2232c41f2cb4fd333ad59c8710aa55906e6a7ed3/electricitylci/ampd_plant_emissions.py#L606

dt-woods commented 3 months ago

A couple of new updates for Monday.

ElectricityLCI's get_gen_plus_netl function creates all the renewable fuel data frames (i.e., geo, solar, wind, and hydro). All fuels, except hydro, are matched to their respective EIA generation year (e.g., 2022). Hydro, on the other hand, is always set to 2016. Notably, the year 2016 is not found in the years of interest (from the inventories of interest) for any of the new model configs (i.e., 2020–2022). This becomes important when considering further down in get_get_plus_netl where generation.py's create_generation_process_df function is called. This method scans over all inventories of interest (e.g., 2019–2021) to gather up all the relevant facilities (note how 2016 facilities are not included).
The last little tidbit from get_gen_plus_netl calls combinator.py's concat_clean_upstream_and_plant, which, in turn runs fill_nans, which attempts to fill missing facility IDs based on the current year's generation data (EIA generation year), which may be 2022 and is not a part of any of the years for the inventory (i.e., hydro is 2016, and inventory years of interest are 2019–2021). :worried:

More to come from this investigation.

dt-woods commented 3 months ago

After all the fixes above; the nans in the exchange table remain. Further tracing leads me to generation.py's olcaschema_genprocess method. Walking through the turn_data_into_dict leads to this interesting bit of logic:

# Define inputs based on compartment label
data["input"] = False
input_filter = (
    (data["Compartment"].str.lower().str.contains("input"))
    | (data["Compartment"].str.lower().str.contains("resource"))
    | (data["Compartment"].str.lower().str.contains("technosphere"))
)
data.loc[input_filter, "input"] = True

# Define products based on compartment label
product_filter=(
    (data["Compartment"].str.lower().str.contains("technosphere"))
    | (data["Compartment"].str.lower().str.contains("valuable"))
)
data.loc[product_filter, "FlowType"] = "PRODUCT_FLOW"

# Define wastes based on compartment label
waste_filter = (
    (data["Compartment"].str.lower().str.contains("technosphere"))
)
data.loc[waste_filter, "FlowType"] = "WASTE_FLOW"

If I'm right, the compartment with "technosphere" in its name is always labeled with a flow type as a waste flow, due to the third clause above, so why have it the product flow query?

https://github.com/USEPA/ElectricityLCI/blob/e56268132f7607ead58a33bb5bdd525563a784f5/electricitylci/generation.py#L1091

@m-jamieson, do you remember how this labeling works?

dt-woods commented 3 months ago

Okay. The NaNs are coming from aggregate_data in generation.py somewhere. Two possible paths forward:

Keep digging through the aggregation method to see where they coming from and handle appropriately.
Modify the turn_data_to_dict to drop exchanges that have nan as their emission factor.

m-jamieson commented 3 months ago

I'm alright with 2 - I mean that's basically what happened in post-processing in the first fed commons database. Would be nice to get some idea of what's getting dropped though and why. It can only either be the sum of the emissions of the aggregated group or the electricity. I would guess the electricity route is more likely.

dt-woods commented 3 months ago

The problem is with generation_process_df. There are NaNs in the Emission_factor column associated with these flows;

'Carbon dioxide', 'Methane', 'Nitrogen oxides', 'Nitrous oxide', 'Sulfur dioxide', 'natural gas, through transmission', 'petroleum fuel, through transportation', 'coal, transported', 'coal, processed, at mine', 'power plant construction', 'Ammonia', 'Benzene', 'Ethylbenzene', 'Hexane', 'Hydrochloric acid', 'Lead', 'Manganese', 'Mercury', 'Naphthalene', 'Nickel', 'Toluene', 'Xylene', 'Polycyclic aromatic compounds', 'Benzo[ghi]perylene', 'nuclear fuel, through transportation', 'Anthracene', 'Phenanthrene', 'Hydrogen sulfide', 'Hydrofluoric acid', 'Chlorine', 'Styrene', 'Cobalt', 'Ethylene glycol'

Across just about all BAAs and stage codes.

Note that power plant construction, coal, natural gas, and petroleum NaNs are the cause for issue https://github.com/USEPA/ElectricityLCI/issues/250

m-jamieson commented 3 months ago

eh...that's a lot....like basically everything. This also suggests that all data sources are involved too. Is the current Keylogic development branch the most reasonable one to try and run to replicate? And then what data year are you running to get these nans?

dt-woods commented 3 months ago

Yes and ELCI_2022_config.yaml

m-jamieson commented 3 months ago

Just adding some things here as I debug in case I forget to come back to this later. As best as I can tell (or as pandas tells me) there are no NaN FlowAmounts or Electricity in the dataframe prior to the aggregation. There are however a significant number of NaNs in some of the columns that are included in the groupby (like Balancing Authority Name and FuelCategory). While I think this should mean that those facilities aren't included in the groupby, it's probably still not good.

m-jamieson commented 3 months ago

As suspected, when the electricity amounts are merged back with the aggregated flow amounts here, there are a bunch of missing electricity amounts in the resulting database_f3. Which suggests that the electricity amounts aren't being summed correctly in calculate_electricity_by_source. And this occurs for mutliple data source combinations:

NEI_ampd_ap42_eGRID
NEI_ap42_eGRID
ampd_ap42_eGRID
NEI_RCRAInfo_TRI_ampd_ap42_eGRID_eia_netl
NEI_TRI

m-jamieson commented 3 months ago

Never mind it's a year mis-match problem. For example: Arizona Public Service Company - BIOMASS - Carbon dioxide has source string "NEI_ampd_ap42_eGRID" and year 2022. The electricity dataframe has a matching source string "NEI_ampd_ap42_eGRID" but it's for year 2021.

Just to describe what's supposed to be happening here. There are two biomass plants reporting CO2 emissions in the Arizona Public Service Company BA in the year 2022. We can easily sum these. The electricity that's supposed to be assigned to this aggregation is from all plants that reported to the same emission sources. In this case, these two plants report to NEI, ampd, ap42, and eGRID - or at least that's what the fetched data suggests. But there are very few years where plants will be reporting to all the same data sources in the same year. This very likely should be source string "ampd_ap42" for year 2022 and "NEI" for year 2021 or something like that. calculate_electricity_by_source likely needs to be tweaked to differentiate data reporting by year or fix how it's done now.

m-jamieson commented 3 months ago

I think this commit created the current issue: https://github.com/KeyLogicLCA/ElectricityLCI/commit/946463a25eb25ce583707d187523c8c5548b0680

We groupby a different set of columns "grouby_cols" than before ["FlowName","Compartment"] because before we were trying to omit flows that were only ever reported by one source...it doesn't really matter what year the data comes from at that point because there will only ever be a single year for a single data source. However, when multiple sources are involved, we need to make sure there's source grouping per year to prevent the missing electricity flows as discussed in the previous comment.

m-jamieson commented 3 months ago

After replacing the ["FlowName","Compartment"] in the groupbys below, all the blank electricity amounts for the emissions go away, leaving just the inputs as blank electricity, which are all source_list "NEI_RCRAInfo_TRI_ampd_ap42_eGRID_eia_nei" or in other words all sources. Coming into calculate_electricity_by_source the source for all the fuel inputs is "eia", water inputs are "netl", and heat inputs are "eGRID." I think the original idea here is that we'd sum up electricity from all sources to divide fuel inputs by electricity from all plants in the matching year. I think the drop_duplicates happening here needs to have "Year" added to the list ["eGRID_ID"], so ["eGRID_ID","Year"] to make sure that all years of all sources are calculated and available to join to the fuel inputs later on. I'm not 100% satisfied with this because I think the "source" for the fuel inputs should say "eia" once they make it into openLCA.

            source_df = pd.DataFrame(
                db_multiple_sources.groupby(groupby_cols)[
                    ["Source"]].apply(combine_source_lambda),
                columns=["source_list"],
            )
            source_df[["source_list", "source_string"]] = pd.DataFrame(
                source_df["source_list"].values.tolist(),
                index=source_df.index
            )
            source_df.reset_index(inplace=True)
            db_multiple_sources.drop(
                columns=["source_list", "source_string"], inplace=True
            )
            old_index = db_multiple_sources.index
            db_multiple_sources = db_multiple_sources.merge(
                right=source_df,
                left_on=groupby_cols,
                right_on=groupby_cols,
                how="left",
            )

m-jamieson commented 3 months ago

Was able to do a test run with the fix above, and it worked as intended - fuel flows are again getting electricity matches, but the facility counts aren't lining up. From an emissions perspective, this is an expected part of the method - the number of reported emissions of a species can be less than the number of facilities reporting electricity generation. For fuel inputs though, this means that resulting plant processes are being more efficient than they should be. In the test run that I did PJM GAS has electricity summed for 170 facilities, but gas inputs from Appalachian and Arkoma basins only sum to 143. Guess this comes down to not having entries in gas_supply_basin_mapping.csv.

As best as I can tell, there are electricity flows for all inputs coming into aggregate, so maybe the answer here is to not separate between "power" and "nonpower" and just run the electricity summing for all entries.

m-jamieson commented 3 months ago

Tried the test mentioned in the previous comment and am still getting large mismatches in facility counts due to the lack of matching basin matching data. Power plant construction amounts are listed as eia data and are automatically generated, so there ends up being more of them.

I guess some one solution here are to make special labels for fuel amounts like "eia_fuel" or change construction or both/all. I think adding different labels for the different data sources - like "netlgas" as the source for the upstream gas inventory makes sense and will be clearer in the resulting openLCA metadata. Hard part will be finding all the holes where maybe source wasn't explicitly defined because there are a couple places where Source is defined across combinations of dataframes

USEPA / ElectricityLCI

Who's putting NaNs in my exchange tables? #246