USEPA / ElectricityLCI

Creative Commons Zero v1.0 Universal

Assess replacement of generation.py with alt_generation.py #64

Closed WesIngwersen closed 4 years ago

WesIngwersen commented 4 years ago

The goal of this assessment is to see whether alt_generation.py can replace the old generation.py.

-Test the old generation-based EFs from generation.py against those for the same model with alt_generation. Also compare against output from an earlier code version. See if any differences exist.

-Evaluate how the uncertainty function can be moved over so that it represents the uncertainty method in generation.py along with the new method in alt_generation.

-Determine if there are any other output differences to be addressed.

m-jamieson commented 4 years ago

Will likely need to do this by modifying combined_build.py. Main.py has too many checks to force down one path or another.

I'll also offer up some thoughts on the uncertainty generation. Mostly out of my own ignorance, I opted to use np.log to transform the data rather than using the existing equation to estimate the geometric mean. As a result, many species don't get uncertainty under alt_generation because of the errors generated by log 0. I would be interested in replacing all zeros with some small number (e.g., 1E-15) to reduce this. Another thing I changed: from what I could tell, the 90% confidence interval was previously calculated using the non-transformed values. I think the more correct approach is to calculate the confidence interval using the log of the values, since the log-transformed values are what follow the normal distribution that the t-function assumes. I imagine this wasn't done before because of the existence of zeros.
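A minimal sketch of the two ideas in this comment, not the code in alt_generation.py. The helper name `log_interval` and the 1e-15 default floor are assumptions, the interval is taken to be one for the mean of the logs, and the default critical value uses the standard-library normal distribution as a stand-in for the t-function (pass a t value through `crit` for small samples).

```python
import math
from statistics import NormalDist, mean, stdev

def log_interval(values, floor=1e-15, crit=None):
    """Return (geometric mean, lower, upper) for a 90% interval built on logs.

    Values below `floor` (including zeros) are replaced with `floor` so that
    math.log does not fail. `crit` is the two-sided 90% critical value; the
    default is the normal approximation (about 1.645).
    """
    if crit is None:
        crit = NormalDist().inv_cdf(0.95)
    logs = [math.log(max(v, floor)) for v in values]
    mu = mean(logs)
    # Half-width of the interval for the mean, computed in log space.
    half_width = crit * stdev(logs) / math.sqrt(len(logs))
    return math.exp(mu), math.exp(mu - half_width), math.exp(mu + half_width)

gm, lower, upper = log_interval([0.0, 2.0, 4.0, 8.0])
```

Note that the floor is itself a modelling choice: with 1e-15, a single zero pulls the geometric mean down by many orders of magnitude.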

WesIngwersen commented 4 years ago

@jump2conclusionsmatt @TJTapajyoti When I try running get_generation_process_df() without specifying upstream (because it doesn't make sense) for model 3 with use_alt_gen_process: True, it seems like I get directed to create_ba_region_map via the get_alternate_gen_plus_netl() function (see error below). I don't think we want to use get_alternate_gen_plus_netl() in this case. I really think the route through get_alternate_gen_plus_netl() should be an option in the config file, with a param like

include_construction_impacts_for_renewables_in_generation

KeyError: 'eGRID'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\wesle\AppData\Local\Programs\Python\Python37\lib\site-packages\IPython\core\interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    all_generation_db = electricitylci.get_generation_process_df()
  File "C:\Users\wesle\ElectricityLCI\electricitylci\__init__.py", line 79, in get_generation_process_df
    gen_df = get_alternate_gen_plus_netl()
  File "C:\Users\wesle\ElectricityLCI\electricitylci\__init__.py", line 401, in get_alternate_gen_plus_netl
    hydro_df = hydro.generate_hydro_emissions()
  File "C:\Users\wesle\ElectricityLCI\electricitylci\hydro_upstream.py", line 51, in generate_hydro_emissions
    eia860_df=eia860_balancing_authority(2016)
  File "C:\Users\wesle\ElectricityLCI\electricitylci\eia860_facilities.py", line 157, in eia860_balancing_authority
    region_map = create_ba_region_map(region_col=regional_aggregation)
  File "C:\Users\wesle\ElectricityLCI\electricitylci\utils.py", line 81, in create_ba_region_map
    f'regional_col value is {region_col}, but should match "ferc_region" '
TypeError: exceptions must derive from BaseException

tjlca commented 4 years ago

db1 = get_generation_process_df()
A kwarg named 'upstream_dict' must be included if use_alt_gen_process is True
Generating inventories for geothermal, solar, wind, hydro, and solar thermal...
Loading 2017 EIA-923 data from csv file
Loading 2016 EIA-860 plant data from csv file
Traceback (most recent call last):
  File "", line 1, in
    db1 = get_generation_process_df()
  File "C:/Users/ghosh.117/Google Drive/box/Research_compile/electricitylci_old/ElectricityLCI-master/elci_new/ElectricityLCI-master/electricitylci/__init__.py", line 83, in get_generation_process_df
    gen_df = get_alternate_gen_plus_netl()
  File "C:/Users/ghosh.117/Google Drive/box/Research_compile/electricitylci_old/ElectricityLCI-master/elci_new/ElectricityLCI-master/electricitylci/__init__.py", line 421, in get_alternate_gen_plus_netl
    hydro_df = hydro.generate_hydro_emissions()
  File "C:\Users\ghosh.117\Google Drive\box\Research_compile\electricitylci_old\ElectricityLCI-master\elci_new\ElectricityLCI-master\electricitylci\hydro_upstream.py", line 51, in generate_hydro_emissions
    eia860_df=eia860_balancing_authority(2016)
  File "C:\Users\ghosh.117\Google Drive\box\Research_compile\electricitylci_old\ElectricityLCI-master\elci_new\ElectricityLCI-master\electricitylci\eia860_facilities.py", line 157, in eia860_balancing_authority
    region_map = create_ba_region_map(region_col=regional_aggregation)
  File "C:\Users\ghosh.117\Google Drive\box\Research_compile\electricitylci_old\ElectricityLCI-master\elci_new\ElectricityLCI-master\electricitylci\utils.py", line 79, in create_ba_region_map
    f'regional_col value is {region_col}, but should match "ferc_region" '
TypeError: exceptions must derive from BaseException
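The final TypeError in both tracebacks is what Python 3 raises when a `raise` statement is given something that is not an exception, such as a bare string. The failing line shown in utils.py is an f-string, which is consistent with that. The sketch below reproduces the behaviour and shows one possible fix; the function names are hypothetical and the message text is only the fragment visible in the traceback.

```python
def check_region_broken(region_col):
    # Raising a plain string is not allowed in Python 3 and triggers
    # "TypeError: exceptions must derive from BaseException".
    if region_col != "ferc_region":
        raise (f'regional_col value is {region_col}, but should match "ferc_region"')

def check_region_fixed(region_col):
    # Wrapping the message in an exception class surfaces the intended error.
    if region_col != "ferc_region":
        raise ValueError(
            f'regional_col value is {region_col}, but should match "ferc_region"'
        )
```

With the fixed version, the caller sees the message about the unexpected region column instead of a TypeError that hides it.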

m-jamieson commented 4 years ago

So I think I'm pretty close to having something working in the new branch. I think it would largely work now, but right now the configuration is using primary_fuel for coal and also eGRID data for the generation mix, which from what I can tell can't support using the primary fuel for coal because it doesn't exist in egrid_subregion_generation_by_fuelcategory_reference_2016. So basically what happens is that the generation mixer is looking for "COAL - SRMV" but only finding "BIT - SRMV" or whatever may be the case. Does this sound at all familiar to you two?

WesIngwersen commented 4 years ago

@jump2conclusionsmatt As I recall, we more or less dropped support for using the primary fuel for coal. The final versions that we used for export into openLCA just aggregate for 'COAL'. The simple option is to just change that parameter value (and potentially remove it).

WesIngwersen commented 4 years ago

@jump2conclusionsmatt Right now it's still trying to get the upstream data for model 3, even though include_upstream_processes: False and include_renewable_generation: False. This happens through the import of emissions_other_source in alt_generation, which itself imports coal_upstream and, upon import, executes generate_upstream_coal(eia_gen_year).
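One common way to avoid this kind of import-time side effect is to move the expensive call into a cached function that only runs when the config asks for it. The sketch below is illustrative only: `get_upstream_coal`, `build_inventory`, and `CALLS` are hypothetical names, and the returned dictionary is a placeholder for whatever generate_upstream_coal produces.

```python
from functools import lru_cache

CALLS = []  # records each time the expensive step actually runs

@lru_cache(maxsize=None)
def get_upstream_coal(year):
    """Stand-in for an expensive step such as generate_upstream_coal(year)."""
    CALLS.append(year)
    return {"year": year, "rows": []}  # placeholder result

def build_inventory(include_upstream_processes, year=2016):
    # The expensive call happens only when the config flag asks for it,
    # rather than as a side effect of importing the module.
    if include_upstream_processes:
        return get_upstream_coal(year)
    return None
```

Because of the cache, repeated calls with the same year reuse the first result, which keeps the convenience of a module-level value without paying for it on import.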

m-jamieson commented 4 years ago

So it does do that. I think those are artifacts from very early discussions with Schivley about how we were going to integrate the upstream emissions. This should be fixed, even though I don't think it was ever getting to the point of actually adding the upstream emissions. Also, it should only have gotten into that module if replace_egrid is True. Is that something you were running?

Changed the primary fuel for coal to False and generated results. They appear to me to be good.

I think that branch is now capable of producing eGRID results using generation.py and alt_generation.py.

Let me know if you would like some more help comparing results across the two different modules. There's just so much to check. I did a quick check of CAMX - there are 10 or so emissions that differ by more than 5% for the regular analysis. Once you go into Monte Carlo, there are even more >5% changes of the mean. I guess this should be expected given the modifications to the calculation. A couple of SQL queries should generate the distributions for all flows in both databases pretty easily. Otherwise, we're looking at using combined_build and build to grab dataframes.
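As a rough illustration of the 5% check described above, the sketch below compares two sets of emission factors held in plain dictionaries. The function name, the key layout (region, fuel, flow), and the sample numbers are all made up for the example; the real comparison would run against the dataframes or databases mentioned in the comment.

```python
def flag_differences(ef_old, ef_new, threshold=0.05):
    """Return keys whose relative difference exceeds `threshold`.

    Both arguments map a (region, fuel, flow) key to an emission factor.
    Keys present in only one mapping are always flagged with None.
    """
    flagged = {}
    for key in sorted(set(ef_old) | set(ef_new)):
        old, new = ef_old.get(key), ef_new.get(key)
        if old is None or new is None:
            flagged[key] = None  # missing on one side (e.g. a discarded waste flow)
        elif old == 0:
            if new != 0:
                flagged[key] = float("inf")
        elif abs(new - old) / abs(old) > threshold:
            flagged[key] = (new - old) / old
    return flagged

old_efs = {("CAMX", "SOLAR", "CO2"): 1.00, ("CAMX", "GAS", "NOx"): 2.00}
new_efs = {
    ("CAMX", "SOLAR", "CO2"): 1.10,
    ("CAMX", "GAS", "NOx"): 2.02,
    ("CAMX", "GAS", "Waste"): 3.00,
}
flagged = flag_differences(old_efs, new_efs)
```

Flagging keys that exist on only one side also catches structural differences, such as flows kept by one module and dropped by the other.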

One kind of major thing that I noticed is that the waste flows are being kept in alt_gen but discarded in generation.py. Not sure if this was intentional.

m-jamieson commented 4 years ago

I'm continuing to look at some things here - I was comparing the Monte Carlo results as a way to check the uncertainty differences between generation.py and alt_generation.py, and found that in some cases, the final emission factor was way bigger than the calculated 95th percentile for generation (see CAMX, solar, carbon dioxide for an example) because an outlier is orders of magnitude greater than the rest. This resulted in some weird Monte Carlo results (like 10^16 kg CO2e/MWh) for CAMX consumption mix at user.

Since it's been quiet here, I'm assuming there hasn't been much progress elsewhere. I'm going to write a script to generate aggregate emissions using generation and alt_generation and then pull the raw data for eGRID/tech/flow combos that have significant differences between the emission factor or differences in generated lognormal distributions.

m-jamieson commented 4 years ago

I've fixed an issue in the branch where alt_generation was replacing the eGRID electricity with EIA923 data despite replace_egrid being False. With both generation and alt_generation using the same electricity, only 75 or so emission factors (out of 7,700) differ by more than 5%. As far as I can tell, in all these cases the differences arise from a duplicates issue I previously brought up in an email - I think the latest email was on 8/6, "Question on electricityLCI". Alt_generation does not remove duplicates based on FlowAmount, FuelCategory, Subregion, etc. - thereby assuming that duplicates are true. Anyhow, this has two effects. In some cases, the numerators are different between generation and alt because a facility's emissions have been removed. This can also affect denominators - if a facility is completely removed because it is assumed to be a duplicate, it is also removed from the list of generators for the denominator.

Anyhow, I'm pretty confident in the emission factors being the same between the two methods. I'm inclined to keep the "duplicates" because as stated in the email, the boundary at which you apply the search for duplicates is pretty arbitrary (FERC, NERC, eGRID, Balancing Authority, etc.).
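A toy example of how removing presumed duplicates can change both the numerator and the denominator of an emission factor. It is deliberately simplified: records are matched on flow amount alone within a single region and fuel, and the function name and numbers are invented for illustration rather than taken from either module.

```python
def emission_factor(records, drop_duplicates):
    """Emission factor = total emissions / total generation over facilities.

    Each record is (facility_id, flow_amount, generation_mwh).
    When drop_duplicates is True, a record whose flow_amount matches an
    earlier record is treated as a duplicate and removed entirely.
    """
    seen, kept = set(), []
    for facility_id, flow_amount, generation in records:
        if drop_duplicates and flow_amount in seen:
            continue
        seen.add(flow_amount)
        kept.append((facility_id, flow_amount, generation))
    total_emissions = sum(r[1] for r in kept)
    total_generation = sum(r[2] for r in kept)
    return total_emissions / total_generation

records = [("A", 10.0, 100.0), ("B", 10.0, 150.0), ("C", 30.0, 100.0)]
keep_all = emission_factor(records, drop_duplicates=False)     # 50 / 350
deduplicated = emission_factor(records, drop_duplicates=True)  # 40 / 200
```

Facility B reports the same flow amount as facility A, so dropping it removes 10 units of emissions and 150 MWh of generation, and the emission factor moves from about 0.14 to 0.20.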

Then there's uncertainty. In alt_generation, the geometric mean is set equal to the final calculated emission factor. This is different from generation, which appears to calculate the geometric mean such that the emission factor is the arithmetic mean of the lognormal distribution. I think I used to have a better defense of this, but as I write this, I think that's gone. Having the mean of a simulation be the same as the emission factor is probably the right answer and would look more like what the actual distribution is. I'll make the change.

WesIngwersen commented 4 years ago

Regarding uncertainty, I can say confidently from experience that the geometric mean, and not the arithmetic mean, should be used as the mean of a lognormal distribution. The openLCA schema does have a field geomMean for representing this separately from the Exchange amount field.

tjlca commented 4 years ago

I re-checked the generation.py uncertainty aggregator calculation.

Just to confirm - the mean of the lognormal distribution is not the arithmetic emission factor. It's calculated from the EF and the standard deviations calculated from the 90% confidence intervals. As Wes mentioned, openLCA does not have an arithmetic mean, only a geometric mean. Once we had the values, we tested the simulation in openLCA. During the simulation, openLCA converts the GM to the AM and shows a distribution. The simulated means were pretty close to the AM, i.e. the EFs.
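For reference, these are the standard lognormal relations behind this discussion, where μ and σ denote the mean and standard deviation of the log-values (symbols introduced here, not used in the thread):

$$
\mathrm{GM} = e^{\mu}, \qquad \mathrm{GSD} = e^{\sigma}, \qquad \mathrm{AM} = e^{\mu + \sigma^{2}/2} = \mathrm{GM}\cdot e^{\sigma^{2}/2}
$$

For any σ > 0 the arithmetic mean exceeds the geometric mean, so setting the geometric mean equal to the emission factor makes the mean of a simulation larger than the emission factor.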

Thanks, TJ


From: Matt Jamieson
Sent: Thursday, October 31, 2019 3:00 PM
To: USEPA/ElectricityLCI
Cc: Tapajyoti Ghosh; Mention
Subject: Re: [USEPA/ElectricityLCI] Assess replacement of generation.py with alt_generation.py (#64)


m-jamieson commented 4 years ago

I think we're on the same page. alt_generation.py has been changed to be consistent with generation.py, with the exception of the final returned values. I think the values being returned before weren't quite correct, but they still somehow ended up providing distributions that were pretty close to the targets (i.e., sample arithmetic mean equal to the emission factor and 95th percentile equal to the upper bound of the 90% confidence interval). It has to do with translating the calculated sigma into the geometric standard deviation. I did test the final implemented equations analytically as well as in openLCA.

There's one other consequence of the approach: alt_generation.py doesn't generate a distribution for a lot of flows that generation.py does. In alt_generation, I run a check to see if the emission factor is greater than the 90% confidence interval (caused by significant outliers); if it is, nothing is returned. generation.py has no such check and as a consequence does find solutions to the quadratic, but I don't think those solutions are meaningful. In some cases it is a shame to lose the uncertainty characterization, though, because some of these flows have a larger number of samples. I'm not really sure of solutions. The geometric mean of the emission factors could be used, but that would sacrifice getting the emission factor as the arithmetic mean of the simulation. Alternatively, the outliers could be removed?
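The equations used in the branch are not shown in this thread, so the following is only one derivation consistent with the targets described: arithmetic mean equal to the emission factor and 95th percentile equal to the upper bound of the 90% interval. It assumes a normal z-value rather than a t-value, and the function name is hypothetical. The two targets lead to a quadratic in sigma, and the check mirrors the one described for alt_generation.

```python
import math

Z_95 = 1.6448536269514722  # standard normal 95th percentile

def lognormal_params(emission_factor, upper_90):
    """Return (geometric mean, geometric standard deviation), or None.

    Targets: arithmetic mean == emission_factor and
    95th percentile == upper_90 (upper bound of the 90% interval).
    With mu, sigma the log-scale parameters these give
        ln(EF)  = mu + sigma**2 / 2
        ln(P95) = mu + Z_95 * sigma
    so sigma solves sigma**2 / 2 - Z_95 * sigma + ln(P95 / EF) = 0.
    """
    if emission_factor <= 0 or upper_90 <= emission_factor:
        return None  # outlier-driven case: no meaningful distribution
    discriminant = Z_95 ** 2 - 2.0 * math.log(upper_90 / emission_factor)
    if discriminant < 0:
        return None  # targets too far apart for any lognormal
    sigma = Z_95 - math.sqrt(discriminant)  # smaller of the two roots
    mu = math.log(emission_factor) - sigma ** 2 / 2.0
    # The geometric standard deviation is exp(sigma), not sigma itself.
    return math.exp(mu), math.exp(sigma)
```

Without the first check, an emission factor above the upper bound still yields real roots (one negative, one larger than twice Z_95), which matches the observation that solutions exist but are not meaningful.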

m-jamieson commented 4 years ago

The work of replacing generation.py with alt_generation.py has been completed in the unification branch.

I also added some checks of model_config that will raise exceptions for certain incompatible combinations. I only have a few - I'm sure there are plenty more. This may be the spot where we limit the years for some datasets, etc.
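A sketch of what such a check might look like, not the code in the unification branch. The key name `use_primaryfuel_for_coal` and the exact rule are assumptions, loosely based on the earlier observation that primary fuel for coal did not work with the eGRID generation mix.

```python
def check_model_config(config):
    """Raise ValueError for combinations of settings that cannot work together.

    The key names and the single rule below are illustrative assumptions.
    """
    problems = []
    if config.get("use_primaryfuel_for_coal") and not config.get("replace_egrid"):
        problems.append(
            "use_primaryfuel_for_coal is not supported together with the "
            "eGRID generation mix (replace_egrid: False)"
        )
    if problems:
        raise ValueError("; ".join(problems))
```

Collecting the problems in a list before raising lets further rules, such as limits on the years available for a dataset, be added without changing how errors are reported.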

WesIngwersen commented 4 years ago

@jump2conclusionsmatt Great. Will you make a pull request?