Map EPA unit_id to EIA generator_id

grgmiller commented 4 years ago

Issue Context

Within the CEMS data for each year, there are on the order of ~1 million observations where the co2_mass_tons observation is missing, even when the generator is reporting an operating time >0 and heat input >0. For my research, I am planning to fill in these missing values based on a calculation and add a "calculated" value to the co2_mass_measurement_code column. I plan to calculate these missing values by multiplying the heat_content_mmbtu column by the fuel-specific emission factor for the fuel used by that generator. To do that, I need to match each unit in cems with its corresponding fuel type reported in boiler_fuel_eia923.csv

Question 1: What is the proper mapping between EPA plant/units and EIA plant/units?

This is my current understanding, which may not be correct:

EPA plant_id_eia maps to EIA plant_id_eia
EPA unitid maps to EIA boiler_id
EPA facility_id -> not sure what this maps to, if anything
EPA unit_id_epa -> not sure what this maps to, if anything

Basically, is it correct to match the CEMS unitid column to the EIA boiler_id column in boiler-fuel_eia923.csv or is there some other mapping I need to complete first?

Question 2: Harmonizing unit IDs between EPA and EIA

After attempting to merge the fuel_type_code column from boiler_fuel_eia923 into my epacems data, I am still finding that it is unable to to find a matching boiler_id key for many observations. When investigating further, I found that it seems that these ids have not yet been standardized. For example, for plant_id 10378, epacems lists the unitid as BLR02B where EIA lists the boiler_id as simply 2B.

It seems that based on https://github.com/catalyst-cooperative/pudl/issues/178 the ORISPL (plant_id_eia) codes have been harmonized between the two datasets, but I am wondering if the unitid column from cems has been harmonized with the boiler_id column from EIA?

zaneselvans commented 4 years ago

Okay I'm going to shamelessly rope @karldw in here since he's worked more with the epacems data than we have. However, I am pretty sure that unitid in the EPA CEMS does not map (simply) to boiler_id in the EIA 923. IIRC, the EPA CEMS data is reported not reported on the basis of boilers or generators or generation units, but on the basis of something approximating a "smokestack" -- a collection of fuel-consuming things that share a single emissions conduit, where the emissions sensor is installed. I'm sure in many cases this is in fact a single boiler, and there's a simple connection that could be made between the unitid and the boiler_id but as you've noticed, they don't use the same codes, so it's not totally straightforward to connect them.

Performing this record linkage is one of the main tasks that we have on our plate with respect to EPA CEMS right now, if you're interested in helping out! I believe there may be some useful but as of yet not integrated information related to the connections between smokestacks, boilers, and generators lurking in the EIA 860 Environmental Associations spreadsheets. With the integration of the EIA861 data we're trying to redesign the way we do ETL on collections of spreadsheets to make it cleaner and more generalizable, with the intention of backporting that process into the EIA 923 and EIA 860 ETL, in which case I think we would expand the data we're pulling to include absolutely all of the tables included in those data.

I believe the facility_id is an EPA assigned ID, from a database that applies to many different types of facilities tracked by the agency. I think @gschivley posted something a while ago referring to an EPA ID reconciling tool? Maybe it also includes information on the boiler-generator-plant-unit-facility mappings?

I am legit confused as to what the unit_id_epa refers to, and the metadata we have on file says that it's assigned by EIA... which seems like it's probably incorrect.

Also, are you actually loading the CSV files directly (e.g. into dataframes off of disk) or are you using them to populate a database (e.g. with datapkg_to_sqlite) and then accessing the data using the PudlTabl output objects? Are you working in Python or another platform?

grgmiller commented 4 years ago

Thanks, I look forward to what @karldw and @gschivley have to say about this mapping. Since this is something that it looks like I may need to reconcile for my research, I'd potentially be interested in helping figure this out.

I am actually loading the CSVs directly into dataframes in python, but with a few tricks in place to manage the memory use of the dataframe. Basically my steps have been to:

Merge each of the state csv.gz files into a single dataframe, but only using specific fields relevant to my research questions fields = ['plant_id_eia', 'unitid', 'operating_datetime_utc', 'operating_time_hours', 'gross_load_mw', 'co2_mass_tons', 'co2_mass_measurement_code', 'heat_content_mmbtu']
Removing all rows where 'operating_time_hours equals zero, since basically this is when the plant was not operating so these observations are not useful

specifying the dtypes for each column on loading the csv to most efficiently use memory. Basically:

plant_id_eia                              uint32
unitid                                    object
operating_datetime_utc       datetime64[ns, UTC]
operating_time_hours                     float16
gross_load_mw                            float32
co2_mass_tons                            float32
co2_mass_measurement_code               category
heat_content_mmbtu                       float32

By using these steps, I'm able to get an entire year's worth of CEMS data for the entire country into a dataframe <500MB, which seems manageable for my computer with 32GB of RAM. I am less familiar with using SQL queries in python which is why I've stuck with dataframes for now, but if there is an advantage to using the sqlite db for this type of work I'm open to switching.

gschivley commented 4 years ago

I believe the facility_id is an EPA assigned ID, from a database that applies to many different types of facilities tracked by the agency. I think @gschivley posted something a while ago referring to an EPA ID reconciling tool? Maybe it also includes information on the boiler-generator-plant-unit-facility mappings?

Oh man, I can't think of what I (might?) have been talking about. The plain english guide to Rule 75 is a good resource if you haven't looked at it yet. You might also contact someone at EPA Clean Air Markets with specific questions or to make sure your assumptions are correct.

My philosophy on combining EIA/CEMS data is that you'll never get to 100% accuracy. But given the measurement errors (and possible bias) in the CEMS data, that's probably ok.

karldw commented 4 years ago

Hey @grgmiller, the easiest way to handle this is to avoid the whole issue by just aggregating to plants whenever possible. I don't know if that works for your project. You might also look at eGRID, which has some of these numbers on an annual basis.

See the notes in #360 on the variable names. Note that unitid maps to boiler_id a lot of the time, but not universally. Also, the two parts of a combined cycle natural gas unit may be reported unintuitively. (As @gschivley mentioned https://github.com/catalyst-cooperative/pudl/issues/245#issuecomment-449440776.) The ORISPL (plant_id_eia) codes map almost all the time, but #178 is still an open issue for the remaining few.

Also, I'd strongly recommend using the parquet files PUDL can generate for you, rather than dealing with CSVs. Parquet files are faster, have types, and play nicely with tools like dask.

@zaneselvans, I agree that the metadata on unit_id_epa needs tweaking if it still says EIA. See #360.

One last thing, it might be worth taking a look at the identifiers provided by EPA FRS. See the conversation in #178 for details.

zaneselvans commented 4 years ago

@grgmiller Maybe you've already looked at this but, if you need to load a substantial amount of the CEMS data into dataframes for analysis, I you might want to consider converting the datapackage output into an Apache Parquet dataset (there's a script bundled with PUDL named epacems_to_parquet that does this -- converting the whole dataset all years for all states takes like ~1hr on my laptop). The resulting dataset is partitioned on disk by state and year, and you can pretty easily select whatever subset of them makes sense for your work by using the read_parquet() method in either pandas or Dask dataframes. If you really need to operate on more data than you have memory, Dask will allow you to build up a queue of tasks, and then manage their serial execution for you -- ideally the resulting output is itself smaller than memory so you can continue to play with it / visualize it as you normally would.

The database (and the data packages, which essentially reflect a relational database structure) are well normalized, meaning there's very little duplication of data between the tables, which is good for storage and data integrity, but not the most readable output in the world. When I'm playing with the data I almost always use the output objects and do something like...

pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine)
bf_eia923 = pudl_out.bf_eia923() # for the boiler_fuel_eia923 table

The resulting dataframes include lots of additional data about the entities being referenced (plants, generators, etc.) instead of just their IDs, which is often useful for visualization. They can also be used to aggregate non-timeseries data into monthly or annual resolution automatically. And then there's a bunch of derived stuff of interest available in the pudl_out.mcoe() results...

zaneselvans commented 4 years ago

In particular, if what you're doing depends on heat rates, the mcoe routines do their best to estimate heat rates on a per-generator basis (which really means a per generation unit basis, often including multiple boilers and generators). We still need to do the work to connect those generation unit IDs up with the EPA CEMS data plant / unit IDs though so you can compare them directly. Hopefully the two datasets give very similar estimates!

grgmiller commented 4 years ago

Thank you all for the great feedback. I am unfamiliar with parquet (never have used it before) but if anyone has recommendations on resources to read to get started with using it in python I'd appreciate it. Also, if you have any example code that shows how to query parquet data and/or also use Dask (which I've been trying to figure out), that would be helpful.

@karldw ill consider aggregating to the plant level, although I am looking at marginal generation, so one unit at a plant might be marginal when another is not.

I also listened to yesterday's webinar about the 2nd annual emPOWER air data challenge (to which I am thinking of submitting a proposal), and they mentioned that they have a non-public spreadsheet of EIA-EPA matching that they would be willing to share, and also that EPA and EIA are meeting soon to discuss standardization of their IDs. I'll see if I can share this and potentially integrate it into PUDL.

zaneselvans commented 4 years ago

This series from one of the main pandas developers is a great overview of "modern" idomatic pandas, split into several thematic posts. The last post on scaling is helpful. The Dask documentation is also good.

Dask dataframes use Pandas dataframes internally to chop up bigger datasets into chunks of work that can fit into memory sequentially (serialization) or to hand off chunks of work to different distributed workers (e.g. on a Kubernetes cluster). Then the results are stiched back together and returned. Overall the interface should be very familiar if you're used to working with Pandas.

The read_parquet() function needs to be pointed at the top level directory containing the partitioned dataset (in this case, it'd probably be parquet/epacems inside your PUDL workspace. Then you use the filters and columns parameters to tell it which years and states, and what columns, you want it to actually read. Parquet files are optimized to make this kind of query-like behavior very fast. I haven't played with it in a while but maybe @karldw can post an example here? I find the syntax of the filters parameter a little complicated.

grgmiller commented 4 years ago

Thanks, these are some good resources! I'll read through them and see if I can get my datasets up and running without relying on CSVs.

To clarify, in order to use epacems_to_parquet, do I need to:

download all years of epacems,
run ETL,
and then run epacems_to_parquet?

It looks like the epacems_to_parquet requires a datapackage.json file as an input, and it also seems to require that the datapackage contains ALL states and ALL years of epacems. When I try and run epacems_to_parquet on a datapackage I created with just 2018 CEMS data for all states, I get the following error:

(pudl) C:\Users\gmiller7\Box\PUDL>epacems_to_parquet datapkg/2018_cems_data/epacems-eia/datapackage.json
2020-02-06 18:00:03 [    INFO] pudl:319 Checking for required EPA CEMS input files...
Traceback (most recent call last):
  File "C:\Users\gmiller7\anaconda3\envs\pudl\Scripts\epacems_to_parquet-script.py", line 9, in <module>
    sys.exit(main())
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\convert\epacems_to_parquet.py", line 326, in main
    pudl_settings=pudl_settings,
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\helpers.py", line 816, in verify_input_files
    raise FileNotFoundError("\n".join(err_msg))
FileNotFoundError: Missing data files for the following sources and years:
  EPA CEMS:
    Years:  2005, 2002, 1998, 2009, 2008, 2000, 2004, 2007, 1996, 2001, 1999, 2010, 1997, 1995, 2003, 2006
    States: MA, IN, NH, IA, PA, SC, CO, WI, WY, DE, VT, ID, NY, WV, ND, GA, KS, SD, NC, OH, AZ, CT, CA, MN, OK, UT, ME, OR, MS, MO, MI, NJ, VA, IL, RI, MT, DC, NM, TX, NV, TN, WA, AR, AL, KY, MD, FL, NE, LA

This is my datapackage.json file: https://ucdavis.box.com/s/0y7zzr3s7kdrfvlyi2b20ibm2a3pngc9

zaneselvans commented 4 years ago

Run epacems_to_parquet --help You can tell it to just use some of the years / states if you want. But if you're going to go to all the trouble of running the ETL, why no do everything :)

Also I am (I swear to god) pushing the data release tomorrow, and it will include all of the CEMS data, post ETL, and you can just download it and run the conversion script and it'll take like an hour.

karldw commented 4 years ago

@zaneselvans, I find it helpful to think of the filters option as operating only on file paths, without opening the files at all. So you can ask dask to load data from 2017 and 2018 with code like this:

import dask.dataframe as dd

cems_path = "pudl/parquet/epacems"

df = dd.read_parquet(cems_path, filters=[('year', '>', '2016')])

filters is nice because you can totally skip loading a file, but it's inflexible, and requires having your data partitioned in a way that aligns with the filter you want to apply. Another option, which will still be faster than reading the CSVs, is to "read" the whole dataset, e.g. df = dd.read_parquet(cems_path). Then you can do normal, pandas-style filtering. Dask only loads the data when it's actually required, and has an easy time parallelizing filters.

For example, you can't use filters to filter on operating_time_hours, but this works:

df = dd.read_parquet(cems_path).query("operating_time_hours > 0")

zaneselvans commented 4 years ago

omg I had no idea you could do that.

grgmiller commented 4 years ago

Thanks @zaneselvans and @karldw . I just downloaded all the epacems files from the new data release and successfully converted to parquet, which I was able to load into a dask dataframe using the filtering and querying tips you outlined.

Still waiting to hear back from EPA about their ID crosswalk document, but as soon as I have that and evaluate any gaps, I'll post the crosswalk info.

gschivley commented 4 years ago

I just got a crosswalk from someone, will take a look and let you know.

On Mon, Feb 10, 2020 at 4:03 PM Greg Miller notifications@github.com wrote:

Thanks @zaneselvans https://github.com/zaneselvans and @karldw https://github.com/karldw . I just downloaded all the epacems files from the new data release and successfully converted to parquet, which I was able to load into a dask dataframe using the filtering and querying tips you outlined.

Still waiting to hear back from EPA about their ID crosswalk document, but as soon as I have that and evaluate any gaps, I'll post the crosswalk info.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/catalyst-cooperative/pudl/issues/535?email_source=notifications&email_token=ACPERVHQ6ML65KPA6Q7WWKDRCG6LZA5CNFSM4KPO3HL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELKIKHA#issuecomment-584353052, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACPERVC7RZR7S3W3RPXFCPDRCG6LZANCNFSM4KPO3HLQ .

gschivley commented 4 years ago

Jeff Quick from the Utah Geological Survey sent me his mapping of EIA/EPA plant codes. Looks like he didn't do a sub-facility level mapping but it might still be useful to someone. I turned the excel file into a Google Sheet, can forward the original email if anyone would like.

Jeff has also published some papers analyzing potential bias in CEMS CO2 emissions. Another source that's worth a read for background knowledge.

grgmiller commented 4 years ago

Hi all, I heard back from my contact at EPA and she shared the following excel file that they use for matching units across EPA and EIA data. She said that it was fine to share, and that they actually plan on publishing a final version sometime soon. CAMD EIA unit crosswalk 2018.xlsx

My contact, Justine Huetteman, did ask that the spreadsheet be cited as: United States Environmental Protection Agency (EPA). “Power Sector Emissions Data: EPA-EIA Crosswalk.” Washington, DC: Office of Atmospheric Programs, Clean Air Markets Division.

This is how I have interpreted the column headers in the spreadsheet:

ORIS Code refers to the EPA's plant id code
EIA ORIS refers to the EIA's Plant ID code (what PUDL calls plant_id_eia). This column is only filled in if the EIA's plant id differs from the EPA's plant id
Unit ID refers to the EPA's unitid
Generator ID refers to the EIA's Generator Id.
Boiler ID refers to the EIA's Boiler Id

Missing data notes:

It looks like all but 71 EPA unitids have been matched to EIA generator ids
Many of the boilerids have not been matched in this crosswalk
This crosswalk does not include any information about the EIA's unit code identifier.

There's also a lot of great notes/caveats about the matching in the Notes column.

Any thoughts on whether this would be incorporated into pudl.glue, and whether there might be an easy way to fill in the boiler_id column using the existing BGA data?

zaneselvans commented 3 years ago

Recent developments:

EPA Power Sector Data Crosswalk was published
See their repo here: https://github.com/USEPA/camd-eia-crosswalk

zaneselvans commented 2 years ago

@TrentonBush this issue / discussion seems related to your current work.

zaneselvans commented 1 year ago

It's still not perfect, but I think the integration of the EPA CAMD to EIA crosswalk into PUDL and our EPA CEMS outputs has gone a long way towards addressing this issue. I'm sure there will be additional iterations, probably with @TrentonBush and @aesharpe involved.

catalyst-cooperative / pudl