catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Add hourly CEMS #171

Closed karldw closed 5 years ago

karldw commented 6 years ago

I'm going to be using hourly CEMS for one of my projects, and I thought I might as well help add it to PUDL while I'm at it. Would that be helpful?

In #142 Zane mentioned issues with the file sizes and unzipping. The zipped files are 7.7 GB, considerably larger than all the other data sources.

Issues to consider:

zaneselvans commented 6 years ago

We're definitely still interested in including the hourly CEMS data. We haven't talked about whether to integrate it into pudl or into a different database. You were saying that the files don't need to be unzipped for reading, and given that we're probably going to be reading them in to view them anyway, leaving them zipped on disk seems best. However, it seems like the time required to load the full dataset will be substantial, and that the load will be fastest using the postgres COPY command from a CSV. So maybe we read directly from the zipped archives one state or state+year at a time, process that dataframe, and then dump it out to CSV temporarily for loading with postgres COPY? Is there any reason why doing it in batches like that wouldn't work? Do we imagine inter-state connections that we would want to check before populating the database?

We haven't tried to be particularly conscious of our memory usage thus far, and have been using dataframes containing large portions of the whole dataset. With the CEMS it seems like that probably won't work. Do you have any sense of what we should be watching out for, or how to work around memory limitations?
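For illustration, here is a minimal sketch of that batch approach, assuming we read one zipped state/year file at a time straight from the archive with pandas (the paths and the cleanup step are placeholders):

import pandas as pd

# Minimal sketch of the batch idea: read one zipped CEMS file at a time in
# chunks (pandas can read straight from the .zip), do per-chunk cleanup, and
# append the result to a plain CSV that postgres can ingest with COPY.
# File paths and the cleanup step are placeholders.
def process_zipped_cems(zip_path, csv_out, chunksize=500_000):
    reader = pd.read_csv(zip_path, compression="zip", chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # per-chunk cleaning/transforms would go here
        chunk.to_csv(csv_out, mode="w" if i == 0 else "a",
                     header=(i == 0), index=False)

Working file-by-file like this keeps only one chunk in memory at a time, which also speaks to the memory question above.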

What are the downsides of having a separate database? There are many fields in the CEMS that ought to come directly from our EIA data -- duplicating that information in more than one place seems likely to get out of sync. Other possible data quality tests:

Need to finish the datastore update & management functions for CEMS.

karldw commented 6 years ago

I think working in batches makes a lot of sense, and working that way would definitely help for memory use. I don't have a deep sense of what we should do to avoid using too much memory—working in small chunks is the only approach I know.

Do we want to use the ORISPL codes from the EIA data as a foreign key for the CEMS? If so, they apparently have to be in the same database.
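To make the foreign key idea concrete, here's a sketch of what that constraint might look like in SQLAlchemy (the table and column names are just assumptions, not the actual PUDL schema):

import sqlalchemy as sa

# Hypothetical sketch: the CEMS table's plant ID column referencing the EIA
# plants table. Table and column names are assumptions, not the real schema.
metadata = sa.MetaData()

hourly_emissions_epacems = sa.Table(
    "hourly_emissions_epacems", metadata,
    sa.Column("orispl_code", sa.Integer,
              sa.ForeignKey("plants_eia860.plant_id")),
    # ... the rest of the CEMS columns ...
)

A foreign key like this can only be enforced if both tables live in the same database, which is the constraint mentioned above.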

For the quality tests, it might be easier to load the data into postgres, then check the quality of the loaded data (particularly if we're comparing the different datasets).

karldw commented 6 years ago

Do you have work in progress on the datastore update and management? If not, I can pull that together too.

zaneselvans commented 6 years ago

I did a little bit of work on the datastore update script a while ago, but didn't finish it -- I think it was mostly having to do with the MSHA data. Whatever changes I've made I have checked in.

I haven't played with the CEMS in a while -- is the ORISPL a separate identifier from the EIA Plant ID? Is this a more widely applied identifier?

karldw commented 6 years ago

ORISPL is the same as the EIA plant ID, at least most of the time.

gschivley commented 6 years ago

In case others aren't aware, EPA will update historical CEMS data on the FTP site. It's harder to see on the newftp.epa.gov server but sometimes files from a year or two back will be modified. For example, some 2014 Ohio files were updated in December 2017.

I've dealt with this by tracking filenames and datetime modified (obtained from the ftp server) in a separate table. The eventual plan is to update data in the main table and move the old data to a separate table so that we can track and compare historical changes. Would love to hear thoughts on how best to implement something like this.
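A rough sketch of that kind of tracking, using ftplib to ask the server for each file's modification time (the host and directory are placeholders, this assumes the server supports the MDTM command, and writing the results into a tracking table is left out):

from ftplib import FTP

# Rough sketch: collect remote modification times so files that EPA has
# revised can be re-downloaded. Host and directory are placeholders; storing
# the results in a tracking table is omitted.
def remote_mtimes(host="newftp.epa.gov",
                  directory="/dmdnload/emissions/hourly/monthly/2014"):
    ftp = FTP(host)
    ftp.login()  # anonymous login
    ftp.cwd(directory)
    mtimes = {}
    for name in ftp.nlst():
        # MDTM responses look like "213 20171207153000" (a UTC timestamp)
        resp = ftp.voidcmd("MDTM " + name)
        mtimes[name] = resp.split()[1]
    ftp.quit()
    return mtimes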

zaneselvans commented 6 years ago

@gschivley We've heard that there are more irregularities in the hourly than daily CEMS data, but one of the things we're hoping to be able to do is infer (or at least constrain) unpublished plant operational characteristics from the recorded operational patterns (ramp rates, etc.) so the hourly information is very valuable to us -- we want to be able to value the flexibility characteristics of different generators, and also observe the changes in operational patterns for supposedly "baseload" plants (which is what the WIEB interns worked on last summer).

Have you had any luck with timely downloads of the files from the FTP site? It seems painfully slow right now. Karl had some ideas about how to use a single FTP session for bulk download, rather than re-initiating it over and over again, but is that really the bottleneck? Or could one do concurrent downloads of several files at once? Could we use something like rsync over FTP? We've been burned a couple of times by long-after-the-fact revisions of the EIA spreadsheet structures, without any warning. What are your thoughts on the value of storing the historical data? What kinds of revisions do they typically make?
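On the concurrency question, here's a quick sketch of what parallel downloads might look like, with each worker thread holding its own FTP session (host, directory, and file names are placeholders):

from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP

# Hypothetical sketch of concurrent downloads: a few worker threads, each
# with its own FTP connection, pulling files in parallel. Host, directory,
# and the list of file names are placeholders.
def fetch(name, host="newftp.epa.gov",
          directory="/dmdnload/emissions/hourly/monthly/2014"):
    ftp = FTP(host)
    ftp.login()
    ftp.cwd(directory)
    with open(name, "wb") as f:
        ftp.retrbinary("RETR " + name, f.write)
    ftp.quit()

def fetch_all(names, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(fetch, names))

Whether this actually helps depends on whether the bottleneck is per-file session setup or the server's per-connection throughput.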

gschivley commented 6 years ago

@zaneselvans

Agree that the hourly data is useful for inferring characteristics. Can also be used for identifying cold starts, etc. I'm using it for predicting marginal emissions and possibly marginal generators.

Things I can answer:

Things I don't have good answers to:

zaneselvans commented 6 years ago

It sounds like WIEB is working with Stanford folks again this summer on the hourly CEMS data, and trying to create a web dashboard. A webinar on the work is posted up here:

https://www.youtube.com/watch?v=eWTH3vZJOM4

It seems like maybe we should get in touch with the interns and see if we can't all just collaborate to create a re-usable backend for it inside PUDL...

karldw commented 6 years ago

A couple more minor details:

zaneselvans commented 6 years ago

Sadly, pretty much everything is currently in imperial units (but hey, at least we have capacity in MW and not horsepower), so we probably want to leave it in imperial for now. And so many people are accustomed to these units in this space that it might not even be a good idea to migrate it all (though I'd love to).

On the COPY FROM: yes, it's all going to disk right now -- incredibly, this was far and away the fastest mechanism for getting it all into the DB that we could find. If there's a convenient way to avoid that step and go directly from memory into the DB, that would be great -- and it seems like it ought to be faster -- but going directly from the pandas dataframes was certainly not fast. Do you have another way in mind?

zaneselvans commented 6 years ago

We did briefly look at using odo to do in-memory conversion from the DataFrame to postgres directly.

gschivley commented 6 years ago

I haven't tried it but you can output the df to a StringIO object in memory and use copy from with that.

I've been trying to use the same method with sqlalchemy-postgres-copy and am not having any luck. No errors, just no data in the table.

See any obvious reason the code below wouldn't work?

import sqlalchemy
import postgres_copy  # from the sqlalchemy-postgres-copy package
from io import StringIO

meta = sqlalchemy.MetaData()
cems_unit = sqlalchemy.Table(self.full_cems_table_name, meta,
                             autoload=True, autoload_with=conn)

# Dump the dataframe to an in-memory CSV buffer...
output = StringIO()
df.to_csv(output, index=False)
# ...and hand that buffer to copy_from
# (note: output isn't rewound with output.seek(0) before copy_from reads it)
postgres_copy.copy_from(output, cems_unit,
                        self.conn, columns=tuple(df.columns),
                        format='csv', header=True, delimiter=',')
karldw commented 6 years ago

@gschivley, just what I had in mind.

There are even weirder options, like using postgres' binary copy from: https://stackoverflow.com/a/8150329. There's a ruby library for this, but I can't find a python one.

gschivley commented 6 years ago

What do you guys think of multiplying the op_time and gload_mw columns to get gload_mwh? It's easy to accidentally use gload_mw as a generation value rather than a power value.
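For concreteness, the derived column would just be a product of the two (column names as above; op_time is the fraction of the hour the unit operated):

def add_gload_mwh(df):
    # Gross generation within the hour: gross load (MW) times the fraction
    # of the hour the unit operated. Column names follow the discussion above.
    df["gload_mwh"] = df["gload_mw"] * df["op_time"]
    return df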

zaneselvans commented 6 years ago

@gschivley Did you get the StringIO thing working? Here's an example of someone getting this to work.

Thus far we've mostly been keeping just the original data in PUDL, and creating output functions in pudl/outputs.py that generate derived values like you're talking about for the MWh of generation, potentially merging several tables together to create something coherent for a user, but not well normalized for a database. But it's been an ongoing debate between myself and @cmgosnell as to whether and how much derived information we want in the database itself. We just added the boiler generator associations, which are derived, but pretty key to the whole EIA side of things working.

It sounds like postgresql views allow you to store calculated values in a way that links them back to the original data they're derived from, rather than as static entries in the DB, which seems like it might be the ideal arrangement for this kind of obviously useful derived value, but we haven't dug into it very far.
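As a rough illustration of the view idea, something like this would keep the derived MWh tied to the stored values (table and column names here are assumptions based on this thread, not a settled schema):

import sqlalchemy as sa

# Illustrative only: a view that derives gload_mwh on the fly, so the value
# always reflects the underlying stored columns. Table and column names are
# assumptions based on the discussion in this thread.
def create_cems_generation_view(engine):
    with engine.begin() as conn:
        conn.execute(sa.text("""
            CREATE OR REPLACE VIEW hourly_emissions_epacems_gen AS
            SELECT *, gload_mw * op_time AS gload_mwh
            FROM hourly_emissions_epacems;
        """))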

karldw commented 6 years ago

Currently I've set the model to include columns for CEMS heatrate and a bad_heatrate flag, but these are definitely derived. Should I take them out?

karldw commented 6 years ago

Another small thing, I think we're going to want indexes for the CEMS data, but it will be faster to copy the data into postgres unindexed, then add the index later. Do you have thoughts on how you want to structure the code for this kind of post-load processing?
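One possible shape for that post-load step: run all the COPYs against the bare table, then build the indexes at the end (the index and column names here are assumptions):

import sqlalchemy as sa

# Sketch of a post-load step: create indexes only after the bulk COPYs are
# done, since loading into an unindexed table is much faster. Index and
# column names are assumptions.
def index_epacems(engine):
    with engine.begin() as conn:
        conn.execute(sa.text(
            "CREATE INDEX IF NOT EXISTS epacems_orispl_idx "
            "ON hourly_emissions_epacems (orispl_code);"
        ))
        conn.execute(sa.text(
            "CREATE INDEX IF NOT EXISTS epacems_op_datetime_idx "
            "ON hourly_emissions_epacems (op_datetime);"
        ))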

gschivley commented 6 years ago

My concern with leaving gload_mw and op_time is that it's difficult to document what each of them means and that they need to be multiplied together to get MWh of generation within the hour. Having made this mistake myself (fortunately for something non-critical), I'm pretty sure someone else will make it again.

I gave up on getting the StringIO method to work with postgres_copy last night and instead spent several hours fighting with datetime columns in ERCOT regional load data.

karldw commented 6 years ago

I did a quick take on the StringIO method, but I think it doesn't respect the SQLAlchemy model, which is a problem.

def _csv_dump_load2(df, table_name, engine, csvdir='', keep_csv=True):
    """
    Write a dataframe to CSV and load it into postgresql using COPY FROM.

    The fastest way to load a bunch of records is using the database's native
    text file copy function. This version dumps the given dataframe to an
    in-memory CSV buffer, then loads it into the specified table using a
    sqlalchemy raw_connection.

    Args:
        df (pandas.DataFrame): The DataFrame which is to be dumped to CSV and
            loaded into the database. All DataFrame columns must have exactly
            the same names as the database fields they are meant to populate,
            and all column data types must be directly compatible with the
            database fields they are meant to populate. Do any cleanup before
            you call this function.
        table_name (str): The exact name of the database table which the
            DataFrame df is going to be used to populate. It will be used both
            to look up an SQLAlchemy table object in the PUDLBase metadata
            object, and to name the CSV file.
        engine (sqlalchemy.engine): SQLAlchemy database engine, which will be
            used to pull the CSV output into the database.
        csvdir (str): Path to the directory into which the CSV file should be
            saved, if it is being kept (otherwise nothing is written to disk).
        keep_csv (bool): True if the CSV output should be saved after the data
            has been loaded into the database; False if it should be discarded.

    Returns: None

    Drawn heavily from https://stackoverflow.com/a/44181653
    """
    import io
    import os
    import shutil

    # Create the table (but don't insert data yet). Note that to_sql builds
    # the table from the dataframe's dtypes, so it doesn't respect the
    # SQLAlchemy model -- that's the problem mentioned above.
    # TODO: if I use this function, the if_exists should probably be append
    df[:0].to_sql(table_name, engine, if_exists='replace')

    # Prepare data
    output = io.StringIO()
    df.to_csv(output, sep=",", index=False, header=False, encoding='utf8')
    output.seek(0)

    # Insert data
    connection = engine.raw_connection()
    cursor = connection.cursor()
    cursor.copy_from(output, table_name, sep=',', null='',
                     columns=tuple(df.columns))
    connection.commit()
    cursor.close()

    if keep_csv:
        # Rewind the in-memory buffer and write it out as <table_name>.csv
        csvfile = os.path.join(csvdir, table_name + '.csv')
        output.seek(0)
        with open(csvfile, 'w') as f:
            shutil.copyfileobj(output, f)
karldw commented 6 years ago

Coming back here, now that #181 is merged. Here are the remaining things, in no particular order. Let me know if you have other thoughts!

Variables and variable names:

Dates

Other

Performance

IDs

Edit: added performance, gross load, change time priorities

zaneselvans commented 6 years ago

@cmgosnell Do you want to look at integrating the harvesting of EIA Plant IDs from the CEMS data, like you've done with all of the EIA tables? Also, we need to understand what they mean by "unit id" in CEMS -- is that going to correspond to the generator_id that EIA has? We should also do some sanity checking -- for the years with overlap, we should look at each plant that exists in CEMS and its state, name, and associated unit IDs, and see if those line up with the EIA 860 values.

Also, @karldw thus far we've tried to be pretty strict about keeping the database well normalized, which would in the case of this table mean stripping out the state (since it is stored in the EIA 860 plants table, accessible by plant ID) and ensuring that there's only one date being stored in each record -- if we go with the intervals, it'd be easy to extract just the date time of the start of the interval in a postgres view, or an output dataframe. With the smaller data sources that we've been using so far, we've created output routines that return dataframes that have many useful and interesting things merged in (and calculated) for playing with in dataframeland. @cmgosnell has wanted to bring some of these things into the database, and that sounds good to me as long as they're automatically linked back to the original values rather than being calculated and then hard coded. At the same time, what's the usage pattern that you're anticipating? Pretty much all of the analysis that we've done has been happening in dataframes -- so the database has been acting much more like an archival store of relatively static information than a place where calculations and analysis happens.

karldw commented 6 years ago

Dropping the state sounds good. Should I do that now or after plant IDs are more lined up?

I'm going to be doing a lot of subsetting by date, and it seemed less fiddly, both for coding and indexing, to have a date column than repeatedly pull the date part out. But maybe I just need to step up my views game.

zaneselvans commented 6 years ago

For now I think we need to leave the state in, so we can compare the plant_id / state combinations that we're getting in CEMS to what we have from the EIA860, and find any inconsistencies, and flag them. Previously we've had the luxury of doing all this data cleaning and validation type stuff in one big dataframe... but here we're going to have to figure something else out -- either accumulating a list of them during the ingest process, or doing it after the fact. For now I think we probably want to just work with the data in the DB, since it'll be faster and that's where it all is...

This seems like a great opportunity to figure out how to create views! We haven't done it yet either, but this seems like about as simple a case as there could be. :-)

zaneselvans commented 6 years ago

So, we know that the real information content of this dataset is much more like 7 GB than 150 GB, based on the zipped size -- probably it's even smaller than that since there's no inter-file deduplication happening. A lot of this compressibility is because there's a huge number of repeated values in the data -- states, ORISPL codes, time and date stamps every hour on the hour, the flag columns for estimated vs. measured, etc. -- right now all of that is being stored as unique text or numerical data, which creates a huge amount of unnecessary duplication. There are only 200,000 different hourly timestamps across the dataset, but we've stored 681 million of them. I don't know how much of this deduplication of the stored data postgres does on its own, or whether we need to tell it explicitly to use ENUM types for the values that have a finite set of values they can take on, but I imagine this is a common problem that has already been solved very well by others with similarly redundant data being stored... so hopefully we can just use whatever the right solution is to shrink this all down to something more like 15 GB instead of 150 GB...
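For the finite-valued columns, telling postgres explicitly probably means declaring them as ENUM types, along these lines (the column name and code values here are only illustrative):

import sqlalchemy as sa

# Illustrative only: declare a flag column as a postgres ENUM instead of free
# text, so each value is stored as a small fixed-size token rather than a
# repeated string. The column name and code values are assumptions.
measurement_code_enum = sa.Enum(
    "Measured", "Calculated", "Substitute", "Other",
    name="epacems_measurement_code",
)

example_table = sa.Table(
    "epacems_enum_example", sa.MetaData(),
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("co2_mass_measurement_code", measurement_code_enum),
)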

zaneselvans commented 6 years ago

In other news, I selected all unique pairs of (orispl_code, state) from the full hourly_emissions_epacems table, to see if there were any cases in which the state associated with a code changed (the select took 4 minutes and 20 seconds to run), and found that there were no internal inconsistencies of that type. Yay!
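That check was presumably something along the lines of the following (the connection string is a placeholder):

import pandas as pd
import sqlalchemy as sa

# Presumably the check amounted to something like this: pull the distinct
# (orispl_code, state) pairs and see whether any code maps to more than one
# state. The connection string is a placeholder.
engine = sa.create_engine("postgresql:///pudl")
pairs = pd.read_sql(
    "SELECT DISTINCT orispl_code, state FROM hourly_emissions_epacems",
    engine,
)
inconsistent = pairs[pairs.duplicated("orispl_code", keep=False)]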

gschivley commented 6 years ago

You guys are making great progress! I'm swamped with other stuff under deadline but will add comments where I think they can be helpful.

Also, we need to understand what they mean by "unit id" in CEMS -- is that going to correspond to the generator_id that EIA has?

I believe these can be different. EPA tracks emission stacks, which can be shared between generating units/boilers. Or multiple boilers with different stacks can feed into a single generator. Permutations like that.

In other news, I selected all unique pairs of (orispl_code, state) from the full hourly_emissions_epacems table, to see if there were any cases in which the state associated with a code changed (the select took 4 minutes and 20 seconds to run), and found that there were no internal inconsistencies of that type.

I might have this backwards but there are cases where EIA (or EPA?) will change an orispl code or keep it the same when it moves. e.g. a diesel generator might move locations but keep the same code. Or a facility may be converted from coal to NG and keep the same code from only one of the agencies. I can't remember the exact circumstances that have been explained to me but it's something like that.

zaneselvans commented 6 years ago

There are a few cases of ID changes or other weirdness (like, 100) that @karldw linked to, but I think for the most part things match up EIA to ORISPL. Initially I think we'll just create a foreign key relationship and probably that small number of IDs that are weird won't even show up for the most part -- it looks like there are 10,000+ ID + State combos in EIA860, but only 1750 or so in CEMS?

On the generator ID thing... ugh. That's too bad. But, we have an algorithm which can probably identify which EPA units should be associated with which EIA units based on the mutually reported fields (heat content consumed and electricity generated). We developed it to hook the FERC plants up with the corresponding EIA generators, and applying it to the EPA CEMS will be much cleaner, since we have the plant_id/orispl_code to constrain the matching.

karldw commented 5 years ago

Some performance notes:

It might be worth adding some of these settings to the setup docs, maybe with a note about the risks of lower WAL settings.

karldw commented 5 years ago

CC @lpreonas about timezones

karldw commented 5 years ago

Two things about CEMS generation numbers that should probably be fixed:

gschivley commented 5 years ago

Many (all?) of the NAs in generation are actually zeros (presumably they were omitted for sparsity?) Maybe @gschivley can weigh in here.

All empty generation values will be converted to nan. I checked out the January 2018 data for Alabama and ~45% of rows have nan generation and 0 for OP_TIME. But ~0.5% of rows are nan generation and positive OP_TIME.

Looks like three generators are responsible for no generation but positive OP_TIME in this file (7, 52140, 880041). The first two are classified by EIA860 as CHP and the third isn't included in 860. So maybe these plants are only producing heat during those hours.

To check this I did a calculation of btu/lb steam for each of the plants. Rule of thumb is ~1,000 btu/lb steam, which checks out with what's going on here.

[image: btu/lb of steam calculated for each of the three plants]
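(For reference, the check is roughly the following, using the column names that come up later in this thread; ~1,000 btu per lb of steam is the rule of thumb being tested.)

def btu_per_lb_steam(df):
    # Heat input per pound of steam for hours with steam load but no
    # generation. heat_content_mmbtu -> btu; steam_load_1000_lb -> lb.
    return (df["heat_content_mmbtu"] * 1e6) / (df["steam_load_1000_lb"] * 1e3)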

tl;dr - Seems reasonable to assume that all nan generation values should be zero. EPA just omits values rather than putting 0, and pandas converts the blanks to nan.
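If that holds, the eventual cleanup is just a fillna (column name as used elsewhere in this thread):

def zero_fill_generation(df):
    # If a blank really means the unit wasn't generating, NaN gross load
    # becomes 0.0. Column name follows the discussion in this thread.
    df["gross_load_mw"] = df["gross_load_mw"].fillna(0.0)
    return df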

zaneselvans commented 5 years ago

Is the heat_content_mmbtu column a similar situation (where NaN should be zero)? Do we have a key somewhere describing what exactly the meaning of each of these columns is? What's the steam load? I was thinking that maybe it was heat (rather than electrical) output, but there are a bunch of places with zero gross_load_mw and zero steam_load_1000_lb and non-zero heat_content_mmbtu, so maybe that's not right.

gschivley commented 5 years ago

Do you mean heat input (mmbtu)? I'd assume that blanks (nan) are indeed zero.

What's the steam load? I was thinking that maybe it was heat (rather than electrical) output, but there are a bunch of places with zero gross_load_mw and zero steam_load_1000_lb and non-zero heat_content_mmbtu so maybe that's not right.

I believe steam load is a measure of the useful steam energy produced. Are you seeing startup events? I've found lots of them where several hours or more of heat input & emissions are recorded without any generation. Not sure about the steam load but I assume it would have also been zero.

karldw commented 5 years ago

Note: I edited the CEMS to-do list above: https://github.com/catalyst-cooperative/pudl/issues/171#issuecomment-398269132

karldw commented 5 years ago

Closing this because all the issues we discussed here have their own issues (#178, #207, #212, #245, #258, #260)