bsergi / Historical-PJM-Dispatch-Curves

Code for a simple model that attempts to recreate historical dispatch curves in PJM using data from EPA and PJM in order to estimate marginal emissions factors.
0 stars 0 forks source link

Speed up marginal cost calculations #1

Closed gschivley closed 5 years ago

gschivley commented 6 years ago

https://github.com/bsergi/Historical-PJM-Dispatch-Curves/blob/c473d4dafa82c7683a4353d54619d7417a382956/MarginalCostFunction.py#L91

Does this loop just assign fuel prices based on primary fuel type? If so, loop through the fuels and assign a price to every plant of a given fuel type rather than looping through every row of the dataframe (very slow).

Even better, if you have dataframes for each fuel price at whatever time interval (hourly would be easiest) then the price and CEMS dataframes can be merged using a pandas join.

Example fuel dataframe:

datetime fuel fuel_cost
2016-01-01 01:00:00 coal 23.00
2016-01-01 02:00:00 coal 23.00
2016-01-01 03:00:00 coal 23.00
... ... ...
2016-12-31 23:00:00 coal 21.00
2016-01-01 01:00:00 gas 3.00
2016-01-01 02:00:00 gas 3.00
... ... ...
bsergi commented 5 years ago

That isn't quite what we're doing--we like to give all coal, biomass, and oil plants one price based on their fuel type, but for gas plants we'd like to be able to be more granular and give them prices based on what state they are in. One thing I thought about doing was storing each potential cost in a separate column and then combining in one column, using fuel indicator to select (by row) which column is chosen for the final price column.

In R this would be easily accomplished with an ifelse statement, but so far I have not figured out the same functionality in pandas. Is there anything you'd recommend to do this?

gschivley commented 5 years ago

I think the goal should be to eliminate/reduce loops and explicit if/else statements wherever possible. One way to do slightly faster looping through rows would be to use the .apply() method (row wise with axis=1) and write a custom function that uses values in other columns for a row to determine what the appropriate fuel price is. I'd still guess that creating one large dataframe of all fuel prices and performing a join on time/fuel/state is going to be faster.

datetime state fuel fuel_cost
2016-01-01 01:00:00 AL coal 23.00
2016-01-01 02:00:00 AL coal 23.00
2016-01-01 03:00:00 AL coal 23.00
... ... ...
2016-01-01 01:00:00 TX coal 23.00
2016-01-01 02:00:00 TX coal 23.00
2016-01-01 03:00:00 TX coal 23.00
... ... ...
2016-01-01 01:00:00 AL gas 3.40
2016-01-01 02:00:00 AL gas 3.40
2016-01-01 03:00:00 AL gas 3.40
... ... ...
2016-01-01 01:00:00 TX gas 3.00
2016-01-01 02:00:00 TX gas 3.00
2016-01-01 03:00:00 TX gas 3.00
... ... ...

The custom function and .apply() method would look something like:

def get_fuel_price(row, coal_price_dict, gas_price_dict, ...):
    # This whole thing would probably be faster as nested dictionaries than if statements
    if row['fuel'] == 'coal':
        price = coal_price_dict[row['datetime'].dt.quarter]
    elif row['fuel'] == 'gas':
        state = row['state']
        price = gas_price_dict[state][row['datetime'].dt.quarter]  # nested dictionaries
    ...
    return price

df['fuel_price'] = df.apply(get_fuel_price, axis=1)
tdeetjen commented 5 years ago

Brian - here are a couple of ideas:

take a look at scipy.where() You can essentially do some nested if statements and scipy accomplishes this fairly quickly compared to actually writing a set of if/else statements. It works like the IF statement in Excel - e.g. IF(condition, do_this_if_true, do_this_if_false) where you can put another IF statement for the "do_this_if_false" argument.

Here is an example from my dispatch script (that I'm cleaning up and aiming to get on github pretty soon) that assigns VO&M costs to each generator based on their fuel type.

df is a pandas.DataFrame that holds information about each generator in

the columns including heat rate, nerc region, emissions rate, etc.

assign the VO&M values for each fuel type

vom_ng = 3.5 vom_coal = 4.6 vom_oil = 4.6 vom_nuc = 2.3 vom_bio = 4.2

create df column 'vom' using scipy.where to assign VO&M costs based on the

value in df.fuel_type df['vom'] = scipy.where(df.fuel_type=='Oil',vom_oil, scipy.where(df.fuel_type=='Biomass',vom_bio, scipy.where(df.fuel_type=='Gas',vom_ng, scipy.where(df.fuel_type=='Coal',vom_coal, scipy.where(df.fuel_type=='Nuclear',vom_nuc, 0.0)))))

For fuel prices, I'm doing something similar to what Greg is proposing because it can handle more complex operations. But depending on how detailed you need the fuel price assignments to be, scipy.where() might do the trick.

For your natural gas idea, maybe you could build a dataframe with columns for fuel, state, and price. You could use pandas.merge() to combine that dataframe with the other dataframe that you are storing your generator information in. You can merge on multiple columns, so if your generator dataframe and your fuelPrice dataframe both have columns 'fuel' and 'state', you could do something like:

generator_df = generator_df.merge(fuelPrices_df, left_index=True, how='left', on=['fuel', 'state'])

Thomas

Thomas A. Deetjen, Ph.D. Postdoctoral Researcher Carnegie Mellon University www.thomasdeetjen.com http://www.thomasdeetjen.com tdeetjen@gmail.com

On Tue, Jul 10, 2018 at 5:17 PM, Greg Schivley notifications@github.com wrote:

I think the goal should be to eliminate/reduce loops and explicit if/else statements wherever possible. One way to do slightly faster looping through rows would be to use the .apply() method (row wise with axis=1) and write a custom function that uses values in other columns for a row to determine what the appropriate fuel price is. I'd still guess that creating one large dataframe of all fuel prices and performing a join on time/fuel/state is going to be faster. datetime state fuel fuel_cost 2016-01-01 01:00:00 AL coal 23.00 2016-01-01 02:00:00 AL coal 23.00 2016-01-01 03:00:00 AL coal 23.00 ... ... ... 2016-01-01 01:00:00 TX coal 23.00 2016-01-01 02:00:00 TX coal 23.00 2016-01-01 03:00:00 TX coal 23.00 ... ... ... 2016-01-01 01:00:00 AL gas 3.40 2016-01-01 02:00:00 AL gas 3.40 2016-01-01 03:00:00 AL gas 3.40 ... ... ... 2016-01-01 01:00:00 TX gas 3.00 2016-01-01 02:00:00 TX gas 3.00 2016-01-01 03:00:00 TX gas 3.00 ... ... ...

The custom function and .apply() method would look something like:

def get_fuel_price(row, coal_price_dict, gas_price_dict, ...):

This whole thing would probably be faster as nested dictionaries than if statements

if row['fuel'] == 'coal':
    price = coal_price_dict[row['datetime'].dt.quarter]
elif row['fuel'] == 'gas':
    state = row['state']
    price = gas_price_dict[state][row['datetime'].dt.quarter]  # nested dictionaries
...
return price

df['fuel_price'] = df.apply(get_fuel_price, axis=1

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bsergi/Historical-PJM-Dispatch-Curves/issues/1#issuecomment-403983897, or mute the thread https://github.com/notifications/unsubscribe-auth/AQC7fQWcedBgOCBLU7ovEDbnV3N6w9q7ks5uFSgDgaJpZM4UbHXo .

gschivley commented 5 years ago

I wrote up a quick example notebook in google's colab that compares the calculation time for a few methods.

Results:

The slow iteration time seems a little suspect but it's definitely the slowest method.

gschivley commented 5 years ago

@bsergi - I've been leaving out discussion of the different NG fuel cost methods. It looks like a single method is selected up front. If that's true then just assign whatever the appropriate price is to the dictionary or dataframe where prices are stored at the beginning of the code.

bsergi commented 5 years ago

Thanks gents, that was super helpful. Ended up going with a combination of a dictionary-based merge and using pd.where() and it's way faster than the looping we had before. See below for a snippet (will push the full code updates shortly):

def calcMarginalCosts(simpleDF,gasMethod,year,month,day,hour,fuelData, eGrid, monthlyPlantFuelCostData, hubPrices, henryHubPrices):

    # different formatting for extracting coal and gas fuel info
    dateGas = str(month)+'-'+str(day)+'-'+str(year)
    dateCoal = str(month)+'/'+str(day)+'/'+str(year)

    # average coal price for given month
    coalPrice = coalMethodOne(dateCoal, fuelData)

    # Henry Hub gas price for that date (used for Method 3)
    gasPrice = gasMethodThree(dateGas,eGrid,henryHubPrices)

    # estimated costs for oil and biomass 
    # note: nuclear cost is not a real estimate, currently a placeholder
    # source for biomass costs above
    # Also in date: OTHF (Other fuel) -- could be number of options
    # should include oil prices over time
    prices = pd.DataFrame(data={'Fuel': ['OIL', 'BIOMASS', 'NUCLEAR', 'COAL', 'GAS'], 
                                'Fuel Cost ($/MMBtu)': [10, 2, 5, coalPrice, gasPrice]})

    simpleDF = pd.merge(simpleDF, prices, on=['Fuel'], how='left', sort=False)      

    # gas prices: three methods available
    # Method 1: Plant-specific monthly average from EIA (under construction)
    # Method 2: State-based hub daily values
    # Method 3: Henry Hub daily values (baseline, constructed above)
    if gasMethod == 1:
        gasDF = monthlyPlantFuelCostData        
        # merge montly data set
        # use plant fuel cost average whenever data not missing         
    elif gasMethod == 2:
        # get hub prices for specified day
        gasDF = gasMethodTwo(dateGas, hubPrices)         
        # merge hub gas prices with main data set data frame     
        simpleDF = pd.merge(simpleDF, gasDF, on="Gas_Hubs", how="left", sort=False)        
        # replace gas plants with hub-based gas prices        
        simpleDF['Fuel Cost ($/MMBtu)'].where(simpleDF['Fuel'] != 'GAS', simpleDF['Gas price'], inplace=True)   

    # Calculate marginal cost of operation in $/MWh for given fuel prices for given time snapshot    
    simpleDF.loc[:, "Marginal Cost ($/MWh)"] = simpleDF["Fuel Cost ($/MMBtu)"] * simpleDF["Heat rate (MMBtu/MWh)"]  

    return(simpleDF)