Closed gschivley closed 5 years ago
That isn't quite what we're doing--we like to give all coal, biomass, and oil plants one price based on their fuel type, but for gas plants we'd like to be able to be more granular and give them prices based on what state they are in. One thing I thought about doing was storing each potential cost in a separate column and then combining in one column, using fuel indicator to select (by row) which column is chosen for the final price column.
In R this would be easily accomplished with an ifelse statement, but so far I have not figured out the same functionality in pandas. Is there anything you'd recommend to do this?
I think the goal should be to eliminate/reduce loops and explicit if/else statements wherever possible. One way to do slightly faster looping through rows would be to use the .apply()
method (row wise with axis=1
) and write a custom function that uses values in other columns for a row to determine what the appropriate fuel price is. I'd still guess that creating one large dataframe of all fuel prices and performing a join on time/fuel/state is going to be faster.
datetime | state | fuel | fuel_cost |
---|---|---|---|
2016-01-01 01:00:00 | AL | coal | 23.00 |
2016-01-01 02:00:00 | AL | coal | 23.00 |
2016-01-01 03:00:00 | AL | coal | 23.00 |
... | ... | ... | |
2016-01-01 01:00:00 | TX | coal | 23.00 |
2016-01-01 02:00:00 | TX | coal | 23.00 |
2016-01-01 03:00:00 | TX | coal | 23.00 |
... | ... | ... | |
2016-01-01 01:00:00 | AL | gas | 3.40 |
2016-01-01 02:00:00 | AL | gas | 3.40 |
2016-01-01 03:00:00 | AL | gas | 3.40 |
... | ... | ... | |
2016-01-01 01:00:00 | TX | gas | 3.00 |
2016-01-01 02:00:00 | TX | gas | 3.00 |
2016-01-01 03:00:00 | TX | gas | 3.00 |
... | ... | ... |
The custom function and .apply()
method would look something like:
def get_fuel_price(row, coal_price_dict, gas_price_dict, ...):
# This whole thing would probably be faster as nested dictionaries than if statements
if row['fuel'] == 'coal':
price = coal_price_dict[row['datetime'].dt.quarter]
elif row['fuel'] == 'gas':
state = row['state']
price = gas_price_dict[state][row['datetime'].dt.quarter] # nested dictionaries
...
return price
df['fuel_price'] = df.apply(get_fuel_price, axis=1)
Brian - here are a couple of ideas:
take a look at scipy.where() You can essentially do some nested if statements and scipy accomplishes this fairly quickly compared to actually writing a set of if/else statements. It works like the IF statement in Excel - e.g. IF(condition, do_this_if_true, do_this_if_false) where you can put another IF statement for the "do_this_if_false" argument.
Here is an example from my dispatch script (that I'm cleaning up and aiming to get on github pretty soon) that assigns VO&M costs to each generator based on their fuel type.
the columns including heat rate, nerc region, emissions rate, etc.
vom_ng = 3.5 vom_coal = 4.6 vom_oil = 4.6 vom_nuc = 2.3 vom_bio = 4.2
value in df.fuel_type df['vom'] = scipy.where(df.fuel_type=='Oil',vom_oil, scipy.where(df.fuel_type=='Biomass',vom_bio, scipy.where(df.fuel_type=='Gas',vom_ng, scipy.where(df.fuel_type=='Coal',vom_coal, scipy.where(df.fuel_type=='Nuclear',vom_nuc, 0.0)))))
For fuel prices, I'm doing something similar to what Greg is proposing because it can handle more complex operations. But depending on how detailed you need the fuel price assignments to be, scipy.where() might do the trick.
For your natural gas idea, maybe you could build a dataframe with columns for fuel, state, and price. You could use pandas.merge() to combine that dataframe with the other dataframe that you are storing your generator information in. You can merge on multiple columns, so if your generator dataframe and your fuelPrice dataframe both have columns 'fuel' and 'state', you could do something like:
generator_df = generator_df.merge(fuelPrices_df, left_index=True, how='left', on=['fuel', 'state'])
Thomas
Thomas A. Deetjen, Ph.D. Postdoctoral Researcher Carnegie Mellon University www.thomasdeetjen.com http://www.thomasdeetjen.com tdeetjen@gmail.com
On Tue, Jul 10, 2018 at 5:17 PM, Greg Schivley notifications@github.com wrote:
I think the goal should be to eliminate/reduce loops and explicit if/else statements wherever possible. One way to do slightly faster looping through rows would be to use the .apply() method (row wise with axis=1) and write a custom function that uses values in other columns for a row to determine what the appropriate fuel price is. I'd still guess that creating one large dataframe of all fuel prices and performing a join on time/fuel/state is going to be faster. datetime state fuel fuel_cost 2016-01-01 01:00:00 AL coal 23.00 2016-01-01 02:00:00 AL coal 23.00 2016-01-01 03:00:00 AL coal 23.00 ... ... ... 2016-01-01 01:00:00 TX coal 23.00 2016-01-01 02:00:00 TX coal 23.00 2016-01-01 03:00:00 TX coal 23.00 ... ... ... 2016-01-01 01:00:00 AL gas 3.40 2016-01-01 02:00:00 AL gas 3.40 2016-01-01 03:00:00 AL gas 3.40 ... ... ... 2016-01-01 01:00:00 TX gas 3.00 2016-01-01 02:00:00 TX gas 3.00 2016-01-01 03:00:00 TX gas 3.00 ... ... ...
The custom function and .apply() method would look something like:
def get_fuel_price(row, coal_price_dict, gas_price_dict, ...):
This whole thing would probably be faster as nested dictionaries than if statements
if row['fuel'] == 'coal': price = coal_price_dict[row['datetime'].dt.quarter] elif row['fuel'] == 'gas': state = row['state'] price = gas_price_dict[state][row['datetime'].dt.quarter] # nested dictionaries ... return price
df['fuel_price'] = df.apply(get_fuel_price, axis=1
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bsergi/Historical-PJM-Dispatch-Curves/issues/1#issuecomment-403983897, or mute the thread https://github.com/notifications/unsubscribe-auth/AQC7fQWcedBgOCBLU7ovEDbnV3N6w9q7ks5uFSgDgaJpZM4UbHXo .
I wrote up a quick example notebook in google's colab that compares the calculation time for a few methods.
Results:
iterrows()
and a dictionary lookup of VOM (59 s)The slow iteration time seems a little suspect but it's definitely the slowest method.
@bsergi - I've been leaving out discussion of the different NG fuel cost methods. It looks like a single method is selected up front. If that's true then just assign whatever the appropriate price is to the dictionary or dataframe where prices are stored at the beginning of the code.
Thanks gents, that was super helpful. Ended up going with a combination of a dictionary-based merge and using pd.where() and it's way faster than the looping we had before. See below for a snippet (will push the full code updates shortly):
def calcMarginalCosts(simpleDF,gasMethod,year,month,day,hour,fuelData, eGrid, monthlyPlantFuelCostData, hubPrices, henryHubPrices):
# different formatting for extracting coal and gas fuel info
dateGas = str(month)+'-'+str(day)+'-'+str(year)
dateCoal = str(month)+'/'+str(day)+'/'+str(year)
# average coal price for given month
coalPrice = coalMethodOne(dateCoal, fuelData)
# Henry Hub gas price for that date (used for Method 3)
gasPrice = gasMethodThree(dateGas,eGrid,henryHubPrices)
# estimated costs for oil and biomass
# note: nuclear cost is not a real estimate, currently a placeholder
# source for biomass costs above
# Also in date: OTHF (Other fuel) -- could be number of options
# should include oil prices over time
prices = pd.DataFrame(data={'Fuel': ['OIL', 'BIOMASS', 'NUCLEAR', 'COAL', 'GAS'],
'Fuel Cost ($/MMBtu)': [10, 2, 5, coalPrice, gasPrice]})
simpleDF = pd.merge(simpleDF, prices, on=['Fuel'], how='left', sort=False)
# gas prices: three methods available
# Method 1: Plant-specific monthly average from EIA (under construction)
# Method 2: State-based hub daily values
# Method 3: Henry Hub daily values (baseline, constructed above)
if gasMethod == 1:
gasDF = monthlyPlantFuelCostData
# merge montly data set
# use plant fuel cost average whenever data not missing
elif gasMethod == 2:
# get hub prices for specified day
gasDF = gasMethodTwo(dateGas, hubPrices)
# merge hub gas prices with main data set data frame
simpleDF = pd.merge(simpleDF, gasDF, on="Gas_Hubs", how="left", sort=False)
# replace gas plants with hub-based gas prices
simpleDF['Fuel Cost ($/MMBtu)'].where(simpleDF['Fuel'] != 'GAS', simpleDF['Gas price'], inplace=True)
# Calculate marginal cost of operation in $/MWh for given fuel prices for given time snapshot
simpleDF.loc[:, "Marginal Cost ($/MWh)"] = simpleDF["Fuel Cost ($/MMBtu)"] * simpleDF["Heat rate (MMBtu/MWh)"]
return(simpleDF)
https://github.com/bsergi/Historical-PJM-Dispatch-Curves/blob/c473d4dafa82c7683a4353d54619d7417a382956/MarginalCostFunction.py#L91
Does this loop just assign fuel prices based on primary fuel type? If so, loop through the fuels and assign a price to every plant of a given fuel type rather than looping through every row of the dataframe (very slow).
Even better, if you have dataframes for each fuel price at whatever time interval (hourly would be easiest) then the price and CEMS dataframes can be merged using a pandas join.
Example fuel dataframe: