How should we handle the time column based on the answer we got from Chris:
Because uniform, complete, continuous time series are nicer to work with in many contexts, we're planning to leave in the hours beyond 8760 in leap years, but give the data NA values in our outputs. Does that seem acceptable?
This is supposed to support RA work and that is never done with continual data time series. Almost all modeling done in the space uses 8760 and we’d prefer it to remain that way as it is. For almost all purposes in modeling it makes very little difference (production cost, capacity expansion, resource adequacy).
If we convert the index to a datetime it will probably add in those leap-year hours that we don't want. What do you think is the best solution here @e-belfer @zaneselvans?
Given his response I think we want to include the first 8760 hours of the year, so in leap years we'd be missing Dec 31st.
The other option is to muck around with offsetting and drop the leap day, not sure if that's preferable. As discussed here: https://stackoverflow.com/questions/68316510/create-pandas-datetimeindex-without-leap-day
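For concreteness, here's a minimal sketch (assuming pandas; the year is just an example leap year) of the two options, keeping only the first 8760 hours versus dropping the leap day:

```python
import pandas as pd

# A leap year has 8784 hourly timestamps.
year = 2020
full_index = pd.date_range(start=f"{year}-01-01", periods=8784, freq="h")

# Option 1: keep only the first 8760 hours, so Dec 31st is the day that goes missing.
first_8760 = full_index[:8760]  # last timestamp is 2020-12-30 23:00

# Option 2 (per the Stack Overflow thread): drop Feb 29th instead.
no_leap_day = full_index[~((full_index.month == 2) & (full_index.day == 29))]
```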
> Given his response I think we want to include the first 8760 hours of the year, so in leap years we'd be missing Dec 31st.
Should we not remove February 29th?
If we were missing Feb 29th and someone converted the timestamps to hour-of-year, then I'd think they'd end up with a gap in February, plus some hours of the year that fell between 8760 and 8784.
What if we had two columns, a datetime and an hour-of-year column?
Or maybe we just have a year column and an hour-of-year column?
Another alternative would be to have a year column and an hour-of-year column (1-8760)
Loving that synchronicity haha. That feels like it would avoid any weirdness and preserve the original intent of the data. We can explain in the data table description.
But without a datetime column the data won't be easy to combine with other tables that have datetimes, and you won't be able to use the datetime accessors, etc. Small integers don't take up much space in Parquet, so having all three of a datetime (for the first 8760 hours), a year, and an hour-of-year column could also be fine.
Another pretty important reason to stick with the "first 8760 hours" approach is that these are real dates and times that need to correlate with other datasets. For example, if someone wanted to see what weather conditions these renewable outputs corresponded to and looked back at the HRRR model outputs, the timestamps in this data and in the model would need to line up. And the README clearly states that the day that's missing is Dec 31st:
> Data is provided as an 8,760 for all years. For leap years, the final day (December 31st) is not included.
If we're going to include an hour-of-year we might want to ask someone who does this kind of modeling whether 8760 data is typically 0-indexed or 1-indexed too. Like, hours 0-8759 or hours 1-8760? I'm not sure. And there's no built-in hour-of-year datetime accessor, so we'll have to construct it ourselves, something like:
```python
import pandas as pd

# Create an hourly datetime index for 2024 (a leap year, so 8784 hours).
hourly_index = pd.date_range(start="2024-01-01", end="2024-12-31 23:00", freq="h")
# Calculate the hour of the year from hourly_index (0-8783 here; 0-8759 in non-leap years).
hour_of_year = hourly_index.hour + (hourly_index.dayofyear - 1) * 24
```
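If the answer turns out to be 1-indexed (hours 1-8760), the adjustment to the sketch above would presumably just be an offset:

```python
# 1-indexed hour of year: 1-8760 (1-8784 in a leap year).
hour_of_year_1_indexed = hour_of_year + 1
```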
> whether 8760 data is typically 0-indexed or 1-indexed too
Can you email Chris about this?
Another thing to note:
The README says that:
> The hourly capacity factors are written as decimals with values from 0.0 to 1.0.
However, the data contain values over 1 and even over 100. Here's an example from the solar table, excluding rows where the capacity factor is 0:
```python
print(len(df_out[(df_out["capacity_factor"] != 0) & (df_out["capacity_factor"] < 1)]))
print(len(df_out[(df_out["capacity_factor"] > 1) & (df_out["capacity_factor"] < 100)]))
print(len(df_out[df_out["capacity_factor"] > 100]))
```
```
2050181   # rows between 0 and 1
69708380  # rows between 1 and 100
87123     # rows over 100
```
I'm not sure how to interpret or fix this.
> But without a datetime column the data won't be easy to combine with other tables that have datetimes, and you won't be able to use the datetime accessors, etc. Small integers don't take up much space in Parquet, so having all three of a datetime (for the first 8760 hours), a year, and an hour-of-year column could also be fine.
@zaneselvans can you clarify what you mean by "having all three of a datetime (for the first 8760 hours)"?
Do you mean:
I was suggesting:
That's not great news on the capacity factor values! I'd imagined that he opened the sample data in a spreadsheet and it accidentally got cast to percentages because all the sample values were between 0 and 100, or something like that. If there's ambiguity as to whether it's a percentage or a value between 0 and 1, and if there are really values beyond 100, then we probably need to ask for guidance or another handoff of data that doesn't have this problem.
Can you plot a histogram of the values to see how crazy they are? Is there uniformity within individual counties or years of data? Is there any chance that we've introduced this weirdness?
It looks like about half of the values are 0-1, and half of them are 0-100, even within a particular county and year.
```python
import random

import matplotlib.pyplot as plt
import pandas as pd

csv_path = "/Users/zane/code/catalyst/pudl-input/local/vceregen/2023/Fixed_SolarPV_Lat_UPV_county.csv"
pv = pd.read_csv(csv_path)

# Pick a random data column (skipping the first and last columns).
random_col = random.choice(pv.columns[1:-1])
plt.hist(pv[random_col], bins=100, range=(0, 100))
plt.title(f"{random_col} (max value: {pv[random_col].max():.2f})")
plt.grid()
plt.show()
```
But I guess we would expect on average 50% of solar hours to have generation of zero. So maybe it's just that all of the values are between 0 and 100, and so percentages. The wind data seems in line with that:
```python
import random

import matplotlib.pyplot as plt
import pandas as pd

year = 2023
gen_type = "onwind"
files = {
    "pv": "Fixed_SolarPV_Lat_UPV_county",
    "onwind": "Wind_Power_100m_Onshore_county",
    "offwind": "Wind_Power_140m_Offshore_county",
}
csv_path = f"/Users/zane/code/catalyst/pudl-input/local/vceregen/{year}/{files[gen_type]}.csv"
df = pd.read_csv(csv_path)

# Pick a random data column (skipping the first and last columns).
random_col = random.choice(df.columns[1:-1])
plt.hist(df[random_col], bins=100, range=(0, 100))
plt.title(f"{random_col}; gen: {gen_type}; max: {df[random_col].max():.2f}")
plt.grid()
plt.show()
```
> - Make a year column that is an integer like 2020
I just found out there's a hidden `report_year` column in amongst allllll the county columns...
> But I guess we would expect on average 50% of solar hours to have generation of zero. So maybe it's just that all of the values are between 0 and 100, and so percentages. The wind data seems in line with that:
ok great. I'll assume that it's all reported as a % (0-100) and convert it to a fraction (0-1). How should we handle the values > 100?
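For what it's worth, a hedged sketch of that conversion (reusing `df_out` and the `capacity_factor` column from the earlier snippet, and deliberately leaving the > 100 values unresolved) might look like:

```python
# Convert percent (0-100) to fraction (0-1).
df_out["capacity_factor"] = df_out["capacity_factor"] / 100.0

# Values still > 1.0 after conversion (i.e. > 100%) remain unexplained;
# count them rather than silently clipping, pending guidance.
n_over_unity = (df_out["capacity_factor"] > 1.0).sum()
print(f"{n_over_unity} capacity factor values exceed 1.0 after conversion")
```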
I'm not sure how you're extracting the CSVs right now, but if it's using our CSV Extractor, I think we probably add the `report_year` based on the year partition value automatically.
On the values > 100: what do they look like?
Also it looks like the CSVs contain an unlabeled index column that starts at 1 and goes to 8760, so it sounds like we've already got our year and hour-of-year columns and can use them to create a UTC datetime column 😄
> Also it looks like the CSVs contain an unlabeled index column that starts at 1 and goes to 8760, so it sounds like we've already got our year and hour-of-year columns and can use them to create a UTC datetime column 😄
Oh, I did notice that, but I thought it was important to create these columns based on the datetime column as per your clarification above.
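For reference, a sketch of the construction suggested above (building the UTC datetime from the year and a 1-indexed hour-of-year; the column names `report_year` and `hour_of_year` are assumptions, not necessarily what the extraction produces):

```python
import pandas as pd

# Toy frame standing in for the extracted data.
df = pd.DataFrame({"report_year": [2020, 2020], "hour_of_year": [1, 8760]})

# Midnight Jan 1st of the report year, plus (hour_of_year - 1) hours, in UTC.
df["datetime_utc"] = (
    pd.to_datetime(df["report_year"].astype(str), format="%Y", utc=True)
    + pd.to_timedelta(df["hour_of_year"] - 1, unit="h")
)
# hour 1 -> 2020-01-01 00:00+00:00; hour 8760 -> 2020-12-30 23:00+00:00 (leap year)
```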
Data Structure Description
Under our current system, this data will be published as a single Apache Parquet file (as with all our other hourly tables). Internally it would be partitioned by year or possibly [year, state] into a number of distinct row groups for more efficient querying.
Primary Key / Index Columns:
Possible additional columns:
This structure and format is intended to maximize programmatic usability. It minimizes the need for additional external information or context to make the data usable, allows it to be queried efficiently from remote storage, and will let us and others build on top of it with minimal additional glue (e.g. through the interactive dashboard mentioned below). With this structure, 5 years of data would have ~140M rows, and the single Parquet file will probably be a few GB on disk.
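To illustrate the kind of remote querying this structure enables, here's a hypothetical example (the URL, table name, and column names are invented placeholders, not the actual published schema):

```python
import pandas as pd

# Read just one year/state slice of the hourly table straight from remote storage,
# relying on Parquet row-group statistics to skip most of the file.
url = "https://example.com/pudl/out_vceregen__hourly_capacity_factors.parquet"
df = pd.read_parquet(
    url,
    columns=["datetime_utc", "county_id_fips", "capacity_factor"],
    filters=[("report_year", "==", 2023), ("state", "==", "CO")],
)
```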
Data Transformations
Data Sanity Checks