Clean, transform, and sanity check the `vceregen` data

aesharpe commented 1 month ago

Data Structure Description

Under our current system, this data will be published as a single Apache Parquet file (as with all our other hourly tables). Internally it would be partitioned by year or possibly [year, state] into a number of distinct row groups for more efficient querying.

Primary Key / Index Columns:

Hourly UTC Timestamp
5-digit county FIPS ID (a string of digits, including leading zeros, for easy integration with geospatial & other data)
3 data columns, each containing floating point values between 0.0 & 1.0 (could be float16?)
- Onshore Wind
- Offshore Wind
- Fixed-tilt Solar PV

Possible additional columns:

2-letter state abbreviation for ergonomic filtering & legibility (categorical string, e.g. “CA”)
Canonical county name for legibility (categorical string, e.g. “San Bernardino”)
Approximate county time zone for use in time localization (categorical string, e.g. “America/Los_Angeles”)
Longitude & Latitude for the center of the county, for spatial queries (low precision floats or GeoPoint)

This structure & format is intended to maximize programmatic usability. It minimizes the need for additional external information or context to make the data usable and allows it to be queried efficiently remotely, and will let us and others build on top of it with minimal additional glue required – e.g. through the interactive dashboard mentioned below. With this structure, 5 years of data would have ~140M rows and the one Parquet file will probably be a few GB on disk.

Data Transformations

[x] Add missing leading zeros to provided county FIPS IDs
[x] Add UTC timestamp column based on row number to be used as part of the index
[x] Pivot county column headers into a single county column.
[x] Map county_state name strings to the provided county FIPS IDs
[x] Divide values by 100 to convert percentages to capacity factors if necessary

Data Sanity Checks

[ ] Do the provided FIPS codes and their associated county & state names match current US Census data? In the sample data the FIPS codes have lost their leading zeros.
[ ] Do the provided county lon/lat values fall within the expected county boundaries in US Census/TIGER or DP1 data?
[ ] Plot histograms of capacity factors for the 3 different generator types. Do they look right? Sample data seems to be in percentages (0.0-100.0) rather than fractions (0.0-1.0).
[ ] Do the capacity factors vary diurnally and seasonally as expected?

aesharpe commented 3 weeks ago

How should we handle the time column based on the answer we got from Chris:

Because uniform, complete, continuous time series are nicer to work with in many contexts, we're planning to leave in the hours beyond 8760 in leap years, but give the data NA values in our outputs. Does that seem acceptable?

This is supposed to support RA work and that is never done with continual data time series. Almost all modeling done in the space uses 8760 and we’d prefer it to remain that way as it is. For almost all purposes in modeling it makes very little difference (production cost, capacity expansion, resource adequacy).

If we convert the index to a time it will probably add in those leap year that we don't want. What do you think is the best solution here @e-belfer @zaneselvans?

zaneselvans commented 3 weeks ago

Given his response I think we want to include the first 8760 hours of the year, so in leap years we'd be missing Dec 31st.

e-belfer commented 3 weeks ago

The other option is to muck around with offsetting and drop the leap day, not sure if that's preferable. As discussed here: https://stackoverflow.com/questions/68316510/create-pandas-datetimeindex-without-leap-day

aesharpe commented 3 weeks ago

Given his response I think we want to include the first 8760 hours of the year, so in leap years we'd be missing Dec 31st.

Should we not remove February 29th?

zaneselvans commented 3 weeks ago

If we were missing Feb 29th and someone converted the timestamps to hour-of-year then I would think they'd end up with a gap in Feb, and some hours of the year that were between 8760 and 8784

aesharpe commented 3 weeks ago

What if we had two columns, a date time and an hours of year column?

aesharpe commented 3 weeks ago

Or maybe we just have a year column and an hours of year column?

zaneselvans commented 3 weeks ago

Another alternative would be to have a year column and an hour-of-year column (1-8760)

aesharpe commented 3 weeks ago

Loving that synchronicity haha. That feels like it would avoid any weirdness and preserve the original intent of the data. We can explain in the data table description.

zaneselvans commented 3 weeks ago

But without a datetime column the data won't be easy to combine with other tables that have datetimes, and you won't be able to use the datetime accessors, etc. small integers don't take up much space in parquet, so having all three of a datetime (for the first 8760 hours), a year, and an hour-of-year column could also be fine.

zaneselvans commented 3 weeks ago

Another pretty important reason to stick with the "first 8760 hours" approach is, these are real dates and times that need to correlate with other datasets. Like, if someone wanted to see what weather conditions these renewable outputs corresponded to and they looked back at the HRRR model outputs, the timestamps in this data and the model need to line up. And the README clearly states that the day that's missing is Dec 31st:

Data is provided as an 8,760 for all years. For leap years, the final day (December 31st) is not included.

zaneselvans commented 3 weeks ago

If we're going to include an hour-of-year we might want to ask someone who does this kind of modeling whether 8760 data is typically 0-indexed or 1-indexed too. Like, hours 0-8759 or hours 1-8760? I'm not sure. And there's no built-in hour-of-year datetime accessor so we'll have to construct it like

# create an hourly datetime index for 2024
hourly_index = pd.date_range(start="2024-01-01", end="2024-12-31 23:00", freq="h")
# calculate the hour of the year from hourly_index ranging from 0 to 8759
hour_of_year = hourly_index.hour + (hourly_index.dayofyear-1) * 24

aesharpe commented 3 weeks ago

whether 8760 data is typically 0-indexed or 1-indexed too

Can you email Chris about this?

aesharpe commented 3 weeks ago

Another thing to note:

The README says that:

The hourly capacity factors are written as decimals with values from 0.0 to 1.0.

However, the data show a range of values over 1 and over 100. Here's an example from the solar table excluding rows where the capacity factor is not 0:

print(len(df_out[(df_out["capacity_factor"]!=0) & (df_out["capacity_factor"]<1)]))
print(len(df_out[(df_out["capacity_factor"]>1) & (df_out["capacity_factor"]<100)]))
print(len(df_out[df_out["capacity_factor"]>100]))

2050181 # Rows between 0-1
69708380 # Rows between 1-100
87123 # Rows over 100

I'm not sure how to interpret or fix this.

aesharpe commented 3 weeks ago

But without a datetime column the data won't be easy to combine with other tables that have datetimes, and you won't be able to use the datetime accessors, etc. small integers don't take up much space in parquet, so having all three of a datetime (for the first 8760 hours), a year, and an hour-of-year column could also be fine

@zaneselvans can you clarify what you mean by "having all three of a datetime (for the first 8760 hours)?

Do you mean:

Make a year column with a datetime type
Make an hour-year column that's also a datetime type
Something about another datetime column but just for the 2019 values (i.e.: "the first 8760 hours")?

zaneselvans commented 3 weeks ago

I was suggesting:

Make a year column that is an integer like 2020
Make an hour-of-year column that is an integer like 5431
Keep a UTC datetime column that has hourly resolution, and from which the other two columns were derived.

That's not great news on the capacity factor values! I'd imagined that he opened the sample data in a spreadsheet and it accidentally got cast to percentages because all the sample values were between 0 and 100, or something like that. If there's ambiguity as to whether it's a percentage or a value between 0 and 1, and if there are really values beyond 100, then we probably need to ask for guidance or another handoff of data that doesn't have this problem.

Can you plot a histogram of the values to see how crazy they are? Is there uniformity within individual counties or years of data? Is there any chance that we've introduced this weirdness?

zaneselvans commented 3 weeks ago

It looks like about half of the values are 0-1, and half of them are 0-100, even within a particular county and year.

import random
csv_path = "/Users/zane/code/catalyst/pudl-input/local/vceregen/2023/Fixed_SolarPV_Lat_UPV_county.csv"
pv = pd.read_csv(csv_path)
random_col = random.choice(pv.columns[1:-1])
plt.hist(pv[random_col], bins=100, range=(0, 100))
plt.title(f"{random_col} (max value: {pv[random_col].max():.2f})")
plt.grid()
plt.show();

zaneselvans commented 3 weeks ago

But I guess we would expect on average 50% of solar hours to have generation of zero. So maybe it's just that all of the values are between 0 and 100, and so percentages. The wind data seems in line with that:

import random
year = 2023
gen_type = "onwind"
files = {
    "pv": "Fixed_SolarPV_Lat_UPV_county",
    "onwind": "Wind_Power_100m_Onshore_county",
    "offwind": "Wind_Power_140m_Offshore_county",
}
csv_path = f"/Users/zane/code/catalyst/pudl-input/local/vceregen/{year}/{files[gen_type]}.csv"
df = pd.read_csv(csv_path)

random_col = random.choice(df.columns[1:-1])
plt.hist(df[random_col], bins=100, range=(0, 100))
plt.title(f"{random_col}; gen: {gen_type}; max: {df[random_col].max():.2f}")
plt.grid()
plt.show();

aesharpe commented 3 weeks ago

Make a year column that is an integer like 2020

I just found out there's a hidden report_year column in amongst allllll the county columns...

aesharpe commented 3 weeks ago

But I guess we would expect on average 50% of solar hours to have generation of zero. So maybe it's just that all of the values are between 0 and 100, and so percentages. The wind data seems in line with that:

ok great. I'll assume that it's all reported as a % (0-100) and convert it to a fraction (0-1). How should we handle the values > 100?

zaneselvans commented 3 weeks ago

I'm not sure how you're extracting the CSVs right now, but if it's using our CSV Extractor, I think we probably add the report_year based on the year partition value automatically.

On the values > 100 what do they look like?

Do they appear in all the datasets? (onshore wind, offshore wind, solar?)
Are they like, wildly more than 100? Or just 101? If it's just a few things that go over the nameplate by a couple of percent that seems fine. If there's values up in the multi-hundreds or thousands then something is very wrong and we'll need to ask some questions.

Also it looks like the CSVs contain an unlabeled index column that starts at 1 and goes to 8760, so it sounds like we've already got our year and hour-of year columns and can use them to create a UTC datetime column 😄

aesharpe commented 3 weeks ago

Also it looks like the CSVs contain an unlabeled index column that starts at 1 and goes to 8760, so it sounds like we've already got our year and hour-of year columns and can use them to create a UTC datetime column 😄

oh I did notice that, but I thought it was important to create these columns based on the datetime column as per your clarification above.

catalyst-cooperative / pudl