earthobservations / wetterdienst

Open weather data for humans.
https://wetterdienst.readthedocs.io/
MIT License
349 stars 54 forks source link

KeyError: 'station_id' in NoaaGhcnRequest #797

Closed nhcb closed 1 year ago

nhcb commented 1 year ago

Describe the bug When getting data for specific stations around Amsterdam, using NoaaGhcnRequest, I get a KeyError 'station_id' even though the stations_object finds stations. When reproducing the code on issue 741 (https://github.com/earthobservations/wetterdienst/issues/741) I don't seem to have a problem.

To Reproduce

from wetterdienst.provider.noaa.ghcn.api import NoaaGhcnRequest, NoaaGhcnParameter
import datetime as dt

stations_object = NoaaGhcnRequest(
    parameter=NoaaGhcnParameter.DAILY.TEMPERATURE_AIR_MIN_200,
    start_date=dt.datetime(2010, 1, 1),
    end_date=dt.datetime(2022, 1, 1)
).filter_by_station_id('NLE00101920')

print(stations_object)
def get_data_from_stations_request(
    stations_object: NoaaGhcnRequest,
) -> pd.DataFrame:
    """
    Takes a stations request object and process queries

    Args:
        stations_object: DwdObservationRequest object that holds all required information
        for downloading opendata dwd data

    Returns:
        DataFrame with content from DwdObservationRequest

    """
    observation_data = []

    for result in stations_object.values.query():
        observation_data.append(result.df)

    return pd.concat(observation_data)

df = get_data_from_stations_request(stations_object)
print(df)

Expected behavior Gets the temperature data

Screenshots image

Desktop (please complete the following information):

nhcb commented 1 year ago

So I tried retrieving rain height (as in the original issue). But this is possible, it seems that there is no temperature data?

gutzbenj commented 1 year ago

So I tried retrieving rain height (as in the original issue). But this is possible, it seems that there is no temperature data?

Dear @nhcb ,

this is absolutely true! I checked it and there's only precip and snow height data available. However still the error should not be thrown like that! Thanks for opening the issue!

Potentially you could use the .summary method to get data from a station somewhere near Amsterdam, putting the latlon for Amsterdam in the call.

nhcb commented 1 year ago

Thanks for your quick reply. How would I go about using the .summary method in this context? Also, is there some way to quickly check which parameters are available for which station? I really appreciate the effort put in the package, well done.

gutzbenj commented 1 year ago

So regarding using the summary method you could go like:

amsterdam = (YY.YY, XX.XX)  # latlon for Amsterdam / 'NLE00101920'
request = NoaaGhcnRequest(
    parameter=NoaaGhcnParameter.DAILY.TEMPERATURE_AIR_MIN_200,
    start_date=dt.datetime(2010, 1, 1),
    end_date=dt.datetime(2022, 1, 1)
).summary(*amsterdam)

There is NO way to check the availability of data per station with wetterdienst, because it would require to download and process each station first. At least for NOAA GHCN the number of stations is 100k+ so the amount of work for your machine would be insanely high.

However NOAA themselves offer a service at [1], which gives detailed information on each station in GHCN.

Potential other sources:

[1] https://www.ncei.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:NLE00101920/detail

nhcb commented 1 year ago

Awesome, thanks for the elaboration. I will be sure to use this in the future. Feel free to close this issue :).

nhcb commented 1 year ago

Hi, before closing this issue. I wanted to obtain data from De Bilt. However, I found this also not to be possible; even though I know for sure this is a measurement station. It's both in the inventory file, as it is a a widely known station around the Netherlands.

# Select a station and load the data for the last 10 years
from wetterdienst.provider.noaa.ghcn.api import NoaaGhcnRequest, NoaaGhcnParameter
import datetime as dt

stations_object = NoaaGhcnRequest(
    parameter=[NoaaGhcnParameter.DAILY.TEMPERATURE_AIR_MEAN_200],
    start_date=dt.datetime(2015, 1, 1),
    end_date=dt.datetime(2022, 1, 1)
).filter_by_name('DE BILT')

print(stations_object)
def get_data_from_stations_request(
    stations_object: NoaaGhcnRequest,
) -> pd.DataFrame:
    """
    Takes a stations request object and process queries

    Args:
        stations_object: DwdObservationRequest object that holds all required information
        for downloading opendata dwd data

    Returns:
        DataFrame with content from DwdObservationRequest

    """
    observation_data = []

    for result in stations_object.values.query():
        observation_data.append(result.df)

    return pd.concat(observation_data)

df = get_data_from_stations_request(stations_object)
# Pivot the data based on parameter
df = df.pivot(index='date', columns='parameter', values='value').reset_index()
print(df)
# Plot the data
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.plot(df['date'], df['temperature_air_max_200'], label='Max temp')
#plt.plot(df['date'], df['temperature_air_min_200'], label='Min temp')
plt.title('Temperature at De Bilt')
plt.legend()
plt.show()

The code returns the following error. Please know I am reporting this to further the progress of the library! image

gutzbenj commented 1 year ago

Dear @nhcb ,

once again thanks for reporting such issues! It's the beauty of the detail to bring up such errors/hiccups, I will just look into it.

Edit1:

Turns out we can not localize the given date with "Europe/Amsterdam" and the following will give us the error NonExistentTimeError:

import pandas as pd
import pytz

df = pd.DataFrame({"date": ["1914-11-08 00:00:00"]})
df.date = pd.to_datetime(df.date).dt.tz_localize(pytz.timezone("Europe/Amsterdam"))

because somehow when using pd.to_datetime it will automatically add timezone UTC and then when localizing run into error because dates like 1914-11-08 doesn't exist in UTC. However we can pass notexistent="shift_forward" to tz_localize which automatically then chooses the next date (in this case 1914-11-08 01:00:00 +01

gutzbenj commented 1 year ago

Dear @nhcb , with the latest release the code from above should now be working, however for the first request you would still not get any data.

nhcb commented 1 year ago

Cool, thanks! I appreciate the quick support :). I will give it a try.

nhcb commented 1 year ago

Hi @gutzbenj,

Thanks for the previous support. I apologise for coming back with his issue only later, however I noticed that not all data seems to be loaded in the NOAGhcnRequest.

When I run the following code for 'De Bilt' (thanks for fixing), it looks like there is systematic data (every year) missing between April and October. See:

# Select a station and load the data for the last 10 years
from wetterdienst.provider.noaa.ghcn.api import NoaaGhcnRequest, NoaaGhcnParameter
import datetime as dt

stations_object = NoaaGhcnRequest(
    parameter=[NoaaGhcnParameter.DAILY.TEMPERATURE_AIR_MEAN_200],
    start_date=dt.datetime(2015, 1, 1),
    end_date=dt.datetime(2022, 1, 1)
).filter_by_name('DE BILT')

print(stations_object)
def get_data_from_stations_request(
    stations_object: NoaaGhcnRequest,
) -> pd.DataFrame:
    """
    Takes a stations request object and process queries

    Args:
        stations_object: DwdObservationRequest object that holds all required information
        for downloading opendata dwd data

    Returns:
        DataFrame with content from DwdObservationRequest

    """
    observation_data = []

    for result in stations_object.values.query():
        observation_data.append(result.df)

    return pd.concat(observation_data)

df = get_data_from_stations_request(stations_object)
# Pivot the data based on parameter
df = df.pivot(index='date', columns='parameter', values='value').reset_index()
print(df)
# Plot the data
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.plot(df['date'], df['temperature_air_mean_200'], label='Avg temp')
#plt.plot(df['date'], df['temperature_air_min_200'], label='Min temp')
plt.title('Temperature at De Bilt')
plt.legend()
plt.show()

# Select subset as subplot 
subset = df[1:365]
plt.figure(figsize=(12,6))
plt.plot(subset['date'], subset['temperature_air_mean_200'], label='Avg temp')
#plt.plot(df['date'], df['temperature_air_min_200'], label='Min temp')
plt.title('Temperature at De Bilt (subset 1)')
plt.legend()
plt.show()

# Select different as subplot 
subset = df[1000:1500]
plt.figure(figsize=(12,6))
plt.plot(subset['date'], subset['temperature_air_mean_200'], label='Avg temp')
#plt.plot(df['date'], df['temperature_air_min_200'], label='Min temp')
plt.title('Temperature at De Bilt (subset 2)')
plt.legend()
plt.show()

image image

I Checked for MALIN HEAD, and while it wasn't the same dates and no systematics, there were certainly data missing. image image

gutzbenj commented 1 year ago

Dear @nhcb ,

all I can say for now is I HATE timezones :-D it is definitely related to the DST...

nhcb commented 1 year ago

Hi @gutzbenj,

I can imagine... There's a reason I convert to UTC instantly on my other projects ;). Thanks again for the quick support!

gutzbenj commented 1 year ago

Drafted a new release. Please check it once again!

nhcb commented 1 year ago

Awesome, I checked now for MALIN HEAD: image

image image

And for DE BILT: image image image

Not to be solely the bearer of bad news :(, however, it seems that for DE BILT somewhere between 2015-05 and 2015-06 (subset 1) there is missing data. Same for some dates in 2018-09 (subset 2). Subset 2 seems to again be due to DTS. The dates for 2015-05 I don't know why this is.

Perhaps it's an idea to bring this to some other communication channel? I would like to contribute as well, so I can get some complete time series!

I appreciate the support, enjoy the sunday!

gutzbenj commented 1 year ago

Dear @nhcb ,

thanks for the feedback! The gaps on DE BILT are actually all legit!

Regarding general data: We may implement data of the Dutch weather service if they are available for free and you are surely welcome to contribute in an way! The QC by you was already really helpful!