adamancer / pyncei

Use Python to access data from NOAA's Climate Data Online Web Services v2 API
MIT License

Key error for some stations #5

Open skfrost01 opened 8 months ago

skfrost01 commented 8 months ago

There is a good chance this is a user error, but I am running into the following error, specifically when pulling GHCND and PRCP data. If I follow the example and generate a response, there appears to be data, but using to_dataframe() throws the following error for some stations:

`File "/PycharmProjects/Regenerate/.venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
    response.to_dataframe()
File "/PycharmProjects/Regenerate/.venv/lib/python3.12/site-packages/pyncei/bot.py", line 1068, in to_dataframe
    df = pd.DataFrame(self.values())
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/PycharmProjects/Regenerate/.venv/lib/python3.12/site-packages/pandas/core/frame.py", line 832, in __init__
    data = list(data)
           ^^^^^^^^^^
File "/PycharmProjects/Regenerate/.venv/lib/python3.12/site-packages/pyncei/bot.py", line 1010, in values
    yield {k: val[k] for k in self.key_order if k in keys}
KeyError: 'station'`

Is there anything I can do to avoid this, or to check for the issue before running `to_dataframe()` so that it doesn't error out?

adamancer commented 8 months ago

Can you please provide an example of code that produces this error?

skfrost01 commented 8 months ago

Thank you for the quick response!

year = 2023
response = NCEIResponse()
while year >= 2000:
    response.extend(
        ncei.get_data(
            datasetid="GHCND",
            stationid='GHCND:USC00140637',
            datatypeid=["PRCP"],
            startdate=date(year, 1, 1),
            enddate=date(year, 12, 31)
        )
    )
    year -= 1

df_precip_temp = response.to_dataframe()

This is an example of one of the stations that throws this error for me. Plenty of others, such as GHCND:USC00347390, work as expected.

skfrost01 commented 8 months ago

This gives a similar error, but with a different key: KeyError: 'elevation'

x = ["FIPS:56"]
stations = ncei.get_stations(
    datasetid="GHCND",
    datatypeid=["PRCP"],
    locationid=x,
    startdate=mindate,
    enddate=maxdate,
)
df_stations = stations.to_dataframe()

adamancer commented 8 months ago

Both these errors result from bugs in how this library handles missing data. In the first example, it looks like there is a gap in the data for that station between 1951 and 2003; the missing years are causing the errors. In the second, certain stations are missing the elevation parameter. I've patched the issue on GitHub but expect it will be a bit before I do a new release. In the meantime, you can install the GitHub code as follows:

git clone https://github.com/adamancer/pyncei
cd pyncei
pip install .

And here is a version of your code that should catch the missing years. It does require you to use the development code.

response = NCEIResponse()
for year in range(2000, 2024):
    resp = ncei.get_data(
        datasetid="GHCND",
        stationid='GHCND:USC00140637',
        datatypeid=["PRCP"],
        startdate=date(year, 1, 1),
        enddate=date(year, 12, 31)
    )

    if resp:
        response.extend(resp)
    else:
        print(f"No data found for {year}")

response.to_dataframe()

Let me know if that solves the problem for you.
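
As a rough illustration of the missing-data failure described in this comment, the sketch below reproduces the same kind of KeyError with plain dictionaries and shows one tolerant alternative. It is not the library's code or the actual patch: the records, the station IDs, and the helper names `strict_rows` and `tolerant_rows` are all made up for the example.

```python
# Hypothetical records standing in for rows parsed from API responses.
# The second record lacks "elevation", as some stations reportedly do.
records = [
    {"id": "GHCND:A", "name": "Station A", "elevation": 312.4},
    {"id": "GHCND:B", "name": "Station B"},
]
key_order = ["id", "name", "elevation"]


def strict_rows(records, key_order):
    """Index every key directly; raises KeyError if a record lacks one."""
    return [{k: rec[k] for k in key_order} for rec in records]


def tolerant_rows(records, key_order, default=None):
    """Fill keys that a record lacks with a default instead of raising."""
    return [{k: rec.get(k, default) for k in key_order} for rec in records]


try:
    strict_rows(records, key_order)
except KeyError as err:
    print(f"strict lookup failed on missing key: {err}")

rows = tolerant_rows(records, key_order)
print(rows[1])
```

Filling with `None` keeps every row the same shape, so a table built from the rows would show a missing value rather than fail partway through.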

skfrost01 commented 8 months ago

Thanks for digging into this! It works now for USC00140637 and generally seems to have a higher success rate, but some stations are still failing. For example: USC00031459, USC00145870, and USC00340017. I have this set up to pull stations within a radius of a point, so these are just a random selection of the ones that failed.

adamancer commented 8 months ago

Can you please provide the code that is producing the error? When I plug those stations into the code above, it seems to run fine.

skfrost01 commented 8 months ago

Here is my uncommented, data science-esque code in all of its inefficient glory... Maybe I did something wrong and am still using the original pyncei code?

lat = 35.00
lon = -97.05
distance = 125 #km

df_stations = pd.read_csv('stations.csv')
gdf_stations = gpd.GeoDataFrame(df_stations,
                                geometry=gpd.points_from_xy(df_stations['longitude'], df_stations['latitude']),
                                crs='EPSG:4326')

gdf_stations_proj = gdf_stations.to_crs('EPSG:3395')
site = gpd.GeoSeries([Point(lon, lat)], crs='EPSG:4326').to_crs('EPSG:3395')

gdf_stations_proj['distance'] = gdf_stations_proj.distance(site[0])
gdf_ref = gdf_stations_proj[gdf_stations_proj['distance'] <= distance * 1000]  # Filter for distances within set distance
df_precip = pd.DataFrame()

for id in gdf_ref["id"].unique():
    year = 2023
    ncei = NCEIBot("********************************", cache_name="ncei")
    response = NCEIResponse()
    for year in range(2000, 2024):
        resp = ncei.get_data(
            datasetid="GHCND",
            stationid=id,
            datatypeid=["PRCP"],
            startdate=date(year, 1, 1),
            enddate=date(year, 12, 31)
            )
        if resp:
            response.extend(resp)
        else:
            print(f"No data found for {year}")

    df_precip_temp = response.to_dataframe()
    df_precip = pd.concat([df_precip, df_precip_temp])

I am also attaching a copy of stations.csv, which is a bulk pull using ncei.get_stations.

adamancer commented 8 months ago

Hmm, I can't reproduce the error without falling back to the release version on PyPI. I'm a little mystified by the error popping up for these stations but not for the station we discussed earlier. Is the traceback the same?

Can you run pip freeze in your command line and locate pyncei in the output? If you've installed it from PyPI, it should show up as pyncei==1.0; otherwise, there should be a path to a file on your computer.
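
The same check can be sketched from inside Python with the standard library. This assumes a reasonably recent pip, which records a `direct_url.json` file (PEP 610) for installs from a local directory or a VCS URL but not for installs from PyPI. The helper name `describe_install` is made up for the example, and the version number alone may not distinguish the development code from the release.

```python
import json
from importlib import metadata


def describe_install(package):
    """Return (version, source_url) for an installed package, or None."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    # direct_url.json is only present for local-directory or VCS installs,
    # so a missing file suggests the package came from an index.
    raw = dist.read_text("direct_url.json")
    source_url = json.loads(raw).get("url") if raw else None
    return dist.version, source_url


info = describe_install("pyncei")
if info is None:
    print("pyncei is not installed in this environment")
else:
    version, source_url = info
    origin = source_url or "a package index such as PyPI"
    print(f"pyncei {version}, installed from {origin}")
```

Run it with the interpreter of the virtual environment that the failing script uses, since each environment has its own installed packages.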

And a friendly word of warning--you don't want to share an API token publicly. I tried to obscure it above but it's still in the comment history. Be careful pasting code in a public forum.
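
One common way to keep a token out of pasted code is to read it from an environment variable. The sketch below assumes a variable named `NCEI_TOKEN`. Both that name and the `load_token` helper are made up for illustration and are not part of pyncei.

```python
import os


def load_token(env_var="NCEI_TOKEN"):
    """Read the API token from the environment instead of hardcoding it.

    NCEI_TOKEN is a hypothetical variable name chosen for this example.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token


# Usage with the call pattern from the code posted earlier in the thread:
# ncei = NCEIBot(load_token(), cache_name="ncei")
```

With this pattern, the script can be shared as-is because the token never appears in the source.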

skfrost01 commented 8 months ago

Yep, you're right, I didn't install the updated version correctly the first time (still not sure how that station ran, I checked it like 3 times). Anyways, appreciate the help!

YufanZheng commented 6 months ago

[Attached screenshots: Screenshot 2024-05-03 130027, Screenshot 2024-05-03 125935, Screenshot 2024-05-03 125750]

YufanZheng commented 6 months ago

Hello, I'm having a similar issue. I checked the version of the package, and it is 1.0. I also checked whether NOAA responds to the request; the server appears to be returning data, but the package can't convert it into a data frame.

I would appreciate it if you could help me with this.

YufanZheng commented 6 months ago

[Attached images: Weixin Image_20240503132402, Weixin Image_20240503132502]

I fixed the issue. When the response is 1, the data is actually missing. As a result, errors occur when stitching together data from different years. I modified the code to handle this case so the exception no longer occurs.