NIVANorge / catchment_processing_workflows

Examples and documentation for common workflows used by Catchment Processes (Section 317)
https://nivanorge.github.io/catchment_processing_workflows/
0 stars 0 forks source link

get_metno_ngcd_time_series error #1

Open Maginor opened 2 years ago

Maginor commented 2 years ago

I get this error when using the function do download thredds data. It worked when I had a shorter time span for the start and end dates.

image

JamesSample commented 2 years ago

From a quick check, I see that my error handling in that part of my code is awful - it's only explicitly catching a specific OSError error. Other kinds of OSError are silently ignored!

I think your code must be getting an OSError from Thredds, but it's not being passed on correctly by NivaPy. Try the function below (with all error handling removed) and see what you get.

def get_metno_ngcd_time_series(
    loc_df,
    par,
    st_dt,
    end_dt,
    id_col="station_code",
    xcol="longitude",
    ycol="latitude",
    crs="epsg:4326",
    ngcd_version="21.09",
    ngcd_type="type2",
):
    """Query time series from the NGCD based on points provided in a dataframe.
    Data is downloaded from Thredds here:

        https://thredds.met.no/thredds/catalog/ngcd/catalog.html

    Due to bandwidth restrictions imposed by Met, the function can be very slow.

    Args
        loc_df:       DataFrame. Locations of interest
        par:          Str. NGCD parameter of interest. One of the following:
                          'TG': daily mean temperature (K)
                          'TN': daily minimum temperature (K)
                          'TX': daily maximum temperature (K)
                          'RR': daily precipitation sum (mm)
        st_dt:        Str. Start date of interest 'YYYY-MM-DD'
        end_dt:       Str. End date of interest 'YYYY-MM-DD'
        id_col:       Str. Name of column in 'loc_df' containing a unique ID for each location
                      of interest
        xcol:         Str. Name of colum in 'loc_df' containing 'eastings' (i.e. x or longitude)
        ycol:         Str. Name of colum in 'loc_df' containing 'northings' (i.e. y or latitude)
        crs:          Str. A valid co-ordinate reference system for Geopandas. Most easily
                      expressed as e.g. 'epsg:4326' (for WGS84 lat/lon) or 'epsg:25833'
                      (ETRS89/UTM zone 33N) etc.
        ngcd_version: Str. Default '21.09'. Version of NGCD to use. See
                          https://thredds.met.no/thredds/catalog/ngcd/catalog.html
        ngcd_type:    Str. Either 'type1' or 'type2'. Default 'type2'. Interpolation method to use.
                      See
                          https://thredds.met.no/thredds/catalog/ngcd/catalog.html

    Returns
        Dataframe of time series data.
    """
    # Validate user input
    assert len(loc_df[id_col].unique()) == len(loc_df), "ERROR: 'id_col' is not unique."

    assert par in [
        "TG",
        "TN",
        "TX",
        "RR",
    ], "'par' must be one of ('TG', 'TN', 'TX', 'RR')."

    assert ngcd_type in [
        "type1",
        "type2",
    ], "'ngcd_type' must be either 'type1' or 'type2'."

    orig_len = len(loc_df)
    loc_df.dropna(subset=[id_col, xcol, ycol], inplace=True)
    if len(loc_df) < orig_len:
        print(
            "WARNING: 'loc_df' contains NaN values in the 'id_col', 'xcol' or 'ycol' columns. These rows will be dropped."
        )

    # Build geodataframe and reproject to CRS of NGCD
    ngcd_crs = "epsg:3035"
    gdf = gpd.GeoDataFrame(
        loc_df.copy(),
        geometry=gpd.points_from_xy(loc_df[xcol], loc_df[ycol], crs=crs),
    )
    gdf = gdf.to_crs(ngcd_crs)
    gdf["x_proj"] = gdf["geometry"].x
    gdf["y_proj"] = gdf["geometry"].y

    # Build OPENDAP URLs for period of interest
    dates = pd.date_range(st_dt, end_dt, freq="D")
    years = [date.year for date in dates]
    months = [date.month for date in dates]
    dates = [date.strftime("%Y%m%d") for date in dates]
    base_url = f"https://thredds.met.no/thredds/dodsC/ngcd/version_{ngcd_version}/{par}/{ngcd_type}/"
    urls = [
        f"{base_url}{years[idx]}/{months[idx]:02d}/NGCD_{par}_{ngcd_type}_version_{ngcd_version}_{date}.nc"
        for idx, date in enumerate(dates)
    ]

    print("Concatenating files from Thredds. This may take a while...")
    ds = xr.open_mfdataset(
        urls,
        concat_dim="time",
        combine="nested",
        data_vars="minimal",
        coords="minimal",
        compat="override",
        parallel=True,
    )

    print("Extracting values for points. This may take a while...")
    x_indexer = xr.DataArray(gdf["x_proj"], dims=[id_col], coords=[gdf[id_col]])
    y_indexer = xr.DataArray(gdf["y_proj"], dims=[id_col], coords=[gdf[id_col]])
    pts_ds = ds[par].sel(X=x_indexer, Y=y_indexer, method="nearest")
    pts_df = pts_ds.to_dataframe().reset_index()
    pts_df.rename({"time": "datetime"}, inplace=True, axis="columns")

    return pts_df

My original code only checks for OSErrors like "[Errno -90] NetCDF: file not found", which I think is what Thredds throws when you request a date/file that doesn't exist. Other kinds of error should be reported correctly, but if your code generates an OSError that isn't Errno -90 it'll just fail silently (and then fail to find the ds variable later).

Sorry about the silent error - I'll update the code in NivaPy to handle this better.

JamesSample commented 2 years ago

@Maginor I've updated the code in NivaPy so that you should at least see the correct error now. Could you either try the query again and let me know what error you get, or let me know the query parameters (date ranges etc.) so I can test it myself.

Thanks!

Maginor commented 2 years ago

I get "OSError: no files to open" image The xlsx file is in the quantom repository. But you could just copy those coordinates.

It works if I just do 2010. Maybe it needs to check the presence of the files before it starts concatenating? It usually has 1971 - now-ish though

JamesSample commented 2 years ago

Your st_dt in that notebook is set to 2075. Do you mean 1975? At present your date range will be empty, so no files to query.

I should add a check that end_dt > st_dt.

The easiest way I found to check for the presence of files on Thredds was just to try opening them (i.e. concatenating them), and then catch the error if they're not there. This is the [Errno -90] NetCDF: file not found error that I mention above. When this occurs, the function should show the error and then provide you with a link to check the available files for NGCD on Thredds.

However, in your notebook screenshot, there are no dates between st_dt and end_dt, so it's not trying to query anything.

Maginor commented 2 years ago

Hmm, it seems to work now when I set it to 1971. I thought I had it at 1975 earlier and that 2075 is just an error that happened when I retyped it for this report. Still, it would be good to check for basic errors, and to properly report missing files.

Maginor commented 2 years ago

Actually, with the new error reporting it now shows that it doesn't have data to the end of 2021. That was probably the issue.

JamesSample commented 2 years ago

I think that's a genuine issue on Thredds: when I wrote the function a couple of weeks ago, they only had data until September 2021. I see that within the last few days they've added a new version (but only for type2 so far). See the link included in the error handling.

You can specify the NGCD version in the call to the function - see the example notebook for details. If you pass ngcd_version="22.03", you should be able to get data to the end of 2021 (but only for type2 at present).

If Met are going to update the NGCD version every few months, I'll modify the function to find the latest version and default to that (rather than 21.09, which is the default at present). That wasn't clear to me when I wrote the function, but I see now that they release a new version roughly every 6 months.

I think missing files were properly reported in the previous version (i.e. the error you're getting now). The error you originally reported is different, but would be caused by passing in an empty series of dates (i.e. st_dt > end_dt). I'll add a check for this.

All of the error handling in these functions needs improving, but I want to get a functional outline first.

Maginor commented 2 years ago

New error:

syntax error, unexpected $end, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR context: ^ syntax error, unexpected $end, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR context: ^

...

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('https://thredds.met.no/thredds/dodsC/ngcd/version_21.09/TG/type2/2009/05/NGCD_TG_type2_version_21.09_20090515.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]

During the above exception, another exception occured: .....

OSError: [Errno -72] NetCDF: Malformed or inaccessible DAP2 DDS or DAP4 DMR response: b'https://thredds.met.no/thredds/dodsC/ngcd/version_21.09/TG/type2/2009/05/NGCD_TG_type2_version_21.09_20090515.nc'

JamesSample commented 2 years ago

That's a Thredds error. It implies that either the dataset for 15.05.2009 on Thredds is corrupt or (more likely) your network connection to Thredds was interrupted while the code was running.

I've just tried running the example notebook with

par = "TG"
st_dt = "2009-01-01"
end_dt = "2010-01-01"

and it finished OK, so the dataset seems valid. I guess that means your Thredds connection was interrupted.

There's not a lot I can do about this, but one option is to apply to Met for priority access. I believe "public" access to Thredds is heavily throttled to keep bandwidth manageable, but it is possible to be "whitelisted" and given priority access. I don't know how this is evaluated, but presumably you need to make a case as to why you need better access.

The other option is to spilt your date range into chunks and query a few years at a time. Then at least you don't have to start from scratch if the network connection to Thredds gets dropped.

JamesSample commented 2 years ago

Do you specifically need NGCD? The GTS API (get_nve_gts_api_time_series) is much nicer to use.

Maginor commented 2 years ago

I just wanted to test it. I can maybe make a wrapper that gets a year at a time. But for now I can just use GTS.

JamesSample commented 2 years ago

Thanks for testing. I understand that using NGCD is annoying. ERA5 is similar, although at least the cdsapi seems more robust, in that it does eventually prepare the data you ask for, whereas "public" Thredds occasionally just gives up.

I don't think there's any way of improving this at NIVA's end (but suggestions welcome). As mentioned in the notebook, I sometimes find running Thredds queries overnight is faster and more reliable.