bopen / c3s-eqc-toolbox-template

CADS Toolbox template application
Apache License 2.0

Notebook: glaciers - some issues (WP5) #146

Closed. yoniv1 closed this issue 6 months ago.

yoniv1 commented 6 months ago

Notebook description

There are some issues to be resolved in the notebook:

Notebook link or upload

glacier_distribution31324.zip

Anything else we need to know?

No response

Environment

malmans2 commented 6 months ago

Is this a new template, or do you want to change an existing template? If the latter, please let me know the name of the template to change in this repository.

yoniv1 commented 6 months ago

Hi Mattia, This is an update of https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/wp5/glacier_distribution.ipynb but I also included a part of https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/wp5/glacier_mass_balance_timeseries.ipynb

malmans2 commented 6 months ago

Are you sure you started from the latest version of the template? It doesn't look like you are using the code we revised.

Let's start with the bars. This is the code to make bars in the template:

year = pd.DataFrame(
    [gdf["BGNDATE"].str[:4].astype(int), gdf["ENDDATE"].str[:4].astype(int)]
)
gdf["year"] = year.where(year > 0).mean().round().astype("Int64")
size = gdf.set_index("year").groupby("year").size()
ax = size.plot.bar(figsize=(15, 5), grid=True, ylabel="Number of glaciers")

There's much more code in your notebook and the results are different. The template shows no data in 2012, while your code shows no data in 2013. Which one is wrong?

yoniv1 commented 6 months ago

Hi Mattia,

Indeed, I changed the code a bit. Instead of just using the year of the dates (e.g. in our earlier example 20090703 would become 2009), the code now transforms the date into decimal years, taking the month and day of the year into account (e.g. 20090703 would become 2009.5 or so). If you round those decimal years (as done for the bar plot), the date used in the bar plot becomes 2010 instead of 2009, hence the slightly different results, I suppose. If you apply np.floor instead of .round(), the results shift a bit and the bar plot again shows no data in 2012, as in the original example (maybe np.floor is indeed the better choice). I did this transformation to decimal years, rather than using the "cut-off year" from the attribute table, in order to more accurately calculate the average and area-weighted average year of digitization of the entire dataset later in the notebook (with gdf["year_dec"]). The results of the Python code then also match my Matlab program:

[Figures: bar plots of the number of glaciers per digitization year, from Matlab and from Python]
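
As a minimal sketch of this conversion (simplified, not the notebook's exact code; it assumes BGNDATE holds valid YYYYMMDD strings, with missing month/day already replaced by 0101):

import numpy as np
import pandas as pd

dates = pd.to_datetime(gdf["BGNDATE"], format="%Y%m%d")
gdf["year_dec"] = dates.dt.year + (dates.dt.dayofyear - 1) / (365 + dates.dt.is_leap_year)
# e.g. 20090703 -> ~2009.50; .round() would give 2010, while np.floor() keeps 2009
gdf["year"] = np.floor(gdf["year_dec"]).astype("Int64")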

malmans2 commented 6 months ago

mmm I still can't reproduce your results. I added the code to do what you described, but I get different results (see the latest template). Anyway, not a big deal, you can change the computation of the digitization year the way you prefer for your notebook.

Under the bar plot, I have printed "Date of digitization data are missing for 604 glaciers or 0.2790753549662938% of the dataset.". I would like to have this printed as a text box on the figure itself (upper left corner) with 2-digit precision (0.28%). Also, the years on the x-axis do not seem to be uniform (some years are skipped in the labels)?

This is now implemented. Not sure what the problem with the text boxes was, but it worked fine for me. I also added the missing years to the dataframe (they were not shown because they are not in the data).
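
For reference, a minimal sketch of that kind of annotation with matplotlib (not necessarily the template's exact code; here ax is the bar-plot axes, and missing and total are assumed to hold the number of glaciers without a digitization date and the total number of glaciers):

ax.text(
    0.01, 0.95,
    f"Date of digitization missing for {missing} glaciers "
    f"({100 * missing / total:.2f}% of the dataset).",
    transform=ax.transAxes, ha="left", va="top",
    bbox={"facecolor": "white", "alpha": 0.8},
)

The transform=ax.transAxes keyword places the text in axes coordinates, so (0.01, 0.95) is the upper left corner regardless of the data range.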

Here is the template: https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/wp5/glacier_distribution.ipynb Here is the notebook executed: https://gist.github.com/malmans2/d8dda9942a0eaa227526be0c99887895

I'll take a look at the other requests later.

malmans2 commented 6 months ago

~Actually, nevermind. I was looking at the old figure. I get the same results now.~

Sorry again. I was actually right: I cannot reproduce your results.

malmans2 commented 6 months ago

Unless I misunderstood what you are trying to do, I'm pretty sure you have a bug. Neither BGNDATE nor ENDDATE show records in 2014.

malmans2 commented 6 months ago

Same for "The global arithmetic mean year of digitization is 2003.079402293525 AD, while the global area-weighted mean is 2001.2994250034699 AD." (lower left corner of the plot).

I added this, but my results are quite different. I get 2000.22 vs 2000.72. See the latest template.

I have made a 0.5 degree grid (rgi_raster) with the RGI region of each pixel (the values go from 1 to 19 and the grid is 1x360x720 in dimensions, so the spatial domain and resolution match the mass change data loaded from the CDS). The RGI region is simply the region in which a glacier is situated; there are 19 of them (e.g. New Zealand, Caucasus, Alaska, etc.), each identified by a number up to 19. What the code currently does is calculate the cumulative glacier mass change over all grid points of the world domain (360x720) for each time step (46 years), i.e. the variable "Cumulative" in "ds2" (see print) is a 46-point array ("ds2 = ds2.sum(("latitude","longitude"),keep_attrs=True)"). What I would like is such a cumulative glacier mass change array for each of the RGI regions. It could be a loop through the values 1 to 19 (if rgi_raster == 1, if rgi_raster == 2, ...) that calculates the cumulative mass change of all glaciers (mass change pixels) with an rgi_raster value of 1, 2, ..., 19. In the end, I would like 19 cumulative glacier mass change arrays of 46 years each (for RGI regions 1, 2, ..., 19), and preferably also one additional cumulative glacier mass change array with all spatial grid points included, as there is now (e.g. using rgi_raster > 0 or so). These could be named cumulative_rgi_1, cumulative_rgi_2, ... for example. I hope it is clear.

Sorry, but this is not clear. You said "what you do now in the code is calculate the cumulative glacier mass change", but I don't see such a computation in the template. All plots show the year of digitization. Also, I don't see why the interpolation is needed if you already know the region for all data in the geodataframe. I guess I need to know how to compute the mass change first; then we can test it on the whole dataset, and finally I can show you how to do it for single regions.

yoniv1 commented 6 months ago

Hi Mattia,

I have to go now, I will look into it later tonight. Thanks anyway!

malmans2 commented 6 months ago

No rush from my side. Have a good evening!

yoniv1 commented 6 months ago

So for my third bullet point:

I have written the code below myself; it produces a raster with the RGI regions, based on the attribute table of the shapefile loaded from the glacier extent data on the CDS (it is included in the notebook I uploaded in this thread):

## Create a 0.5 degree map of pixels with corresponding RGI regions
import numpy as np
import rasterio
import rioxarray
from rasterio.crs import CRS
from rasterio.transform import Affine
from scipy.interpolate import griddata

# Extract relevant data
rgi = gdf['region'].values
lon = gdf['CENLON'].values
lat = gdf['CENLAT'].values
# Set resolution and create a global grid of 0.5 degree cell centres
resolution = 0.5
x_range = np.arange(-179.75, 179.75 + resolution, resolution)
y_range = np.arange(-89.75, 89.75 + resolution, resolution)
grid_x, grid_y = np.meshgrid(x_range, y_range)
# Interpolate the region numbers onto the grid (nearest neighbour)
points = list(zip(lon, lat))
grid_z = griddata(points, rgi, (grid_x, grid_y), method='nearest')
# Define transformation and CRS
transform = Affine.translation(
    grid_x[0][0] - resolution / 2, grid_y[0][0] - resolution / 2
) * Affine.scale(resolution, resolution)
crs = CRS.from_epsg(4326)
# Write the interpolated raster to file and read it back as a DataArray
interp_raster = rasterio.open(
    'RGI.tif',
    'w',
    driver='GTiff',
    height=grid_z.shape[0],
    width=grid_z.shape[1],
    count=1,
    dtype=grid_z.dtype,
    crs=crs,
    transform=transform,
)
interp_raster.write(grid_z, 1)
interp_raster.close()
rgi_raster = rioxarray.open_rasterio("RGI.tif").round()

The rgi_raster looks like this (it has values going from 1 to 19 and is 360x720 in spatial dimensions):

[Figure: rgi_raster, RGI region values 1 to 19 on the global 0.5 degree grid]

Next, the following code is something you wrote to calculate cumulative glacier mass changes (in https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/wp5/glacier_mass_balance_timeseries.ipynb). It uses another CDS dataset (both the glacier extent and the mass change data are loaded in the notebook):

# Calculate glacier mass changes
ds2 = ds2.sum(("latitude","longitude"),keep_attrs=True)
ds2["time"] = ds2["time"].dt.year
ds2["time"].attrs |= {"long_name": "Time", "units": "yr"}
for da in ds2.data_vars.values():
    da.attrs["long_name"] = da.attrs["long_name"].replace("_", " ").title()
cumulative = ds2["Glacier"].cumsum("time")
cumulative.attrs = {
    "units": ds2["Glacier"].attrs["units"],
    "long_name": f"Cumulative {ds2['Glacier'].attrs['long_name']}",
}
ds2["Cumulative"] = cumulative

It produces an xarray.Dataset that looks as follows:

print(ds2)
<xarray.Dataset>
Dimensions:      (time: 46)
Coordinates:
  * time         (time) int64 1976 1977 1978 1979 1980 ... 2018 2019 2020 2021
Data variables:
    Glacier      (time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    Uncertainty  (time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    Cumulative   (time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
Attributes:
    title:         Global gridded annual glacier mass change product
    project:       Copernicus Climate Change Service (C3S) Essential Climate ...
    data_version:  version-wgms-fog-2022-09
    institution:   World Glacier Monitoring Service - Geography Department - ...
    created_by:    Dr. Ines Dussaillant   ines.dussaillant@geo.uzh.ch
    references:    Fluctuation of Glagiers (FoG) database version wgms-fog-20...
    citation:      
    Conventions:   CF Version CF-1.8
    comment:       Brief data description:Horizontal resolution:\t0.5° (latit...

And a cumulative glacier mass change array for every year (46 time dimensions):

print(ds2["Cumulative"].values)
[   90.71168102    80.79513611    72.87584419    52.82544759
   133.55242267   219.76393117   192.97008934   333.400565
   304.19751212   336.89578928   406.74458716   541.73057733
   554.34783331   536.2654241    287.82241853   144.70144587
   278.37364045   190.52569603   101.31548142   -99.13062512
  -183.00136248  -510.8472703   -669.49067903  -898.94472138
 -1043.37351895 -1205.67214689 -1366.43773165 -1556.73190548
 -1935.8221387  -2338.93546432 -2624.26239728 -2964.72087477
 -3139.39551596 -3397.19516061 -3694.51792765 -4084.66093178
 -4315.92381983 -4570.26046711 -4736.14224852 -4955.86465451
 -5334.61430501 -5608.31446033 -5825.96604111 -6326.7830968
 -6725.46587002 -7061.74942072]

However, the cumulative mass change array above is calculated for every grid point in the world (all glaciers, within RGI regions 1 to 19, i.e. over the entire 360x720 spatial grid, because the sum is taken over all latitudes and longitudes). What I want is to calculate 19 separate cumulative mass change arrays, one for each RGI region, based on rgi_raster. Of course, each RGI region has specific lat/lon boundaries, defined by rgi_raster. So the outcome should be a cumulative glacier mass change array for all pixels that have glaciers where rgi_raster == 1, rgi_raster == 2, etc. up to rgi_raster == 19, i.e. 19 regions times a 46-year-long (time dimension = 46 years) cumulative glacier mass change array. The ds2["Glacier"] variable, with t,lat,lon dimensions of 46x360x720, holds these mass change data. That is why I interpolated the RGI data from the attribute table of the glacier_extent shapefile onto an rgi_raster: the raster now has the same spatial dimensions (1x360x720) as the glacier mass change data that are loaded in (46x360x720), which makes the calculations easier. I hope it is clear now!
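
A rough sketch of what I mean (illustrative only, not working template code): it assumes a hypothetical rgi2d, a 2-D version of rgi_raster already aligned with the latitude/longitude grid of ds2 before the global sum, so that ds2["Glacier"] still has dims (time, latitude, longitude):

import xarray as xr

# rgi2d: hypothetical 2-D DataArray of RGI region numbers (1-19) on the same
# latitude/longitude grid as ds2["Glacier"]
cumulative_rgi = []
for region in range(1, 20):
    masked = ds2["Glacier"].where(rgi2d == region)
    series = masked.sum(("latitude", "longitude")).cumsum("time")
    cumulative_rgi.append(series.expand_dims(region=[region]))
cumulative_rgi = xr.concat(cumulative_rgi, "region")  # dims: (region, time) = (19, 46)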

malmans2 commented 6 months ago

Unfortunately, I'm still a little confused. ds2 suddenly appears both in your notebook and in the snippet. What is ds2? Do you want to use data from insitu-glaciers-extent or derived-gridded-glacier-mass-change?

Or maybe you are referring to the other template? So in glacier_mass_balance_timeseries.ipynb you want to infer the regions from insitu-glaciers-extent rather than the shapefile previously used.

(I noticed that you edited your previous message to include the name of the other template. I had only seen the very first version of your message yesterday, with just one template. Please don't edit messages when adding important information; I don't get notified about edits and I deal with many different notebooks.)

malmans2 commented 6 months ago

Ohh I see, you have commented-out code in your notebook that refers to a ds2 created at the very beginning of the notebook. So I guess you want to use "derived-gridded-glacier-mass-change"?

malmans2 commented 6 months ago

Hi @yoniv1,

I don't fully follow what's happening in the code. Why are you computing the mask raster instead of using the same shapefile used to classify the glaciers? It's publicly available from the GLIMS website, and it looks like it's pretty much the same as the one from the GTN-G website that you've been using in the other notebook.

import geopandas as gpd
import regionmask
import cartopy.crs as ccrs

shapefile_url = "https://www.glims.org/RGI/rgi60_files/00_rgi60_regions.zip"
gdf = gpd.read_file(shapefile_url)
gdf = gdf.dissolve(by="RGI_CODE", as_index=False)
regions = regionmask.from_geopandas(
    gdf, numbers="RGI_CODE", names="FULL_NAME", abbrevs="WGMS_CODE"
)
ax = regions.plot(
    projection=ccrs.Robinson(),
    add_ocean=True,
)
ax.set_global()

[Figure: RGI region polygons plotted on a Robinson projection with ocean shading]

malmans2 commented 6 months ago

Also, my masks look quite different:

import geopandas as gpd
import regionmask
import cartopy.crs as ccrs
from c3s_eqc_automatic_quality_control import download, plot

# Glacier mass change data (CDS)
period_start = "1975_1976"
period_stop = "2020_2021"
assert all("_" in period and len(period) == 9 for period in (period_start, period_stop))
y0_start, y1_start = map(int, period_start.split("_"))
y0_stop, y1_stop = map(int, period_stop.split("_"))
collection_id2 = "derived-gridded-glacier-mass-change"
request2 = {
    "variable": "glacier_mass_change",
    "product_version": "wgms_fog_2022_09",
    "format": "zip",
    "hydrological_year": [
        f"{y0}_{str(y1)[-2:]}"
        for y0, y1 in zip(range(y0_start, y0_stop + 1), range(y1_start, y1_stop + 1))
    ],
}
ds2 = download.download_and_transform(
    collection_id2,
    request2,
    chunks={"hydrological_year": 1},
)

# shapefile
shapefile_url = "https://www.glims.org/RGI/rgi60_files/00_rgi60_regions.zip"
gdf = gpd.read_file(shapefile_url)
gdf = gdf.dissolve(by="RGI_CODE", as_index=False)
regions = regionmask.from_geopandas(
    gdf, numbers="RGI_CODE", names="FULL_NAME", abbrevs="WGMS_CODE"
)

# mask
mask = regions.mask(ds2)
plot.projected_map(mask)

[Figure: RGI region mask plotted on the 0.5 degree glacier mass change grid]

malmans2 commented 6 months ago

I'll wait for your feedback to continue working on this. In summary:

yoniv1 commented 6 months ago

Hi @malmans2

malmans2 commented 6 months ago

Importing a publicly available shapefile should not be a problem (you already do it in your other template, and many other evaluators do the same). Furthermore, if you use the URL from GLIMS that I showed you, you can explicitly use the same version as the data downloaded from the CDS (6.0). The data source is clearly defined in the CDS documentation of your dataset (this is how I found out that you don't need to infer the region shapes).

Anyway, it's your call. We only store templates in this repo; you can change it as you like in the final version you are going to submit.

I'll add the cumulative mass change time series to this template, although it's pretty much just copy/paste from the other template.

malmans2 commented 6 months ago

Done. If you prefer to use a different masking, just change the mask variable. I added a quick and dirty plot: [Figure: per-region cumulative mass change]

Let me know if it's OK now.

yoniv1 commented 6 months ago

Hi Mattia,

This is indeed what I'm after, but how would I need to adjust the code to use my own rgi_raster instead of the GTN-G shapefile for the masking of the regions (just in case)?

Thanks

malmans2 commented 6 months ago

You can compute the boolean variable mask the way you like. The only constraint is that it must have dimension (region, latitude, longitude).
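
For example, one possible way of building such a mask from your rgi_raster (a sketch only, not the template's code; it assumes the raster grid can be aligned with the dataset's latitude/longitude coordinates, where ds is the mass change dataset as in the snippet below):

import xarray as xr

# squeeze the band dimension and rename the raster dims to match the dataset
rgi2d = rgi_raster.squeeze("band", drop=True).rename({"x": "longitude", "y": "latitude"})
rgi2d = rgi2d.assign_coords(latitude=ds["latitude"], longitude=ds["longitude"])  # assumes same grid and orientation
# one boolean layer per RGI region -> dims (region, latitude, longitude)
mask = xr.concat(
    [(rgi2d == region).expand_dims(region=[region]) for region in range(1, 20)],
    "region",
)
ds = ds.where(mask)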

I don't fully understand your implementation. For example, why do you need to specify an intermediate grid if you are going to use nearest neighbour?

This is how I would change the few lines after # Mask data, but maybe I'm overlooking something:

import xesmf as xe
import xarray as xr

# Mask data
regions = df["RGIID"].str[6:8].astype(int)
da = regions.to_xarray().assign_coords(
    lon=df["CENLON"].to_xarray(),
    lat=df["CENLAT"].to_xarray(),
)
regridder = xe.Regridder(da, ds, locstream_in=True, method="nearest_s2d")
mask_2d = regridder(da)
mask = xr.concat(
    [(mask_2d == region).expand_dims(region=[region]) for region in regions.unique()],
    "region",
)
ds = ds.where(mask)

yoniv1 commented 6 months ago

Thanks, Mattia! It indeed works. I now have an almost complete notebook. The only weird thing is a warning that I get:

# Determine volume under- or overestimation
vol_est = np.zeros(np.max(ds2["region"].values))
for i in range(0,np.max(ds2["region"].values)):
    y_interp = scipy.interpolate.interp1d(ds2["time"].values,ds2["Cumulative"].values[:,i])
    vol_est[i] = 1.091*(y_interp(weighted_year.values[i]) - ds2["Cumulative"].values[:,i][np.where(ds2["time"].values==2000)])
/data/wp5/.tmp/ipykernel_11583/2458034537.py:32: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  vol_est[i] = 1.091*(y_interp(weighted_year.values[i]) - ds2["Cumulative"].values[:,i][np.where(ds2["time"].values==2000)])

Here is the new notebook where everything should work except this error: glacier_distribution (4).zip

malmans2 commented 6 months ago

I'll implement this in the template. There's no need to use a for loop to compute the interpolated variables, since you are working with arrays.
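
For illustration, one possible vectorised alternative (a sketch only, not necessarily how the template implements it; it assumes weighted_year is an xarray DataArray with a region dimension aligned with ds2):

import numpy as np
import xarray as xr

cumulative = ds2["Cumulative"].compute()  # load the dask-backed values
interp_at_year = xr.apply_ufunc(
    np.interp,
    weighted_year,       # year at which to interpolate, one value per region
    cumulative["time"],  # x-axis: years
    cumulative,          # y-values: cumulative mass change per region
    input_core_dims=[[], ["time"], ["time"]],
    vectorize=True,
)
vol_est = 1.091 * (interp_at_year - cumulative.sel(time=2000))

This also avoids the NumPy deprecation warning, because no size-1 array is assigned to a scalar.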

In the meantime, the first 2 bullet points here are important. Please let us know if the template is correct.

Finally, I suggest starting from the template and adding comments / tuning plots. It looks like you didn't implement most of the code we revised. The code in the template is meant to be optimised, readable, and in line with the code written by other evaluators.

malmans2 commented 6 months ago

The template is ready:

Please let me know if we can close this issue.

yoniv1 commented 6 months ago

Hi Mattia,

I looked in detail at the differences between our two lists for the bar plot (gdf["year"].values). There are a few differences (if 9999 is given for the month and day, I just assume it to be 0101, i.e. the 1st of January):

Now for the average and area-weighted average. The differences arise because of (1) the slight differences above; (2) the fact that I use decimal years to calculate the averages, while you still use the same list as used for the bar plot (rounded years, without taking the month and day into account to compute decimal years); for example, for the first index (BGNDATE = 20090703 and ENDDATE = -9999999), your year inserted into the calculations would be 2009, while mine would be 2009.50376454; and (3) your arithmetic mean is lower than mine because you average the grouped regional means rather than taking the average of the dataset as a whole. Because the number of glaciers varies per region, the arithmetic mean over the groups differs from the mean over the complete dataset:

print(mean.mean())
print(gdf["year"].mean())
2000.7232804219457
2002.1790339395343

I think the combination of those things explains the differences.
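
For illustration, a tiny made-up example of point (3), i.e. why the mean of the regional means differs from the mean over all glaciers when the regions contain different numbers of glaciers:

import pandas as pd

toy = pd.DataFrame({"region": ["A", "A", "A", "B"], "year": [2000, 2000, 2000, 2008]})
print(toy.groupby("region")["year"].mean().mean())  # 2004.0, unweighted mean of the two group means
print(toy["year"].mean())                           # 2002.0, mean over all four rows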

malmans2 commented 6 months ago

Hi @yoniv1, There are some inconsistencies in what you are describing.

Let's start with the first:

if 9999 are given for the month and day, I just assume it to be 0101, i.e. 1st of January

I do the same after you requested it. See .replace("99", "01")

Index 27105 to 39567: BGNDATE is here 20049999 and ENDDATE 20069999. For some reason, your dataset returns 2004 and mine 2005. The correct value is 2005 ((2004+2006)/2 = 2005).

Why? If what you described above is true, the average of 20040101 and 20060101 is 20041231. Instead, you are doing an average of the year only. See pd.to_datetime(["20040101", "20060101"]).mean()

malmans2 commented 6 months ago

(Note that 2004 is a leap year)
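
A quick check with pandas illustrates this; the midpoint of the two dates falls on the last day of 2004 because 2004 has 366 days:

import pandas as pd

pd.to_datetime(["20040101", "20060101"]).mean()
# Timestamp('2004-12-31 12:00:00') -> year 2004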

yoniv1 commented 6 months ago

Hi Mattia,

Sorry for the trouble. Ah, I see! Okay, in that case your method seems correct. It just looked intuitively strange to have 2004 as the mean date of 20040101 and 20060101 (if you average the decimal years, 2004.00 and 2006.00, you get 2005.00; I think it is a special case due to the leap year involved in the calculation). But with 2004 being a leap year, averaging the dates themselves indeed results in 20041231, which then gives the year 2004.

malmans2 commented 6 months ago

OK, no worries. The important thing is that it's clear what's causing the difference. What about issue (2)? I don't follow exactly your reasoning. I think the difference is also explained by the fact that leap years are taken into account. Does it make sense?

yoniv1 commented 6 months ago

Yes, the same happens for 1999-01-01 and 2003-01-01: the average date is 2000-12-31, which gives 2000. Also, 1962-01-01 and 1976-01-01 yield 1968-12-31, which gives 1968. So I think the leap years indeed caused the confusion.

malmans2 commented 6 months ago

OK cool. Can I close this issue or do you need anything else?

yoniv1 commented 6 months ago

I think it is resolved. Apart from that, maybe you have a more efficient way of transforming the average dates into decimal years to insert into the area-weighted average calculation?

malmans2 commented 6 months ago

Ohh, I didn't see that. Then it's not true that your point (2) here is due to leap years. Sorry, but I deal with very different use cases every day.

I'll implement that in the template.

malmans2 commented 6 months ago

It's implemented now. Note that I take into account leap years to compute the decimal years.

Please take a look and let me know.

yoniv1 commented 6 months ago

Hi Mattia,

Great, thanks for the work. I'm really happy with the result! Can be closed.

malmans2 commented 6 months ago

FYI, I realised there was a typo in the template, although it doesn't make much difference. It's fixed now.

- gdf["decimal_year"] = year + (dates.dt.dayofyear - 1) / (364 + dates.dt.is_leap_year)
+ gdf["decimal_year"] = year + (dates.dt.dayofyear - 1) / (365 + dates.dt.is_leap_year)

yoniv1 commented 5 months ago

Hi Mattia,

For the glacier extent data (https://cds.climate.copernicus.eu/cdsapp#!/dataset/insitu-glaciers-extent?tab=overview), I'm trying to download the gridded data (for this notebook: https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/wp5/glacier_distribution.ipynb). I'm doing it as follows:

# Define request
request_extent_gridded = (
    "insitu-glaciers-extent",
    {
        "variable":"all",
        "product_type":"gridded",
        "format":"zip",
        "version":"6_0",
    },
)
# Get glacier extent gridded data
df2 = download.download_and_transform(*request_extent_gridded)

But it gives this error:

Exception: Client has not agreed to the required terms and conditions. To access this resource, you first need to accept the terms of 'UZH Glaciers Extent licence' at https://cds.climate.copernicus.eu/cdsapp/#!/terms/licence-to-use-insitu-glaciers-extent

Is there a workaround for this? Does it also appear with you? Can you download the gridded data and plot it?

Thanks

malmans2 commented 5 months ago

No need to use a workaround, it's a supported feature. You need to accept the terms and conditions and use your own credentials (e.g., use your own cdsapirc).

Other users asked similar questions. For example, see this: https://github.com/bopen/c3s-eqc-toolbox-template/issues/94#issuecomment-1693602743

Let me know if you're able to use your credentials.

yoniv1 commented 5 months ago

Sorry, I don't know what to do. I accepted the terms with my account, and now I am on the CDS page with my CDS API URL and key. What is the file $HOME/.cdsapirc in my personal directory on the VM?

malmans2 commented 5 months ago

Create this file on the VM: /data/wp5/verhaegen_yoni/.cdsapirc. The content of the file is described here: https://cds.climate.copernicus.eu/api-how-to. Basically, .cdsapirc contains the key and URL information needed to access the CDS.
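
For illustration, the file typically contains just two lines (the exact values come from your CDS profile page; the key below is only a placeholder):

url: https://cds.climate.copernicus.eu/api/v2
key: <UID>:<API-KEY>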

When you are done, add these lines on top of your notebook and re-run your notebook:

import os
os.environ["CDSAPI_RC"] = os.path.expanduser("~/verhaegen_yoni/.cdsapirc")

Does it work?

yoniv1 commented 5 months ago

Ah, ok! I didn't realize I had to create it myself, sorry. Yes, now it works, thanks!!