abby-baskind / seniorthesis


renaming this issue because i actually elaborated on it and it's GFDL being problematic #18

Closed abby-baskind closed 2 years ago

abby-baskind commented 2 years ago

@gmacgilchrist

# fig, ax = plt.subplots(nrows = 1, figsize=[18,12])
plt.rc('font', size = 20)
plt.rc('axes', titlesize= 30)    
plt.rc('axes', labelsize= 30)
plt.rc('figure', titlesize=30)
temp={}
yy = dd['CESM2-FV2.gr.historical.Omon'].lat.y

conversion = 3.1536e7 * 83.3
# 3.1536e7 seconds per year
# 83.3 mol C per kg C (1000 g/kg / 12 g C/mol)

ax_idx = 0
for name, ds in dd.items():
    # maybe do mean over the entire circumference so just 
    # take out the where function and mean all x
    ar = -1*(ds.fgco2*A.areacello*conversion).mean(['x'], keep_attrs = True,skipna = True)
#     ar = -1*ds.fgco2.mean(['x'], keep_attrs=True, skipna = True)*conversion * A.areacello.mean(['x'])
#     ds.fgco2.where(np.logical_and(ds.lon<=200, ds.lon>=180), drop=True).isel(time = 0).mean('x', keep_attrs=True).plot(label = name, lw = 3)
    plt.plot(yy[10:60], ar[10:60], label = name, lw = 3)
    # adds ar to an array 
    temp[name] = ar
#     print(ds.fgco2.lat.y)
    plt.xlabel('Latitude')
    plt.ylabel('Surface Upward Flux of Total CO\u2082 (mol/m\u00b2/yr)')

mn = np.nanmean(list(temp.values()), axis=0)
sd = np.std(list(temp.values()), axis=0)
plt.plot(yy[10:60], mn[10:60], lw = 6, label = 'Ensemble mean', color = 'black')
plt.xlabel('Latitude')
plt.ylabel('Surface Upward Flux of Total CO\u2082 (molC/m\u00b2/yr)')

zr = xr.zeros_like(dd['CESM2-FV2.gr.historical.Omon'].fgco2)
zr.isel(x = 0)[10:60].plot(linestyle = 'dashed', color = 'black')
plt.xlabel('Latitude')
plt.ylabel('Surface Upward Flux of Total CO\u2082 (mol/m\u00b2/yr)')
# ax = plt.gca
# box = ax.get_position()
# ax.position([box.x0, box.y0, box.width * 0.8, box.height])
plt.legend(
#     bbox_to_anchor=(1.05, 1), 
    loc='upper right'
#     , borderaxespad=0.
)
plt.title('Circum Antarctic')
plt.rcParams["figure.figsize"] = (25,15)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-1d9b3b93d8a9> in <module>
     25     plt.ylabel('Surface Upward Flux of Total CO\u2082 (mol/m\u00b2/yr)')
     26 
---> 27 mn = np.nanmean(list(temp.values()), axis=0)
     28 sd = np.std(list(temp.values()), axis=0)
     29 plt.plot(yy[10:60], mn[10:60], lw = 6, label = 'Ensemble mean', color = 'black')

<__array_function__ internals> in nanmean(*args, **kwargs)

/srv/conda/envs/notebook/lib/python3.8/site-packages/numpy/lib/nanfunctions.py in nanmean(a, axis, dtype, out, keepdims)
    935 
    936     """
--> 937     arr, mask = _replace_nan(a, 0)
    938     if mask is None:
    939         return np.mean(arr, axis=axis, dtype=dtype, out=out, keepdims=keepdims)

/srv/conda/envs/notebook/lib/python3.8/site-packages/numpy/lib/nanfunctions.py in _replace_nan(a, val)
     98     if a.dtype == np.object_:
     99         # object arrays do not support `isnan` (gh-9009), so make a guess
--> 100         mask = np.not_equal(a, a, dtype=bool)
    101     elif issubclass(a.dtype.type, np.inexact):
    102         mask = np.isnan(a)

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/common.py in __bool__(self)
    128 
    129     def __bool__(self: Any) -> bool:
--> 130         return bool(self.values)
    131 
    132     def __float__(self: Any) -> float:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
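For reference, this failure mode can be reproduced without any of the CMIP6 data: np.nanmean first builds a single array from the list, and when the per-model arrays have different lengths (180 vs. 162 here) NumPy falls back to an object array (or refuses outright in newer versions), and the NaN-masking step raises this same ValueError. A minimal sketch with plain NumPy arrays, where only the shapes are taken from the traceback above:

```python
import numpy as np

a = np.zeros(180)   # a model with the full y dimension
b = np.zeros(162)   # a model whose product with areacello lost 18 points

try:
    np.nanmean([a, b], axis=0)   # ragged input: object array or hard error
except ValueError as err:
    print(type(err).__name__)    # ValueError, as in the traceback above
```

So the nanmean call is only the messenger; the real problem is that one of the arrays in temp came out shorter than the others.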
abby-baskind commented 2 years ago

Hey @gmacgilchrist I looked into this issue a bit more. I still don't know how to fix it but I know a bit more about what's going on.

So the issue was that when we multiplied ds.fgco2 by A.areacello we got this error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

I went through each of those models to see which was causing problems, and it's the GFDL models. Notably, I am using areacello from a GFDL model (I think ESM4, but it doesn't really matter because only one of the GFDL models had areacello). So I wrote this block of code to ignore the GFDL models for now and make sure everything else works.

for name, ds in dd.items():
    # maybe do mean over the entire circumference, so just
    # take out the where function and mean over all x
    if name not in ('GFDL-CM4.gr.historical.Omon', 'GFDL-ESM4.gr.historical.Omon'):
        ar = -1*(ds.fgco2*A.areacello*conversion).mean(['x'], keep_attrs=True, skipna=True)
        plt.plot(yy[10:60], ar[10:60], label=name, lw=3)
        temp[name] = ar

mn = np.nanmean(list(temp.values()), axis=0)
# sd = np.std(list(temp.values()), axis=0)
plt.plot(yy[10:60], mn[10:60], lw = 6, label = 'Ensemble mean', color = 'black')
[Screenshot, 2021-09-11: the resulting zonal-mean plot; this image does not include the GFDL models]

So what's wrong with the GFDL models? A few insights into the data. dd['GFDL-CM4.gr.historical.Omon'].fgco2.mean(['x'], keep_attrs=True, skipna=True)*conversion has dimensions y: 180. A.areacello.mean(['x'], keep_attrs=True, skipna=True) also has dimensions y: 180. However, the product of these two things has dimensions y: 162. So if we take a peek inside the product at the y values...

...-69.5, -68.5, -67.5, -66.5, -65.5, -64.5, -60.5, -59.5, -58.5, -57.5...

...we see that it's missing -63.5, -62.5, and -61.5. It's probably missing some other latitude values too, but those are the ones missing in the Southern Ocean region.
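This silent shrinking is what xarray's default arithmetic alignment does: an inner join on coordinate labels, so any y label that doesn't match exactly between the two operands is dropped rather than raising an error. A self-contained sketch with made-up latitudes standing in for the GFDL grids:

```python
import numpy as np
import xarray as xr

# two fields on nominally identical 1-degree grids; three of the
# area grid's y labels are perturbed by a tiny offset (hypothetical values)
y = np.array([-65.5, -64.5, -63.5, -62.5, -61.5, -60.5])
fgco2 = xr.DataArray(np.ones(6), coords={"y": y}, dims="y")

y_off = y.copy()
y_off[2:5] += 1e-7                     # -63.5, -62.5, -61.5 no longer match
area = xr.DataArray(np.ones(6), coords={"y": y_off}, dims="y")

product = fgco2 * area                 # inner join on y: mismatches dropped
print(product.sizes["y"])              # 3, not 6
```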

At first I thought maybe there were some NaNs that were screwing everything up, but I'm using nanmean and tried fillna() and they either throw the ValueError or give a result with the wrong dimensions. So maybe that's not the root issue???

To summarize, I have no idea how to fix this.

jbusecke commented 2 years ago

The issue is probably that some coordinate values differ between the area and the data. Can you try to assign the area as a coordinate (after dropping the actual coordinates), like this:

data = data.assign_coords(areacello=A.areacello.reset_coords(drop=True))
abby-baskind commented 2 years ago

So I tried that, implementing it like this...

A = A.assign_coords(areacello=A.areacello.reset_coords(drop=True))
A

conversion = 3.1536e7 * 83.3
# 3.1536e7 seconds per year
# 83.3 mol C per kg C (1000 g/kg / 12 g C/mol)

ax_idx = 0
for name, ds in dd.items():
        ar = -1*(ds.fgco2*A.areacello*conversion).mean(['x'], keep_attrs = True,skipna = True)
        ar
        plt.plot(yy, ar, label = name, lw = 3)

        temp[name] = ar

And got an error like this

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-c075a30e5a8c> in <module>
     22             ar = -1*(ds.fgco2*A.areacello*conversion).mean(['x'], keep_attrs = True,skipna = True)
     23             ar
---> 24             plt.plot(yy, ar, label = name, lw = 3)
     25 
     26             temp[name] = ar

/srv/conda/envs/notebook/lib/python3.8/site-packages/matplotlib/pyplot.py in plot(scalex, scaley, data, *args, **kwargs)
   2986 @_copy_docstring_and_deprecators(Axes.plot)
   2987 def plot(*args, scalex=True, scaley=True, data=None, **kwargs):
-> 2988     return gca().plot(
   2989         *args, scalex=scalex, scaley=scaley,
   2990         **({"data": data} if data is not None else {}), **kwargs)

/srv/conda/envs/notebook/lib/python3.8/site-packages/matplotlib/axes/_axes.py in plot(self, scalex, scaley, data, *args, **kwargs)
   1603         """
   1604         kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D)
-> 1605         lines = [*self._get_lines(*args, data=data, **kwargs)]
   1606         for line in lines:
   1607             self.add_line(line)

/srv/conda/envs/notebook/lib/python3.8/site-packages/matplotlib/axes/_base.py in __call__(self, data, *args, **kwargs)
    313                 this += args[0],
    314                 args = args[1:]
--> 315             yield from self._plot_args(this, kwargs)
    316 
    317     def get_next_color(self):

/srv/conda/envs/notebook/lib/python3.8/site-packages/matplotlib/axes/_base.py in _plot_args(self, tup, kwargs, return_kwargs)
    499 
    500         if x.shape[0] != y.shape[0]:
--> 501             raise ValueError(f"x and y must have same first dimension, but "
    502                              f"have shapes {x.shape} and {y.shape}")
    503         if x.ndim > 2 or y.ndim > 2:

ValueError: x and y must have same first dimension, but have shapes (180,) and (162,)

The error this time is different, but it still stems from the product of A.areacello and fgco2 in the GFDL models having coordinates of y: 162.

[Screenshot, 2021-09-13]
jbusecke commented 2 years ago

Ok so the problem must originate somewhere earlier. I would go back to where these areas are first loaded and then check at which point the length changes.

gmacgilchrist commented 2 years ago

Yeah, the issue here is a misalignment of coordinates: xarray will only allow variables that sit at the exact same coordinate positions to be operated on (added, multiplied, divided, etc.) together. So, the coordinates in the dataset A must be slightly misaligned with those of the data for one or both of the GFDL models - this much you already discovered (great work, by the way).

The GFDL models are on the gr grid correct? Meaning that the interpolation was done "in house" before that data was uploaded to the cloud. So, it's quite possible that there are very slight differences between the y coordinates on the lat-lon grid. For example, if the dataset from which you use areacello has a y coordinate of -63.5 at some point, but for some reason the GFDL model at the same point has a y coordinate of -63.5000001, xarray will not multiply the corresponding values together. This coordinate would then be missing from your output array - here, you're seeing that this is the case for 18 points (180-162).

A couple of questions for @jbusecke :

  1. Do you do anything to ensure alignment of coordinates for gr datasets in cmip6_preprocessing?
  2. Does xarray allow you to define a "tolerance" for coordinate matching?

A couple of questions / suggestions for @abby-baskind :

  1. Where does the dataset A come from? You implied it was saved from a GFDL model... it would therefore be surprising that it seems to match all of the models except GFDL.
  2. You can always apply an assign_coords step to make sure that everything matches in the way that you want. This is usually ill-advised, because xarray's refusal to allow coordinate mismatching is one of its great strengths, but here it is pretty low risk since we know that all of the data are on a 1x1 lat-lon grid.
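That second suggestion could look something like the following sketch (synthetic one-dimensional arrays; overwriting the area's y labels with the data's is only safe because both are nominal 1x1 grids):

```python
import numpy as np
import xarray as xr

# hypothetical mismatch: one y label in the area file is off by 1e-7 degrees
data = xr.DataArray(np.ones(4), coords={"y": [-63.5, -62.5, -61.5, -60.5]}, dims="y")
area = xr.DataArray(np.ones(4), coords={"y": [-63.5000001, -62.5, -61.5, -60.5]}, dims="y")

print((data * area).sizes["y"])        # 3: the offset label is silently dropped

area = area.assign_coords(y=data.y)    # force the labels to match the data's
print((data * area).sizes["y"])        # 4: full alignment restored
```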
abby-baskind commented 2 years ago

@gmacgilchrist A is from GFDL-ESM4.gr.historical.Ofx. And fgco2 from both GFDL-CM4.gr.historical.Omon and GFDL-ESM4.gr.historical.Omon are gr, so they came regridded. So yes, very wacky that A does not work with exclusively the GFDL models.

One idea I just had is regridding either A, or GFDL-CM4.gr.historical.Omon & GFDL-ESM4.gr.historical.Omon the same way I regridded the gn models, since A did work with the gn models. As a reminder, here's how I did the regridding for gn models...

targetgrid_ds = xe.util.grid_global(1.0, 1.0)
targetgrid_ds['lon'] = targetgrid_ds['lon']+180
targetgrid_ds['lon_b'] = targetgrid_ds['lon_b']+180
newcoords = {'x':targetgrid_ds['lon'][0,:],'y':targetgrid_ds['lat'][:,0]}

dd_regrid={}
for name,item in dd_gn.items():
    regridder = xe.Regridder(item, targetgrid_ds, 'bilinear', 
                         periodic=True, ignore_degenerate=True)
    ds_regridded = regridder(item).assign_coords(**newcoords).chunk({'time':120,'lev':1})
    dd_regrid[name]=ds_regridded

I haven't thought about this method too critically, so it might not be a great idea, but it also might not be a terrible idea.

gmacgilchrist commented 2 years ago

I think it would be best to avoid any unnecessary regridding - both from a computational perspective and a "minimizing steps" perspective.

It is concerning and surprising that areacello from a GFDL model seems to have different coordinates to that same GFDL model. Can you try to strip back your code to simply load fgco2 and areacello for this model and multiply them together? Do you get the same error or coordinate reduction? If so, there is an error with the data. If not, then you are doing something to these data within your analysis that is misaligning them, and that would be worth tracking down.
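A quick way to narrow that down, once fgco2 and areacello are loaded, is to diff the y labels of the two variables directly. A self-contained sketch with a simulated 1-degree grid (the perturbed indices are made up so as to mirror the missing Southern Ocean latitudes above):

```python
import numpy as np

y_data = np.arange(-89.5, 90.0, 1.0)   # nominal 1-degree latitude labels
y_area = y_data.copy()
y_area[26:29] += 1e-7                  # simulate tiny offsets in the area file

# labels present in the data grid but absent from the area grid
mismatched = np.setdiff1d(y_data, y_area)
print(mismatched)                      # [-63.5 -62.5 -61.5]
```

With the real datasets you would compare ds.fgco2.y.values against A.areacello.y.values the same way; a non-empty result pinpoints exactly which labels disagree.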

jbusecke commented 2 years ago

It is concerning and surprising that areacello from a GFDL model seems to have different coordinates to that same GFDL model.

Concerning, yet totally not atypical 🤪

A = A.assign_coords(areacello=A.areacello.reset_coords(drop=True))
A

I think at this point the 'damage' to areacello has already been done (since it is already a coord of A). Can you go back and check the earliest time you load areacello?

Do you do anything to ensure alignment of coordinates for gr datasets in cmip6_preprocessing?

Yes, I have had many examples of such problems and that is why match_metrics has some logic built in to handle these occasions. I am not sure how much hassle it is to rewrite the code to use it, but it might be worth considering (see here for examples; basically you want to load the variables and data into separate dicts at the beginning and then parse the area as an early step). Alternatively you could use the hidden function _parse_metric, which handles all the logic under the hood, but it can be applied directly to a pair of dataset/metric.

Does xarray allow you to define a "tolerance" for coordinate matching?

I think there is something like that, but I have not used it. I usually opt to drop the coordinate labels (as suggested before).
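For what it's worth, xarray's reindex/reindex_like do accept a tolerance together with method='nearest', which can snap near-miss labels onto a reference grid instead of dropping them. A sketch with synthetic coordinates (the 1e-3 degree tolerance is an arbitrary choice for illustration):

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.ones(3), coords={"y": [-63.5, -62.5, -61.5]}, dims="y")
area = xr.DataArray(np.ones(3), coords={"y": [-63.5000001, -62.5, -61.5]}, dims="y")

# snap area onto data's y labels, accepting offsets up to 1e-3 degrees
area_snapped = area.reindex_like(data, method="nearest", tolerance=1e-3)
print((data * area_snapped).sizes["y"])   # 3: nothing is dropped anymore
```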

gmacgilchrist commented 2 years ago

According to @abby-baskind, this issue went away of its own accord, so we may never know the source on this occasion. My guess is some error in the IO, or at some point in the processing of A.