abby-baskind / seniorthesis


sad because the kernel is shutting down :'( #15

Open · abby-baskind opened this issue 2 years ago

abby-baskind commented 2 years ago

TL;DR: My kernel is shutting down. I think I need to regrid over lev so everything has 45 depth levels instead of 75. How do I do that?

So my current regridding method is very similar to what @jbusecke helped me with. The only difference is that I reassigned the y coordinates, so y runs from -90 to 90 instead of 0 to 180.

import xesmf as xe

# 1x1 degree global target grid; shift y so it runs from -90 to 90 instead of 0 to 180
targetgrid_ds = xe.util.grid_global(1.0, 1.0)
new_y = targetgrid_ds.y - 90

dd_regrid = {}
for name, item in dd_new_new_gn.items():
    regridder = xe.Regridder(item, targetgrid_ds, 'bilinear',
                             periodic=True, ignore_degenerate=True)
    ds_regridded = regridder(item)
    dd_regrid[name] = ds_regridded.assign_coords({'y': new_y})
list(dd_regrid.keys())

But I'm having an issue with the kernel shutting down when I calculate PpCO2 for models with 75 depth levels. Within the function, the issue comes up specifically with pyco2.sys (i.e. I can successfully get p and insitutemp, but not results). The issue doesn't come up when I don't average over time (I have an example of the output without the time average, so if you want to see that for reference, just let me know).

import gsw
import PyCO2SYS as pyco2

def calc_PpCO2(ds):
    # pressure from depth; the zeros for geo_strf_dyn_height and sea_surface_geopotential are just fillers rn
    p = gsw.p_from_z(-1*ds['lev'], ds['y'], geo_strf_dyn_height=0, sea_surface_geopotential=0)
    insitutemp = gsw.t_from_CT(ds['so'], ds['thetao'], p)
    conversion = 1e6/1035  # mol m-3 -> umol kg-1, using a reference density of ~1035 kg m-3
    results = pyco2.sys(par1=ds.talk*conversion, par2=ds.dissic*conversion,
                        par1_type=1, par2_type=2,
                        pressure_out=0, temperature_out=ds.thetao,
                        pressure=p, temperature=insitutemp)
    return ds['talk'].copy(data=results['pCO2_out'])

I tried to make the datasets a bit smaller by averaging over time and x first and selecting only a subset of y.

dd_pco2= {}
for name, ds_pco2 in dd_new_new.items():

    # so this if statement is because the gn and gr datasets have different longitudes 
    # so this is just a dumb hack to get around that

    if np.max(ds_pco2.lon) > 180:
        dd_pco2[name] = ds_pco2.isel(time = slice(0, 239)).isel(y=slice(0,61)).sel(x=slice(180,200)).mean(['time', 'x'],keep_attrs=True)
    else:
        dd_pco2[name] = ds_pco2.isel(time = slice(0, 239)).isel(y=slice(0,61)).sel(x=slice(0,20)).mean(['time', 'x'],keep_attrs=True)

So if I implement this for a dataset like CanESM5-CanOE.gn.historical.Omon, which has dimensions lev: 45, y: 180, calc_PpCO2(dd_pco2['CanESM5-CanOE.gn.historical.Omon']) runs without shutting down the kernel (I'm using the large 16GB kernel). It does take forever to run, so that's inconvenient, but I can deal.

But for a dataset like IPSL-CM6A-LR.gn.historical.Omon, which has dimensions lev: 75, y: 180, calc_PpCO2(dd_pco2['IPSL-CM6A-LR.gn.historical.Omon']) uses more than 16GB and shuts down the kernel. This also happens for UKESM1-0-LL.gn.historical.Omon and CNRM-ESM2-1.gn.historical.Omon, which also have dimensions lev: 75, y: 180.

I think the datasets with 75 depth levels are simply too big for the kernel. Maybe I'm wrong, but I think I can solve this by regridding over the z dimension so everything has 45 (or fewer) levels. I looked through the xESMF documentation and some GitHub issues for a way to do this and found no solutions that made sense (I think some of the issues proposed solutions, but they were very confusing for a baby coder like me).
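
Is it maybe as simple as plain xarray interpolation over lev? This is a total guess (target_levels below is made up, just a 1-D array of the depths I'd want), so please tell me if it's the wrong tool entirely:

import numpy as np

# total guess: interpolate one of the 75-level datasets onto a made-up set of target depths
target_levels = np.array([0., 10., 20., 30., 50., 75., 100., 150., 200., 300., 500., 1000.])
ds = dd_pco2['IPSL-CM6A-LR.gn.historical.Omon']
ds_coarse = ds.interp(lev=target_levels, method='linear')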

So, my questions: Do you think the extra depth levels are the reason the kernel is shutting down? And how do I regrid over lev? Also, if you have ideas for making the other datasets (the ones that run but take forever) more efficient, I'd appreciate that, but that's less important.

P.S. @gmacgilchrist This is the issue I mentioned on Wednesday that I was having with my polar stereographic projections, but now it's happening with my slices. So I love that for me...

abby-baskind commented 2 years ago

I'm just going to go ahead and put the example without the time average here in the comments, because I can't do anything else until I fix this issue, so I have nothing better to do. So here it is

ax_idx = 0 # index for your axes array
for name, ds_pco2 in dd_new_new.items():
    ax = axarr_pco2.flat[ax_idx]

    if np.max(ds_pco2.lon) > 180:
        ds_pco2 = ds_pco2.isel(time=0).sel(x=slice(180,200)).mean('x',keep_attrs=True)
    else:
        ds_pco2 = ds_pco2.isel(time=0).sel(x=slice(0,20)).mean('x',keep_attrs=True)

    pco2 = calc_PpCO2(ds_pco2)
    sigma2 = calc_sigma2(ds_pco2)
    meridionalsection_with_sigma2(ax,pco2,sigma2, clims=[0,1000],title=name, 
                                  label = 'Seawater Partial Pressure of CO2 (\u03BCatm)')
    ax_idx += 1 
[Screenshot attached: Screen Shot 2021-08-14 at 10.03.02 AM]
gmacgilchrist commented 2 years ago

Hi @abby-baskind I'll take a look later today. I suspect that the potential pCO2 calculation does not play nice with dask.
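
A quick way to check that (just a rough sketch) is to see whether the output of calc_PpCO2 is still a lazy dask array or has already been pulled into memory:

import dask.array as dsa

# if this prints False, the pCO2 calculation has forced the (lazy) dask data into
# plain numpy, i.e. everything upstream got computed and loaded
out = calc_PpCO2(dd_pco2['CanESM5-CanOE.gn.historical.Omon'])
print(isinstance(out.data, dsa.Array))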

The plot here (without time-averaging) looks excellent!

abby-baskind commented 2 years ago

Hey @gmacgilchrist, here's a link to the code. I tried to annotate it well so you know what's going on: https://github.com/abby-baskind/seniorthesis/blob/73b57110a67e8231ccb3c3e649a500e7ab3dccc5/notebooks/eventually_all_PCO2.ipynb

abby-baskind commented 2 years ago

So a bit of an update on this issue. I tried to cut the dataset down a bit with some slices and linear interpolation over depth with xgcm.

Like I did previously, I took some slices and time averages first...

dd_pco2= {}
for name, ds_pco2 in dd_regrid.items():

    if np.max(ds_pco2.lon) > 180:
        dd_pco2[name] = ds_pco2.isel(time = slice(0, 239)).isel(y=slice(0,61)).sel(x=slice(180,200)).mean(['time', 'x'],keep_attrs=True)
    else:
        dd_pco2[name] = ds_pco2.isel(time = slice(0, 239)).isel(y=slice(0,61)).sel(x=slice(0,20)).mean(['time', 'x'],keep_attrs=True)

The output looks like this

'IPSL-CM6A-LR.gn.historical.Omon': <xarray.Dataset>
 Dimensions:     (bnds: 2, lev: 75, y: 61)
 Coordinates:
   * lev         (lev) float32 0.5058 1.556 2.668 ... 5.698e+03 5.902e+03
     lev_bounds  (lev, bnds) float32 dask.array<chunksize=(75, 2), meta=np.ndarray>
   * y           (y) int64 -90 -89 -88 -87 -86 -85 ... -35 -34 -33 -32 -31 -30
 Dimensions without coordinates: bnds
 Data variables:
     area        (y) float64 dask.array<chunksize=(61,), meta=np.ndarray>
     talk        (lev, y) float64 dask.array<chunksize=(75, 61), meta=np.ndarray>
     so          (lev, y) float64 dask.array<chunksize=(75, 61), meta=np.ndarray>
     fgco2       (y) float64 dask.array<chunksize=(61,), meta=np.ndarray>
     thetao      (lev, y) float64 dask.array<chunksize=(75, 61), meta=np.ndarray>
     dissic      (lev, y) float64 dask.array<chunksize=(75, 61), meta=np.ndarray>
 Attributes:
     regrid_method:  bilinear,

Then I regridded lev with xgcm to get 33 levels (using dd_new_new_gr['CESM2-FV2.gr.historical.Omon'].lev as the target, because that dataset didn't shut the kernel down). Admittedly, this block of code could probably be written more gracefully, but I don't think its clunkiness is a major reason the kernel is still shutting down.

from xgcm import Grid

target_depth_levels = dd_new_new_gr['CESM2-FV2.gr.historical.Omon'].lev
dd_regrid_new = {}

for name, ds in dd_pco2.items():
    # build the vertical grid for this dataset, then interpolate each variable onto
    # the 33 target depth levels (note: the result is a plain dict of DataArrays
    # per model, not an xarray Dataset)
    grid = Grid(ds, coords={'Z': {'center': 'lev'}}, periodic=False)
    dd_regrid_new[name] = {}
    dd_regrid_new[name]['lev'] = target_depth_levels
    dd_regrid_new[name]['y'] = ds.y
    dd_regrid_new[name]['so'] = grid.transform(ds.so, 'Z', target_depth_levels, method='linear')
    dd_regrid_new[name]['thetao'] = grid.transform(ds.thetao, 'Z', target_depth_levels, method='linear')
    dd_regrid_new[name]['talk'] = grid.transform(ds.talk, 'Z', target_depth_levels, method='linear')
    dd_regrid_new[name]['dissic'] = grid.transform(ds.dissic, 'Z', target_depth_levels, method='linear')

After all of this, this is what the output looks like for dd_regrid_new['IPSL-CM6A-LR.gn.historical.Omon']

{'lev': <xarray.DataArray 'lev' (lev: 33)>
 array([   0.,   10.,   20.,   30.,   50.,   75.,  100.,  125.,  150.,  200.,
         250.,  300.,  400.,  500.,  600.,  700.,  800.,  900., 1000., 1100.,
        1200., 1300., 1400., 1500., 1750., 2000., 2500., 3000., 3500., 4000.,
        4500., 5000., 5500.])
 Coordinates:
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03
 Attributes:
     axis:           Z
     bounds:         lev_bnds
     long_name:      ocean model level
     positive:       down
     standard_name:  olevel
     units:          m,
 'y': <xarray.DataArray 'y' (y: 61)>
 array([-90, -89, -88, -87, -86, -85, -84, -83, -82, -81, -80, -79, -78, -77,
        -76, -75, -74, -73, -72, -71, -70, -69, -68, -67, -66, -65, -64, -63,
        -62, -61, -60, -59, -58, -57, -56, -55, -54, -53, -52, -51, -50, -49,
        -48, -47, -46, -45, -44, -43, -42, -41, -40, -39, -38, -37, -36, -35,
        -34, -33, -32, -31, -30])
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30,
 'so': <xarray.DataArray 'so' (y: 61, lev: 33)>
 dask.array<transpose, shape=(61, 33), dtype=float64, chunksize=(61, 33), chunktype=numpy.ndarray>
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03,
 'thetao': <xarray.DataArray 'thetao' (y: 61, lev: 33)>
 dask.array<transpose, shape=(61, 33), dtype=float64, chunksize=(61, 33), chunktype=numpy.ndarray>
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03,
 'talk': <xarray.DataArray 'talk' (y: 61, lev: 33)>
 dask.array<transpose, shape=(61, 33), dtype=float64, chunksize=(61, 33), chunktype=numpy.ndarray>
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03,
 'dissic': <xarray.DataArray 'dissic' (y: 61, lev: 33)>
 dask.array<transpose, shape=(61, 33), dtype=float64, chunksize=(61, 33), chunktype=numpy.ndarray>
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03}

But the kernel still shuts down when I run calc_PpCO2(dd_regrid_new['IPSL-CM6A-LR.gn.historical.Omon']).

Notably, my WiFi is on the fritz again (love that for me...), so I haven't done any debugging beyond this and haven't tried saving any data locally. I also didn't publish this on my GitHub before the WiFi quit on me...

TL;DR this issue is way more complicated than I thought

gmacgilchrist commented 2 years ago

Yeah, the kernel is shutting down because at some point your code is trying to load far too much into memory. I can't tell exactly why that is happening at the moment. There is definitely a forced load happening in calc_PpCO2, but I'm not sure why it is taking place even when you are feeding it a reduced dataset. I think it is probably because the machine tries to execute your other functions (e.g. the transformation) within the calculation of PpCO2.
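
One quick sanity check (sketch only) is to ask xarray how big the reduced input would be once fully loaded. If that number is tiny, as I expect, then the memory blow-up is coming from executing the upstream regridding/slicing graph inside the calculation rather than from the size of the final arrays:

# sketch: estimated fully loaded size of the reduced input, in GB
ds_in = dd_pco2['IPSL-CM6A-LR.gn.historical.Omon']
print(ds_in.nbytes / 1e9)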

I have posted an issue at PyCO2SYS to ask it to play nicer with xarray, but I don't think we should bank on this happening (it could be a simple fix, or it could require a complete overhaul of the package, I don't know).
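
In the meantime, one possible workaround is to wrap the PyCO2SYS call in xr.apply_ufunc so that dask only ever hands it one chunk of plain numpy at a time. The sketch below is untested (and calc_PpCO2_lazy / _pco2_numpy are just names I made up), but it keeps the same inputs and conversions as your calc_PpCO2:

import gsw
import xarray as xr
import PyCO2SYS as pyco2

def _pco2_numpy(talk, dissic, temp_insitu, theta, pres):
    # plain-numpy core: PyCO2SYS only ever sees one chunk at a time
    res = pyco2.sys(par1=talk, par2=dissic, par1_type=1, par2_type=2,
                    temperature=temp_insitu, pressure=pres,
                    temperature_out=theta, pressure_out=0)
    return res['pCO2_out']

def calc_PpCO2_lazy(ds):
    # same physics as calc_PpCO2, but the carbonate-system step stays lazy and chunk-wise
    p = gsw.p_from_z(-1 * ds['lev'], ds['y'],
                     geo_strf_dyn_height=0, sea_surface_geopotential=0)
    insitutemp = gsw.t_from_CT(ds['so'], ds['thetao'], p)
    conversion = 1e6 / 1035
    return xr.apply_ufunc(
        _pco2_numpy,
        ds['talk'] * conversion, ds['dissic'] * conversion,
        insitutemp, ds['thetao'], p,
        dask='parallelized', output_dtypes=[float],
    )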

I also need to investigate running your code in a more distributed manner. At the moment you are doing everything (as far as I can tell) on a single processor, which could be the source of the memory issues. Distributing the work might help, but I have not yet tried to do this on Pangeo, so I will have to look into it.
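
For reference, the usual pattern (again, just a sketch; I have not tried it in this setup, and the worker count below is arbitrary) is to spin up a Dask cluster and connect a client, after which dask-backed xarray operations are spread across the workers:

# on the Pangeo cloud deployments this usually goes through dask-gateway;
# a local fallback would be `from dask.distributed import Client; client = Client()`
from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(4)               # arbitrary number of workers
client = cluster.get_client()  # dask-backed xarray operations now run on the cluster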

There is a "quick and dirty" option here, which is to use for loops to do the necessary calculations at each time increment (and/or longitude point or depth level) and then average afterwards. This is the usual approach in languages like MATLAB, where there is a maximum (and rather small) amount of memory that can be loaded at any given time. We generally try to avoid this in Python, but sometimes it is necessary. So the pseudo-code would be something like:

# loop through models
answer = {}
for model in models:
    ds = dd[model]
    # loop through times
    tmp = {}
    for t in times:
        # perform the calculation at each time and assign it into a temporary variable
        tmp[t] = function(ds.sel(time=t))
    # calculate the mean over time for this model
    answer[model] = xr.concat(list(tmp.values()), dim='time').mean('time')

In this configuration, you only ever load the dataset at a single timestep, inside the function.

As a general comment, one of your issues might be that you are creating lots of different objects with lots of different names (dd, dd_new, dd_new_new, dd_regrid_new). That means that each of those objects is still being kept in memory. Now I don't think this is the major issue that you're facing, because of the lazy computation aspect of xarray (none of the data associated with these objects are actually loaded into memory), but it is generally better practice to only keep the objects that you actively need in your script, overwriting those that are no longer necessary.
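
If you do want to free some of those intermediates explicitly, a minimal sketch would be something like the following (which names are safe to drop depends on what you still use later):

import gc

# drop references to dictionaries that are no longer needed, then let Python
# reclaim whatever memory they were holding on to
del dd_new, dd_new_new
gc.collect()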

abby-baskind commented 2 years ago

hey @gmacgilchrist, I implemented all your suggestions and tried it out with just one model (specifically one of the models that shut down the kernel). The relevant code looks like this

ds = dd['IPSL-CM6A-LR.gn.historical.Omon']
tmp = {}

# loop over the first 239 monthly timesteps (~20 years), computing pCO2 one step at a time
for t in ds.time[0:239]:
    pco2 = calc_PpCO2(ds.sel(time=t).sel(x=slice(0, 20)).mean('x', keep_attrs=True))
    tmp[str(t.values)] = pco2
mn = np.nanmean(list(tmp.values()), axis=0)
mn

The output for the 20-year average more or less looks good. The attributes of the dataset got lost, which is inconvenient, but I can certainly work around it.

array([[         nan,          nan,          nan, ..., 303.0522884 ,
        298.3220031 , 292.51188085],
       [         nan,          nan,          nan, ..., 303.06333186,
        298.33279287, 292.52428003],
       [         nan,          nan,          nan, ..., 303.09897837,
        298.363844  , 292.55589225],
       ...,
       [         nan,          nan,          nan, ...,          nan,
                 nan,          nan],
       [         nan,          nan,          nan, ...,          nan,
                 nan,          nan],
       [         nan,          nan,          nan, ...,          nan,
                 nan,          nan]])

I made a super rough plot and it looks good.

[Screenshot attached: Screen Shot 2021-08-22 at 4.26.46 PM]

So good news and bad news. The good news is that the kernel did not shut down. I kept an eye on the memory while it ran and it never used more than 3.5 GB. The bad news: it took over an hour to run, just for the one model.

As I'm writing this comment, I'm running a block of code for all of the models

pco2_mn = {}
for name, ds in dd.items():
    tmp = {}
    for t in ds.time[0:239]:
        # same longitude hack as before: which x band to use depends on the dataset's longitude convention
        if np.max(ds.lon) > 180:
            pp = calc_PpCO2(ds.sel(time=t).sel(x=slice(180, 200)).mean('x', keep_attrs=True))
        else:
            pp = calc_PpCO2(ds.sel(time=t).sel(x=slice(0, 20)).mean('x', keep_attrs=True))
        tmp[str(t.values)] = pp
    pco2_mn[name] = np.nanmean(list(tmp.values()), axis=0)

It will probably take more than an hour to run, so no matter what the output is (even if it looks wrong), I'm saving it.
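
I'll probably just dump each model's array to disk with something like this (untested, and the filename pattern is made up), so nothing gets lost if the kernel or the WiFi dies again:

import numpy as np

# untested: save each model's time-mean pCO2 array to disk
for name, arr in pco2_mn.items():
    np.save(f'pco2_mn_{name}.npy', arr)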

gmacgilchrist commented 2 years ago

Great work! It's great that it's going through without crashing the kernel - that's a good start. The slowness is likely a result of looping in time - that's why we generally avoid it, but sometimes it's a necessary evil. The loop is also probably why you are losing your attribute data: if you keep careful track of what is being placed where during the for loop, you should be able to copy the attributes across into your new output.
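
For example, one way to do that (a sketch, assuming ds is the single model from your snippet above and that calc_PpCO2 returns a DataArray) is to keep the per-timestep results as DataArrays and average them with xarray instead of numpy, so the coordinates and attributes come along for free:

import xarray as xr

# collect the per-timestep DataArrays, then average with xarray so coordinates
# and attributes are carried through (instead of np.nanmean on a list of arrays)
pco2_list = [calc_PpCO2(ds.sel(time=t).sel(x=slice(0, 20)).mean('x', keep_attrs=True))
             for t in ds.time[0:239]]
pco2_mean = xr.concat(pco2_list, dim='time').mean('time', keep_attrs=True)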

I will take a look at ways of improving the speed.

abby-baskind commented 2 years ago

@gmacgilchrist If we can find a way to improve the speed, that would be great. When I was running it last night, my WiFi crashed, so I'm running it again this morning, and it's been running for 3 hours.