Open abby-baskind opened 2 years ago
I'm just going to put the example without the time average here in the comments, since I can't do anything else until I fix this issue. So here it is:
```python
ax_idx = 0  # index for your axes array
for name, ds_pco2 in dd_new_new.items():
    ax = axarr_pco2.flat[ax_idx]
    if np.max(ds_pco2.lon) > 180:
        ds_pco2 = ds_pco2.isel(time=0).sel(x=slice(180, 200)).mean('x', keep_attrs=True)
    else:
        ds_pco2 = ds_pco2.isel(time=0).sel(x=slice(0, 20)).mean('x', keep_attrs=True)
    pco2 = calc_PpCO2(ds_pco2)
    sigma2 = calc_sigma2(ds_pco2)
    meridionalsection_with_sigma2(ax, pco2, sigma2, clims=[0, 1000], title=name,
                                  label='Seawater Partial Pressure of CO2 (\u03BCatm)')
    ax_idx += 1
```
Hi @abby-baskind, I'll take a look later today. I suspect that the potential pCO2 calculation does not play nice with dask.
The plot here (without time-averaging) looks excellent!
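To illustrate what "playing nice with dask" would look like: a function that only understands NumPy arrays (as PyCO2SYS does internally) can be mapped over xarray variables with `xr.apply_ufunc`, which keeps dask-backed inputs lazy instead of forcing them into memory. This is a minimal sketch; `pointwise_pco2` is a made-up stand-in with placeholder arithmetic, not real carbonate chemistry.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a NumPy-only routine like PyCO2SYS's solver
# (placeholder arithmetic, not real chemistry).
def pointwise_pco2(talk, dissic):
    return (dissic / talk) * 400.0

ds = xr.Dataset({
    "talk":   ("y", np.array([2.30, 2.40, 2.35])),
    "dissic": ("y", np.array([2.00, 2.10, 2.05])),
})

# apply_ufunc maps the NumPy-only function over the xarray variables;
# if the inputs are dask-backed, dask="parallelized" keeps the result
# lazy and chunked instead of loading everything at once.
pco2 = xr.apply_ufunc(
    pointwise_pco2, ds["talk"], ds["dissic"],
    dask="parallelized", output_dtypes=[float],
)
```

With NumPy-backed inputs (as here) the call just evaluates eagerly; the `dask="parallelized"` argument only matters once the dataset is chunked.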
hey @gmacgilchrist, here's a link to the code. I tried to annotate it well so you know what's going on: https://github.com/abby-baskind/seniorthesis/blob/73b57110a67e8231ccb3c3e649a500e7ab3dccc5/notebooks/eventually_all_PCO2.ipynb
So a bit of an update on this issue. I tried to cut the dataset down a bit with some slices and linear interpolation over depth with xgcm.
Like I did previously, I took some slices and time averages first...
```python
dd_pco2 = {}
for name, ds_pco2 in dd_regrid.items():
    if np.max(ds_pco2.lon) > 180:
        dd_pco2[name] = ds_pco2.isel(time=slice(0, 239)).isel(y=slice(0, 61)).sel(x=slice(180, 200)).mean(['time', 'x'], keep_attrs=True)
    else:
        dd_pco2[name] = ds_pco2.isel(time=slice(0, 239)).isel(y=slice(0, 61)).sel(x=slice(0, 20)).mean(['time', 'x'], keep_attrs=True)
```
The output looks like this:

```
'IPSL-CM6A-LR.gn.historical.Omon': <xarray.Dataset>
Dimensions:     (bnds: 2, lev: 75, y: 61)
Coordinates:
  * lev         (lev) float32 0.5058 1.556 2.668 ... 5.698e+03 5.902e+03
    lev_bounds  (lev, bnds) float32 dask.array<chunksize=(75, 2), meta=np.ndarray>
  * y           (y) int64 -90 -89 -88 -87 -86 -85 ... -35 -34 -33 -32 -31 -30
Dimensions without coordinates: bnds
Data variables:
    area        (y) float64 dask.array<chunksize=(61,), meta=np.ndarray>
    talk        (lev, y) float64 dask.array<chunksize=(75, 61), meta=np.ndarray>
    so          (lev, y) float64 dask.array<chunksize=(75, 61), meta=np.ndarray>
    fgco2       (y) float64 dask.array<chunksize=(61,), meta=np.ndarray>
    thetao      (lev, y) float64 dask.array<chunksize=(75, 61), meta=np.ndarray>
    dissic      (lev, y) float64 dask.array<chunksize=(75, 61), meta=np.ndarray>
Attributes:
    regrid_method:  bilinear
```
Then I regridded `lev` with xgcm to get 33 `lev` levels (I'm using `dd_new_new_gr['CESM2-FV2.gr.historical.Omon'].lev` as the target because that dataset didn't shut the kernel down). Admittedly, this block of code could probably be written a little more gracefully, but I don't think its clunkiness is a major reason the kernel is still shutting down.
```python
target_depth_levels = dd_new_new_gr['CESM2-FV2.gr.historical.Omon'].lev
dd_regrid_new = {}
for name, ds in dd_pco2.items():
    # build the grid from each dataset, so 'Z' refers to this model's levels
    grid = Grid(ds, coords={'Z': {'center': 'lev'}}, periodic=False)
    dd_regrid_new[name] = {}
    dd_regrid_new[name]['lev'] = target_depth_levels
    dd_regrid_new[name]['y'] = ds.y
    # dd_regrid_new[name]['x'] = ds.x
    dd_regrid_new[name]['so'] = grid.transform(ds.so, 'Z', target_depth_levels, target_data=None, method='linear')
    dd_regrid_new[name]['thetao'] = grid.transform(ds.thetao, 'Z', target_depth_levels, target_data=None, method='linear')
    dd_regrid_new[name]['talk'] = grid.transform(ds.talk, 'Z', target_depth_levels, target_data=None, method='linear')
    dd_regrid_new[name]['dissic'] = grid.transform(ds.dissic, 'Z', target_depth_levels, target_data=None, method='linear')
```
After all of this, here's what the output looks like for `dd_regrid_new['IPSL-CM6A-LR.gn.historical.Omon']`:

```
{'lev': <xarray.DataArray 'lev' (lev: 33)>
 array([   0.,   10.,   20.,   30.,   50.,   75.,  100.,  125.,  150.,  200.,
         250.,  300.,  400.,  500.,  600.,  700.,  800.,  900., 1000., 1100.,
        1200., 1300., 1400., 1500., 1750., 2000., 2500., 3000., 3500., 4000.,
        4500., 5000., 5500.])
 Coordinates:
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03
 Attributes:
     axis:           Z
     bounds:         lev_bnds
     long_name:      ocean model level
     positive:       down
     standard_name:  olevel
     units:          m,
 'y': <xarray.DataArray 'y' (y: 61)>
 array([-90, -89, -88, -87, -86, -85, -84, -83, -82, -81, -80, -79, -78, -77,
        -76, -75, -74, -73, -72, -71, -70, -69, -68, -67, -66, -65, -64, -63,
        -62, -61, -60, -59, -58, -57, -56, -55, -54, -53, -52, -51, -50, -49,
        -48, -47, -46, -45, -44, -43, -42, -41, -40, -39, -38, -37, -36, -35,
        -34, -33, -32, -31, -30])
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30,
 'so': <xarray.DataArray 'so' (y: 61, lev: 33)>
 dask.array<transpose, shape=(61, 33), dtype=float64, chunksize=(61, 33), chunktype=numpy.ndarray>
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03,
 'thetao': <xarray.DataArray 'thetao' (y: 61, lev: 33)>
 dask.array<transpose, shape=(61, 33), dtype=float64, chunksize=(61, 33), chunktype=numpy.ndarray>
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03,
 'talk': <xarray.DataArray 'talk' (y: 61, lev: 33)>
 dask.array<transpose, shape=(61, 33), dtype=float64, chunksize=(61, 33), chunktype=numpy.ndarray>
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03,
 'dissic': <xarray.DataArray 'dissic' (y: 61, lev: 33)>
 dask.array<transpose, shape=(61, 33), dtype=float64, chunksize=(61, 33), chunktype=numpy.ndarray>
 Coordinates:
   * y        (y) int64 -90 -89 -88 -87 -86 -85 -84 ... -35 -34 -33 -32 -31 -30
   * lev      (lev) float64 0.0 10.0 20.0 30.0 ... 4e+03 4.5e+03 5e+03 5.5e+03}
```
But the kernel still shuts down when I run `calc_PpCO2(dd_regrid_new['IPSL-CM6A-LR.gn.historical.Omon'])`.

Notably, my WiFi is on the fritz again (love that for me...), so I haven't done any debugging beyond this and haven't tried saving any data locally. I also didn't publish this on my GitHub before the WiFi quit on me...
TL;DR this issue is way more complicated than I thought
Yeah, the kernel is shutting down because at some point your code is trying to load far too much into memory. I can't tell exactly why that is happening at the moment. There is definitely a forced load happening in `calc_PpCO2`, but I'm not sure why that is taking place even when you are feeding it a reduced dataset. I think it is probably because the machine tries to execute your other functions (e.g. the transformation) within the calculation of PpCO2.
I have posted an issue at PyCO2SYS to ask it to play nicer with xarray, but I don't think we should bank on this happening (it could be a simple fix, or it could require a complete overhaul of the package, I don't know).
I also need to investigate running your code in a more distributed manner. At the moment you are doing everything (as far as I can tell) on a single processor, which could be the source of the memory issues. Distributing the work could help, but I have not yet tried to do this on Pangeo, so I will have to investigate.
There is a "quick and dirty" option here, which is to use `for` loops to do the necessary calculations at each time increment (and/or longitude point or depth level), and then average afterwards. This is the usual approach in languages like Matlab, where there is a maximum (and rather small) amount of loaded memory that can be handled at any given time. We generally try to avoid this in Python, but sometimes it is necessary. So the pseudo-code syntax would be something like:
```python
# loop through models
for model in models:
    ds = dd[model]
    # loop through times
    for t in times:
        # perform calculation at each time and assign into a new, temporary variable
        tmp[t] = function(ds.sel(time=t))
    # calculate the mean over time
    answer = tmp.mean('time')
```
In this configuration, you are only ever loading the dataset at a single timestep, during the call to `function`.
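To make that pattern concrete, here's a minimal runnable sketch; `expensive_calc` is a dummy stand-in for `calc_PpCO2` and the dataset values are placeholders:

```python
import numpy as np
import xarray as xr

def expensive_calc(ds_t):
    # dummy stand-in for calc_PpCO2 at a single timestep
    return ds_t["dissic"] * 400.0

ds = xr.Dataset(
    {"dissic": (("time", "y"), np.ones((5, 3)))},
    coords={"time": np.arange(5)},
)

# only one timestep's worth of data is touched per iteration,
# then the per-step results are stacked and averaged over time
tmp = [expensive_calc(ds.isel(time=t)) for t in range(ds.sizes["time"])]
answer = xr.concat(tmp, dim="time").mean("time")
```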
As a general comment, one of your issues might be that you are creating lots of different objects with lots of different names (`dd`, `dd_new`, `dd_new_new`, `dd_regrid_new`). That means that each of those objects is still being kept in memory. Now, I don't think this is the major issue you're facing, because of xarray's lazy computation (none of the data associated with these objects is actually loaded into memory), but it is generally better practice to keep only the objects that you actively need in your script, overwriting those that are no longer necessary.
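As a toy illustration of that point (nothing xarray-specific; the lists just stand in for your dicts of datasets):

```python
# Reusing one name lets the old object be garbage-collected...
data = list(range(100_000))      # stands in for dd
data = [x + 1 for x in data]     # the original list can now be freed

# ...whereas fresh names keep every intermediate alive:
dd = list(range(100_000))
dd_new = [x + 1 for x in dd]     # both lists stay in memory
del dd                           # explicitly drop the one you no longer need
```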
hey @gmacgilchrist, I implemented all your suggestions and tried it out with just one model (specifically one of the models that shut down the kernel). The relevant code looks like this:
```python
ds = dd['IPSL-CM6A-LR.gn.historical.Omon']
tmp = {}
for t in ds.time[0:239]:
    pco2 = calc_PpCO2(ds.sel(time=t).sel(x=slice(0, 20)).mean('x', keep_attrs=True))
    tmp[str(t.values)] = pco2
    # print(t)
mn = np.nanmean(list(tmp.values()), axis=0)
mn
```
The output for the 20-year average more or less looks good. The attributes of the dataset got lost, which is inconvenient, but I can certainly work around it.
```
array([[         nan,          nan,          nan, ..., 303.0522884 ,
        298.3220031 , 292.51188085],
       [         nan,          nan,          nan, ..., 303.06333186,
        298.33279287, 292.52428003],
       [         nan,          nan,          nan, ..., 303.09897837,
        298.363844  , 292.55589225],
       ...,
       [         nan,          nan,          nan, ...,          nan,
                 nan,          nan],
       [         nan,          nan,          nan, ...,          nan,
                 nan,          nan],
       [         nan,          nan,          nan, ...,          nan,
                 nan,          nan]])
```
I made a super rough plot and it looks good.
So good news and bad news. The good news is that the kernel did not shut down. I kept an eye on the memory while it ran and it never used more than 3.5 GB. The bad news: it took over an hour to run, just for the one model.
As I am writing this comment, I'm running a block of code for all of the models:

```python
pco2_mn = {}
for name, ds in dd.items():
    tmp = {}
    for t in ds.time[0:239]:
        if np.max(ds.lon) > 180:
            pp = calc_PpCO2(ds.sel(time=t).sel(x=slice(180, 200)).mean('x', keep_attrs=True))
        else:
            pp = calc_PpCO2(ds.sel(time=t).sel(x=slice(0, 20)).mean('x', keep_attrs=True))
        tmp[str(t.values)] = pp
    pco2_mn[name] = np.nanmean(list(tmp.values()), axis=0)
```
It will probably take more than an hour to run, so no matter what the output is (even if it looks wrong), I'm saving it.
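One way to checkpoint results like this, so a crashed connection doesn't cost the whole run, is to dump the dict of NumPy arrays to a single `.npz` file keyed by model name. A sketch with placeholder names and arrays:

```python
import os
import tempfile
import numpy as np

# placeholder results; the real dict maps model names to pco2 arrays
pco2_mn = {"modelA": np.ones((3, 2)), "modelB": np.zeros((3, 2))}

path = os.path.join(tempfile.gettempdir(), "pco2_mn.npz")
np.savez(path, **pco2_mn)        # one file, one array per model name

loaded = np.load(path)           # model names come back via loaded.files
```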
Great work! It's great that it's going through without crashing the kernel - that's a good start. The slowness is likely a result of looping in time - that's why we generally avoid it, but sometimes it's a necessary evil. This is also probably why you are losing your attribute data. If you are careful to keep track of what is being placed where during the for loop, you should be able to copy the attribute data across into your new output.
I will take a look at ways of improving the speed.
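On the attributes point, a minimal sketch of carrying `attrs` across a NumPy reduction (the `units` value here is made up):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4.0), dims="y", attrs={"units": "uatm"})

mean_np = np.nanmean(da.values)    # a plain NumPy reduction drops xarray metadata

mean_da = xr.DataArray(mean_np)    # wrap the result back up as a DataArray...
mean_da.attrs = dict(da.attrs)     # ...and copy the attributes across
```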
@gmacgilchrist If we can find a way to improve the speed, that would be great. When I was running it last night, my WiFi crashed, so I'm running it again this morning, and it's been running for three hours.
TL;DR My kernel is shutting down. I think I need to regrid over `lev` so everything has 45 `lev` levels instead of 75. How do I do that?

So my current regridding method is very similar to what @jbusecke helped me with. The only difference is that I reassigned the `y` coordinates, so `y` would be -90 to 90 instead of 0 to 180.

But I'm having an issue with the kernel shutting down when I calculate PpCO2 for models with 75 `lev` levels. Within the function specifically, the issue comes up with `pyco2.sys` (i.e. I can successfully get `p` and `insitutemp` but not `results`). The issue doesn't come up when I don't average over time (I have an example of the output without the time average, so if you want to see that for reference just let me know). I tried to make the datasets a bit smaller by averaging over `time` and `x` first and selecting only a subset of `y`.

So if I implement this for a dataset like `CanESM5-CanOE.gn.historical.Omon`, which has dimensions `lev: 45, y: 180`, then `calc_PpCO2(dd_pco2['CanESM5-CanOE.gn.historical.Omon'])` runs without shutting down the kernel (I'm using the large 16 GB kernel). It does take forever to run, which is inconvenient, but I can deal.

But for a dataset like `IPSL-CM6A-LR.gn.historical.Omon`, which has dimensions `lev: 75, y: 180`, `calc_PpCO2(dd_pco2['IPSL-CM6A-LR.gn.historical.Omon'])` uses more than 16 GB and shuts down the kernel. This also happens for `UKESM1-0-LL.gn.historical.Omon` and `CNRM-ESM2-1.gn.historical.Omon`, which also have dimensions `lev: 75, y: 180`.

I think the datasets with 75 `lev` levels are a bit too big for the kernel. Maybe I'm wrong, but I think I can solve this by regridding over the z dimension to have 45 (or fewer) `lev` levels. I looked through the xESMF documentation and some GitHub issues for a way to do this and found no solutions that made sense (I think some of the issues I looked through proposed solutions, but they were very confusing for a baby coder like me).

So my questions: do you think the extra `lev` levels are the reason the kernel is shutting down? And how do I regrid over `lev`? Also, if you have some ideas for how I can make the other datasets, which run but take forever, more efficient, I would appreciate that, but that's less important.

P.S. @gmacgilchrist This is the issue I mentioned on Wednesday that I was having with my polar stereographic projections, but now it's happening with my slices. So I love that for me...
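For the "how do I regrid over `lev`" question, one lightweight option (xESMF only handles horizontal regridding) is linear interpolation onto the coarser level set with xarray's `interp`. A sketch with a made-up 75-level profile standing in for the real variables:

```python
import numpy as np
import xarray as xr

# fake 75-level profile; real data would be thetao/so/talk/dissic
da = xr.DataArray(
    np.linspace(0.0, 10.0, 75),
    dims="lev",
    coords={"lev": np.linspace(0.0, 5500.0, 75)},
)

target_lev = np.linspace(0.0, 5500.0, 45)
da_45 = da.interp(lev=target_lev)   # linear interpolation onto 45 levels
```

Unlike `xgcm.Grid.transform`, this works on a plain coordinate and needs no grid object, though it assumes the variable varies smoothly between levels.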