abby-baskind / seniorthesis


i was so stoked for my new figs and then i realized they looked sus... #7

Open abby-baskind opened 3 years ago

abby-baskind commented 3 years ago

@gmacgilchrist so i tried to throw together some new figs for dissic, talk, thetao, and so for the models in the subset ['IPSL-CM6A-LR', 'CNRM-ESM2-1', 'CESM2', 'CanESM5', 'CanESM5-CanOE', 'MPI-ESM-1-2-HAM', 'UKESM1-0-LL', 'MPI-ESM1-2-LR', 'MPI-ESM1-2-HR', 'CESM2-WACCM', 'GISS-E2-1-G', 'GISS-E2-1-G-CC', 'MIROC-ES2L', 'ACCESS-ESM1-5', 'CESM2-WACCM-FV2', 'CESM2-FV2'], where grid_label = 'gn', experiment_id = 'historical', and table_id = 'Omon'. I was thinking the output would have 16 subfigures (one for each model), but that didn't happen. for example...

[figure: thetao]

in the plots of thetao (not unique to thetao, but a good example), there are 4 CanESM5 panels. presumably there is a slight difference between these datasets (maybe different runs of the same model, or a different attribute that i'm not seeing). this is a terribly non-specific question, but why? and how do i deal with it?

here's a block of code that might be relevant, and here's the link to the whole notebook: https://github.com/abby-baskind/seniorthesis/blob/9b70164a7e6bc418fb56f15f8cf09c233e282c3a/notebooks/figs_slices.ipynb

# col is the intake-esm catalog opened earlier in the notebook
cat_thetao = col.search(source_id = ['IPSL-CM6A-LR', 'CNRM-ESM2-1', 'CESM2', 'CanESM5', 'CanESM5-CanOE',
                             'MPI-ESM-1-2-HAM', 'UKESM1-0-LL', 'MPI-ESM1-2-LR', 'MPI-ESM1-2-HR',
                             'CESM2-WACCM', 'GISS-E2-1-G', 'GISS-E2-1-G-CC', 'MIROC-ES2L', 'ACCESS-ESM1-5',
                             'CESM2-WACCM-FV2', 'CESM2-FV2'],
                        variable_id = 'thetao', experiment_id = 'historical',
                        table_id = 'Omon', grid_label = 'gn')
# aggregate=False keeps every dataset separate: one dictionary entry per dataset
dd_thetao = cat_thetao.to_dataset_dict(
    zarr_kwargs={'consolidated': True, 'use_cftime': True},
    storage_options={'token': 'anon'},
    preprocess=combined_preprocessing,
    aggregate=False)
jbusecke commented 3 years ago

Hi @abby-baskind, you should be able to see if these are different at all by listing the keys of the dataset dictionary.

Can you paste the output of

print([k for k in dd_thetao.keys() if 'CanESM' in k])

These keys are a combination of the unique attributes of each dataset (I just filtered for the CanESM models, but you can remove the if ... clause and see the keys for all models).

My suspicion: These are different members of the same model. Many of the CMIP6 models are run as an ensemble, which means that several experiments are branched off the original piControl run at different times, but then are run with the same forcing (here the historical forcing). This enables us to get an idea of how much of a given signal is internal variability (differences between members) vs. forced variability (the common signal between members).
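
As an aside, those member IDs follow the CMIP6 variant-label pattern r<realization>i<initialization>p<physics>f<forcing>. A small hypothetical helper (not part of any library, just for illustration) to split one apart:

import re

def parse_variant(label):
    # e.g. 'r3i1p1f2' -> {'realization': 3, 'initialization': 1, 'physics': 1, 'forcing': 2}
    m = re.match(r'r(\d+)i(\d+)p(\d+)f(\d+)', label)
    return dict(zip(('realization', 'initialization', 'physics', 'forcing'),
                    map(int, m.groups())))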

If it turns out that my suspicion is right, you might for now just want to work with a single member per model (and add complexity later). You can find an example of how to do that with cmip6_preprocessing here.
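
For illustration, a minimal hand-rolled version of the same idea (the linked cmip6_preprocessing example is the cleaner route); this sketch assumes each dataset in dd_thetao carries source_id in its attrs:

first_member = {}
for key, ds in dd_thetao.items():
    source = ds.attrs['source_id']
    # keep only the first member we encounter for each model
    if source not in first_member:
        first_member[source] = ds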

abby-baskind commented 3 years ago

hey @jbusecke i'm getting an error importing postprocessing

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-8-3a65b39712db> in <module>
      8 import gsw
      9 warnings.filterwarnings("ignore")
---> 10 from cmip6_preprocessing.postprocessing import combine_datasets

ModuleNotFoundError: No module named 'cmip6_preprocessing.postprocessing'

not sure what to do about this since cmip6_preprocessing is already installed...

jbusecke commented 3 years ago

This means that the version of cmip6_preprocessing is old. You can confirm that by doing:

import cmip6_preprocessing
print(cmip6_preprocessing.__version__)

Updating dependencies can be quite a pain on the pangeo cloud, but I think this one should be easy:

You can run this line in a terminal:

mamba update cmip6_preprocessing -y

Or run this from a notebook cell:

! mamba update cmip6_preprocessing -y

The ! enables you to execute shell commands from the notebook.

But here is the catch: you will have to do this at the beginning of every session, because each time you log out, the cloud environment is restored to its default. Once you have updated, you might also need to restart the notebook kernel for the changes to take effect.

To remember this, you could put something like this in an early cell of your notebook:

import cmip6_preprocessing
# note: plain string comparison of versions is only a rough check, but it works here
if cmip6_preprocessing.__version__ < '0.4.0':
    print('Manually update cmip6_pp with `mamba update cmip6_preprocessing -y`, and restart the notebook')
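
A slightly more robust variant of the same check, assuming the packaging library is available in the environment:

from packaging.version import Version
import cmip6_preprocessing

if Version(cmip6_preprocessing.__version__) < Version('0.4.0'):
    print('Manually update with `mamba update cmip6_preprocessing -y`, then restart the kernel')
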
abby-baskind commented 3 years ago

thanks @jbusecke you are truly the best

jbusecke commented 3 years ago

Oh, you might also want to put the member_id (ds.attrs['variant_label'] ... I know, why are there two names for the same thing ... it's hella annoying) in the title, so you can see which member is plotted.
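
For instance, building on the hypothetical first_member dict sketched above (not the notebook's actual plotting code):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(4, 4, figsize=(16, 12))
for ax, ds in zip(axes.flat, first_member.values()):
    # ...plot the section for this model on ax, then label it...
    ax.set_title(f"{ds.attrs['source_id']} {ds.attrs['variant_label']}")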

jbusecke commented 3 years ago

Sorry, that was too much coffee and a lagging mouse.

gmacgilchrist commented 3 years ago

@jbusecke Does this mean that the native install of cmip6_preprocessing on Pangeo has fallen behind? Can we get that updated easily?

I agree that the apparent multiple models could well be ensemble members of the same configuration. As Julius suggests, take a closer look at the specific details of the datasets to see how they differ. For example, the full name associated with each of these model iterations should reveal the difference somewhere.

abby-baskind commented 3 years ago

so i used the strategy julius recommended to pick the first member of each model, and printed out the source_id and variant_label for the selected models (this time it's so and thetao), and the output raised my hopes so high...

# output for so
CanESM5r3i1p1f1
CNRM-ESM2-1r3i1p1f2
MPI-ESM1-2-LRr9i1p1f1
GISS-E2-1-Gr6i1p1f1
IPSL-CM6A-LRr28i1p1f1
MIROC-ES2Lr7i1p1f2
MPI-ESM1-2-HRr3i1p1f1
ACCESS-ESM1-5r5i1p1f1
UKESM1-0-LLr6i1p1f3
CanESM5-CanOEr3i1p2f1
CESM2r2i1p1f1
MPI-ESM-1-2-HAMr2i1p1f1
CESM2-WACCM-FV2r1i1p1f1
CESM2-FV2r1i1p1f1
CESM2-WACCMr2i1p1f1
GISS-E2-1-G-CCr1i1p1f1

so with so much optimism, I plotted. and my hopes were destroyed.

[figure: so1]

[screenshot: Screen Shot 2021-07-01 at 7 06 27 PM]

as you can probably tell, these plots are so sus, most notably because some of them include latitudes that don't exist. so i restricted my x axis to latitudes that do exist, hoping the values at those real latitudes would at least look kind of normal, but nope

[figure: so2]

so i'm not really sure what's going on now :/

gmacgilchrist commented 3 years ago

What did you change in the plotting procedure? The y-range should not have changed like this just from selecting a different subset of models...

Take a closer look at the data arrays themselves, and the corresponding coordinates. Where and how are they changing to become this wider range, i.e. at what point in your code is this error coming in?

abby-baskind commented 3 years ago

@gmacgilchrist i wasn't able to figure out exactly when the error happened, but somewhere along the way the y coordinates changed from latitudes to (i think) indices. i started the whole thing over from scratch, avoided that error, and made some more reasonable plots. the only issue i'm having now is that a couple of models are still goofy. CESM2-FV2 was funky in all the plots. For example, the right-most subplot on the bottom row...

[figure: pot_temp]

i looked in the data array to see what y was, and it was wacky...

y (y) float64 -70.67 -70.14 -69.6 ... 4.767e+36 4.767e+36

none of the other models had this issue so that's weird

also, the GISS models gave some weird values for DIC

[figure: dic]

I printed out some info on one of the GISS models (you can see it here https://github.com/abby-baskind/seniorthesis/blob/089b9e650c5a5864a2bdabb4220165cca2436345/notebooks/better_fig_slices.ipynb at In[38]) and nothing stood out as weird. is there a better way to get more info on this model to try to figure out what's wrong?

gmacgilchrist commented 3 years ago

@abby-baskind For the GISS models, try setting the colorbar limits to something reasonable. Sometimes single huge values can throw the limits off, making the plot look crazy. That said, the values here really do look huge. I suspect that these data have been submitted incorrectly. Take a bit of a closer look, but if the numbers really are crazy, we can omit these models for now and follow up on getting the correct data.

Yes, the latitude variable for CESM looks strange indeed. Can you look at the latitude more closely - is it just a few values that are off, or something more foundational? It seems like you might be inadvertently making some changes to the coordinates of the y dimension (this would likely be the reason why they appeared to be lost in the plot you did before). I can't see where that might be in your code. It could also be an issue with the CESM output - @jbusecke might be able to shed some light on this.

General advice when there seems to be an issue: strip back the complexity of your code and look in a bit more detail at the individual models and the variables therein. Regarding the GISS model, for example, can you make a simple plot of just the surface DIC? If the numbers are still all over the place, that's a clear indication that the data has not been stored correctly.
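
A minimal sketch of that quick look (assuming the GISS dataset is in hand as ds, with dissic as the DIC variable and the lev/time dimensions and x/y coordinates that combined_preprocessing usually produces):

# surface DIC at the first time step; vmin/vmax guard against rogue values
ds.dissic.isel(lev=0, time=0).plot(x='x', y='y', vmin=0, vmax=5)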

gmacgilchrist commented 3 years ago

Taking a bit of a closer look, I wonder if some of these issues (particularly as it concerns the latitude coordinate) may be arising in the combine_datasets function. I'm not familiar with that functionality, so perhaps @jbusecke can shed some light on whether it could be muddling up the coordinates for some models?

@abby-baskind You could take a look at this by looking just at the CESM model for which the issues appear, but before you do any combining of datasets. Can you just load that dataset and take a look - are the issues there from the start?

abby-baskind commented 3 years ago

so for the CESM model, i plotted all the members (not using combine_datasets) and still got some funky coordinates, so it seems combine_datasets isn't the issue.

[screenshot: Screen Shot 2021-07-03 at 11 51 37 AM]

I peeked into the array to figure out where the coordinates went awry. Here's y[90:100]: [7.596783e+01, 7.623199e+01, 7.644568e+01, 7.662436e+01, 7.677391e+01, 7.689791e+01, 4.984605e+35, 4.984605e+35, 4.984605e+35, 4.984605e+35] (so it goes wrong at index 96). I restricted the x range to [-90, 90], and at least up to 20°N it looked reasonable. I'm currently not sure what's happening north of 20°N.
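
For reference, a quick way to locate where a coordinate goes bad (a sketch, assuming the suspect dataset is loaded as ds):

import numpy as np

# fill values masquerading as latitudes: anything outside [-90, 90]
bad = np.where(np.abs(ds.y.values) > 90)[0]
print(bad[0] if bad.size else 'all good')   # first bad index (96 in this case)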

[screenshot: Screen Shot 2021-07-03 at 11 51 46 AM]

I'm still trying to figure out what's happening with the GISS models. I haven't managed to plot surface DIC like you suggested (so far, no luck), but in the meantime, I plotted all the GISS model members like I did with CESM (no combine_datasets) and restricted the colorbar to more reasonable values (between 0 and 5 mol/m^3, specifically). It still looks sus (which I suspected would happen, since if you look at the plot I posted earlier, you can actually see that a lot of the DIC values are negative). honestly, I'm losing a lot of faith in the GISS models and they might just need to be scrapped for now

[screenshot: Screen Shot 2021-07-03 at 11 52 20 AM]

What might be worth trying is plotting these models for grid_label = 'gr' and/or table_id = 'Oyr' instead of 'gn' and 'Omon'. Thoughts on that?

gmacgilchrist commented 3 years ago

Great investigative work, @abby-baskind ! 🕵🏼

Definitely looks like the coordinates are muddled with CESM. Perhaps this is something that can be looked at in cmip6_preprocessing @jbusecke ? This could well be solved by taking the gr grid for this model.

Also agree that GISS doesn't look great - my best guess is that they've posted the wrong data (this happens sometimes). Suggest you scrap it for now. Surface data can be plotted using .isel(lev=0). Doing it in a proper map projection with cartopy is not necessary for exploration/debugging purposes.

Definitely worth checking to see if gr or yearly data for these models looks any better 👍🏼 Otherwise just press ahead for now with the models for which the data looks decent.

jbusecke commented 3 years ago

Sorry just got to this. I second @gmacgilchrist, that we need to look at these models individually.

1) How are you slicing/selecting a longitudinal section?

2) I think most of your issues are related to not having a proper latitude coordinate. The lat values are 2D, and if you do any kind of reduction you might lose them. y can be an index or a nominal latitude coordinate; neither is at all ideal in the Arctic region.

3) What could also really screw with the data is if the latitude values have NaNs in them. A simple way to check is to plot the data without coordinates, plt.imshow(ds.thetao.data), and see if that still looks wonky; if it doesn't, your problem is definitely with the coordinates (see the sketch below).
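
Spelled out, that coordinate-free check might look like this (a sketch; the time/lev selection is an assumption about your data layout):

import matplotlib.pyplot as plt

# plot the raw array with no coordinates attached: if this looks fine,
# the data are OK and the lon/lat coordinates are the problem
da = ds.thetao.isel(time=0, lev=0)
plt.imshow(da.data, origin='lower')
plt.colorbar()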

I would suggest:

Short-term investigation: plot a map of the surface data, and check whether the models you are looking at appear particularly wonky (e.g. distorted) when you plot them just against x/y.

Longer-term solution: I think the cleanest way to do this is to horizontally regrid all gn models onto regular lon/lat grids. That way you can be sure that a slice at a given latitude actually represents the corresponding position on the globe. I am actually working on implementing this in cmip6_pp, so if you can wait a bit, this might solve itself.
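
For reference, a rough sketch of what that regridding could look like by hand (not the planned cmip6_pp implementation), assuming xesmf is installed and ds carries the 2D lon/lat coordinates that combined_preprocessing attaches:

import xesmf as xe

# regular global 1-degree target grid
target = xe.util.grid_global(1.0, 1.0)
regridder = xe.Regridder(ds, target, 'bilinear',
                         periodic=True, ignore_degenerate=True)
thetao_1x1 = regridder(ds.thetao)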

jbusecke commented 3 years ago

Just to add to the above:

Definitely looks like the coordinates are muddled with CESM. Perhaps this is something that can be looked at in cmip6_preprocessing @jbusecke ? This could well be solved by taking the gr grid for this model.

CESM definitely has NaNs in the lon/lat fields!
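
A quick way to confirm that (a sketch, assuming the suspect CESM dataset is loaded as ds):

# nonzero counts mean NaNs are hiding in the coordinate fields
print(int(ds.lon.isnull().sum()), int(ds.lat.isnull().sum()))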