COSIMA / cosima-recipes

A cookbook of recipes (i.e., examples) for analysing ocean and sea ice model output
https://cosima-recipes.readthedocs.io
Apache License 2.0

New python kernels silently give different results in `Cross-contour_transport.ipynb` #327

Closed schmidt-christina closed 1 month ago

schmidt-christina commented 5 months ago

When using a kernel newer than analysis3-23.04, the cross contour transport calculated in this notebook is wrong (see black line compared to blue line, where kernel analysis3-23.04 was used).

If anyone is keen to look into what has changed in the new python versions and packages that changes the results, that would be great and please assign yourself. Otherwise I will add a sentence about which kernel to use at the top of the notebook for now and look into it during the COSIMA hackathon.

[image: cross-contour transport computed with a newer kernel (black) vs analysis3-23.04 (blue)]

edit: changed kernel from analysis3-22.07 to analysis3-23.04

adele-morrison commented 5 months ago

Is it possible to tell which step of the code changed? i.e. at which point do the answers diverge?

adele-morrison commented 5 months ago

And does someone have the list of which packages changed between analysis3-22.07 and analysis3-22.10?

anton-seaice commented 5 months ago

Are you getting any new warnings? My guess is there is some change in masking or handling of NaNs.

@adele-morrison : ~This is the literal answer to your Q.~ But @dsroberts can advise better I think.

EDIT: Diff for 23.04 to 23.07: conda_diff_04_to_07.txt

schmidt-christina commented 5 months ago

And does someone have the list of which packages changed between analysis3-22.07 and analysis3-22.10?

Sorry, I just looked into it again. The change happened between analysis3-23.04 and analysis3-23.07

@anton-seaice There is no warning using the newer kernels except for one on chunking affecting performance.

dsroberts commented 5 months ago

Hi @schmidt-christina, @anton-seaice . The only major issue I'm aware of between those two revisions is the change in default chunking that affects the ERA5 data set. I've not seen that change any numerical output, just how long it takes to get there. Would be interesting to look into, which notebook is it?

adele-morrison commented 5 months ago

This one: https://github.com/COSIMA/cosima-recipes/blob/main/DocumentedExamples/Cross-contour_transport.ipynb

schmidt-christina commented 5 months ago

Probably easiest to test in the notebook Hannah created when she came across this issue (a condensed version of the notebook Adele pointed to, with only the code needed to select T/S on the contour and bin into density classes), but GitHub won't allow me to upload it here. Should I send it via email, @dsroberts?

adele-morrison commented 5 months ago

@schmidt-christina, nice instructions here from @aidanheerdegen on how to share code: https://forum.access-hive.org.au/t/how-to-share-your-jupyter-notebook/692

Maybe others can help also if we can make it public.

schmidt-christina commented 5 months ago

I uploaded the notebook here

navidcy commented 5 months ago

Good catch @schmidt-christina and thanks for raising this issue! This is quite alarming tbh that something would silently start giving different results! Could you confirm that you haven't silenced any warnings (sometimes we do that at the top of a notebook because warnings can overwhelm us -- I know I do at least!)?

navidcy commented 5 months ago

I think we should post this in the forum -- others might be getting strange results? I'm quite alarmed actually...

I was also getting different results in https://github.com/COSIMA/cosima-recipes/pull/322 but I had convinced myself it was because the methods for computing meridional heat transport depend on the reference temperature used. I will revisit the PR with this in mind and try using an older conda env....

access-hive-bot commented 5 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/newest-conda-env-kernels-silently-change-model-output-analysis-results/2010/1

navidcy commented 5 months ago

@schmidt-christina (or @salvajoe, @willaguiar, @hrsdawson), it would be extremely useful if you could strip down the notebook to what we call Minimal Working Example (MWE). That is, the minimum code for which the problem arises using kernel A and not for kernel B. The notebook at the moment has a lot of things -- are all of them needed to showcase the problem? Perhaps they are, I don't know. If they are then leave them!

[Usually this process of stripping down to a minimal working example helps a lot with pinpointing the issue and sometimes even leads to the solution!]

I know I'm asking you to do work, but could you please? This seems an important issue. Results that change silently after upgrading packages are very sneaky and insidious and could cause a lot of struggle for others, who are probably thinking "oh.. this must be my fault for not getting it right...". Or it could result in publishing wrong results!

On another note. What do lines:

import logging
logging.captureWarnings(True)
logging.getLogger('py.warnings').setLevel(logging.ERROR)

do? Do they suppress warnings? If we remove them do you get any meaningful warning perhaps?
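For context, a minimal sketch of what those three lines do (the variable names and messages here are illustrative, not from the notebook): `logging.captureWarnings(True)` reroutes `warnings.warn()` into the `py.warnings` logger, and setting that logger's level to ERROR silently drops the WARNING-level records that warnings produce.

```python
import io
import logging
import warnings

# Route warnings.warn() into the logging system, as the notebook does.
logging.captureWarnings(True)
warnings.simplefilter("always")  # disable de-duplication so every warning fires

logger = logging.getLogger("py.warnings")
logger.setLevel(logging.ERROR)   # the notebook's ERROR threshold drops WARNING records
logger.propagate = False         # keep this demo's output out of the root logger

stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

warnings.warn("this vanishes")   # captured at WARNING level, filtered out by ERROR
first_capture = stream.getvalue()

logger.setLevel(logging.WARNING) # lower the threshold and warnings reappear
warnings.warn("this shows up")
second_capture = stream.getvalue()
```

So yes, with the ERROR threshold in place any warning raised via `warnings.warn()` disappears without a trace.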

dsroberts commented 5 months ago

Hi all. I've managed to reproduce the incorrect behaviour in the analysis3/23.04 kernel. The issue seems to be that the sortby() call used in line 17 of cell 11 gives different output between the two kernels:

[screenshots: `vol_trans_across_contour.sortby(contour_ordering)[0].plot()` on analysis3/23.04 (sorted), the same call on analysis3/23.07, and `vol_trans_across_contour[0].plot()` on analysis3/23.04 (unsorted)]

So it seems like sortby() is doing nothing in the newer kernels. Unfortunately I don't know why that is yet, nor how to fix it. I just wanted to give you all a quick update before I head out for the rest of the day.

schmidt-christina commented 5 months ago

are all of them needed to showcase the problem?

All of the calculations in the notebook are needed (transport across an isobath binned into density classes is quite a long and complicated calculation). Calculations are only done for one kernel; we then load data saved from my 2023 study to compare against.

On another note. What do lines:

import logging
logging.captureWarnings(True)
logging.getLogger('py.warnings').setLevel(logging.ERROR)

do? Do they suppress warnings? If we remove them do you get any meaningful warning perhaps?

I tested that and removed the lines, but still no additional warnings.

navidcy commented 5 months ago

thanks @schmidt-christina for checking!

thanks @dsroberts for figuring this out -- holding onto my chair for whatever else this investigation might bring and for the end of the dataarray.sortby() episode!

dsroberts commented 5 months ago

OK, I think I've figured it out. The issue of why .sortby() doesn't do anything boils down to these lines here:

# get lat and lon along contour, useful for plotting later:
lat_along_contour = contour_ordering.y_ocean
lon_along_contour = contour_ordering.x_ocean
contour_index_array = np.arange(1, len(contour_ordering)+1)
# don't need the multi-index anymore, replace with contour count and save
lat_along_contour.coords['contour_index'] = contour_index_array
lon_along_contour.coords['contour_index'] = contour_index_array

tl;dr - the solution is to add .copy() onto the end of lat_along_contour=... and lon_along_contour= i.e.

lat_along_contour = contour_ordering.y_ocean.copy()
lon_along_contour = contour_ordering.x_ocean.copy()

The reason for this is a bit of a journey.

The difference in behaviour between the two conda environments comes down to the construction of the DataArrays on the right-hand side of the equals sign in the first two lines above. In 23.04, the DataArray created by running contour_ordering.y_ocean contains a copy of all of the objects needed from the original contour_ordering DataArray, including the pandas.MultiIndex object responsible for linking the contour_index coordinate to the data. In 23.07 this is not the case: the pandas.MultiIndex object associated with contour_ordering.y_ocean is a reference to that object in the contour_ordering DataArray. You can see this by using the is operator:

### 23.04
>>> lat_along_contour.get_index('contour_index') is contour_ordering.get_index('contour_index')
False
### 23.07
>>> lat_along_contour.get_index('contour_index') is contour_ordering.get_index('contour_index')
True

So when the following code runs:

lat_along_contour.coords['contour_index'] = contour_index_array

The original pandas.MultiIndex is deleted, as it is believed to no longer be necessary, which means that in 23.07 contour_ordering also loses its index and has it replaced with a default pandas.RangeIndex. This explains why .sortby(contour_index) does nothing. The .sortby() function first attempts to align the contour_index DataArray to the vol_trans_across_contour DataArray, and it aligns the coordinates successfully in both cases; however, because none of the elements in the newly created pandas.RangeIndex appears in the vol_trans_across_contour DataArray's pandas.MultiIndex in 23.07, it is unable to reorder the data along with the contour_index coordinate. Funnily enough, the actual np.lexsort() call within .sortby() does nothing in your case; the xr.align() call does all the work. [screenshot: the alignment step in both kernels]

Note how the coordinates of contour_order and b have been changed after the alignment in both notebooks (analysis3-23.04 on the left, analysis3-23.07 on the right), but the array on the right has not been changed.

Anyway, the solution is to add .copy() onto the end of lat_along_contour=... and lon_along_contour= i.e.

lat_along_contour = contour_ordering.y_ocean.copy()
lon_along_contour = contour_ordering.x_ocean.copy()

This copies the objects within the contour_ordering.y_ocean DataArray, and therefore a copy of the original pandas.MultiIndex is created and assigned to lat_along_contour. This means the original index object in contour_ordering is not removed when the coords in lat_along_contour and lon_along_contour are modified, so xr.align(), and therefore .sortby(), works as expected. I think this behaviour is erroneous, as an object has been deleted while there are still active references to it, so I think it's worth submitting an issue on the xarray GitHub, assuming it hasn't already been raised.
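The reference-versus-copy distinction at the heart of this is the same one familiar from plain NumPy (an analogy only, not the xarray internals): mutating through a shared reference changes the original object, while `.copy()` decouples the two.

```python
import numpy as np

a = np.arange(5)
ref = a          # another name for the same array object
cpy = a.copy()   # an independent array with its own data

ref[0] = 99      # mutate through the shared reference
# a[0] is now 99 as well, but the copy still holds the original value 0
```

In 23.07, `contour_ordering.y_ocean` behaves like `ref` rather than `cpy` with respect to the shared MultiIndex, which is why modifying its coords reaches back into `contour_ordering`.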

adele-morrison commented 5 months ago

Nice!

navidcy commented 5 months ago

Omg. Excellent detective work

AndyHoggANU commented 5 months ago

Nice work @dsroberts !!

dsroberts commented 5 months ago

So I've dug a little further and tried to make a reproducer, and it turns out that xarray does warn you that this might cause problems.

>>> c.coords['contour_index'] = np.arange(1,len(yc)+1)
/jobfs/113635825.gadi-pbs/ipykernel_1298317/4250819915.py:1: FutureWarning: updating coordinate 'contour_index' with a PandasMultiIndex would leave the multi-index level coordinates ['y_ocean', 'x_ocean'] in an inconsistent state. This will raise an error in the future. Use `.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])` before assigning new coordinate values.
  c.coords['contour_index'] = np.arange(1,len(yc)+1)

And it does: the PandasMultiIndex is indeed in an inconsistent state. However, when I run import cosima_cookbook as cc, that warning disappears. This has the effect of suppressing everything coming out of warnings.warn() (not just warnings from the cosima cookbook) unless a user knows to initialise the py.warnings logger. Which, to be fair, the original author of this notebook attempted to do, but without adding a handler, so the warnings were going into the void. Placing the following

import sys
import logging
console_handler = logging.StreamHandler(sys.stderr)
logger = logging.getLogger("py.warnings")
logger.addHandler(console_handler)

in the first cell of @schmidt-christina's notebook causes that warning to appear in a few places. Given this, I think the best place to raise an issue would be over on the COSIMA cookbook repo, as xarray did try to warn us, even on analysis3-23.04, except warnings were turned off. The fix suggested in the warning (.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])) does work for the notebook, i.e.:

# get lat and lon along contour, useful for plotting later:
lat_along_contour = contour_ordering.y_ocean.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])
lon_along_contour = contour_ordering.x_ocean.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])
contour_index_array = np.arange(1, len(contour_ordering)+1)
# don't need the multi-index anymore, replace with contour count and save
lat_along_contour.coords['contour_index'] = contour_index_array
lon_along_contour.coords['contour_index'] = contour_index_array

and is a more appropriate solution than using .copy(). Dropping vars removes any copy or reference to the pandas.MultiIndex object from contour_ordering so it can't get messed up when updating the coordinates.
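For anyone wanting to try this pattern in isolation, here is a toy sketch (assuming a recent xarray; the data is made up and the coordinate names merely mimic the notebook's): build a DataArray with a MultiIndex over (y_ocean, x_ocean), then drop every coordinate tied to the MultiIndex before assigning the new contour count, as the FutureWarning recommends.

```python
import numpy as np
import xarray as xr

# Toy stand-in for contour_ordering: a MultiIndex over (y_ocean, x_ocean).
da = xr.DataArray(
    np.arange(4.0),
    dims="contour_index",
    coords={
        "y_ocean": ("contour_index", [0, 0, 1, 1]),
        "x_ocean": ("contour_index", [0, 1, 0, 1]),
    },
).set_index(contour_index=["y_ocean", "x_ocean"])

# Drop the MultiIndex and both of its level coordinates together, then
# assign the plain contour count as the new dimension coordinate.
lat = da.drop_vars(["contour_index", "y_ocean", "x_ocean"])
lat.coords["contour_index"] = np.arange(1, len(da) + 1)
```

Because `drop_vars` returns a new object with no reference to the original MultiIndex, `da` (standing in for `contour_ordering`) keeps its index intact.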

navidcy commented 5 months ago

@schmidt-christina would you like to open a PR to fix this example?

navidcy commented 5 months ago

@adele-morrison I'm hoping we fix this before the hackathon; this is quite urgent. At the moment we have online an example that silently gives wrong results and we are potentially creating nightmares for people.

@dsroberts did an amazing detective job above figuring out the issue

could someone open a PR and fix the example?

cc @willaguiar, @schmidt-christina, @salvajoe

navidcy commented 5 months ago

Even a PR that simply deletes the example is better than having something there that is wrong and makes people lose sleep or potentially harms their mental health...

If nobody has the capacity to fix it I can open a PR to delete the example and whenever we have the capacity we can put it back corrected.

adele-morrison commented 5 months ago

Even a PR that simply deletes the example is better

Hmmm, this seems a little drastic. What about a PR that says analysis3-23.04 or earlier is needed? And then we can fix properly at the hackathon?

navidcy commented 5 months ago

A caveat message is also a good solution! And if we do that, then we can also link to this issue.

(Deleting doesn't mean much... the notebooks always stay in the GitHub history! We just don't want this to be advertised as an example that we present to people as "working out of the box", which they then take and modify to continue their work.)

anton-seaice commented 5 months ago

I just ran this using analysis3-24.01 and there is a different error / it is failing earlier.

The method that extracts the points along the contour from matplotlib is failing. There should be a second 'tomato'-coloured line in this plot from the Select the contour step.

[screenshot: contour plot from the Select the contour step, missing the second 'tomato' line]

adele-morrison commented 5 months ago

I just ran this using analysis3-24.01 and there is a different error / it is failing earlier.

@anton-seaice shall we raise a new issue?

anton-seaice commented 5 months ago

I just ran this using analysis3-24.01 and there is a different error / it is failing earlier.

@anton-seaice shall we raise a new issue?

I don't mind, but it would "hopefully" get fixed in the same PR, so I wouldn't?

adele-morrison commented 2 months ago

Regarding @anton-seaice's bug about finding the contour: