coecms / xmhw

Xarray version of Marine Heatwaves code by Eric Olivier
https://xmhw.readthedocs.io/en/latest/
Apache License 2.0
22 stars 10 forks source link

Issue with xarray indexing in v0.9.0 #56

Closed CapeHobbit closed 1 year ago

CapeHobbit commented 1 year ago

Hi. Firstly, thanks for this excellent package, I am using it extensively in my Copernicus marine training.

Historically, I was using version 0.8.0 without issue. To do this, though, I had to "pin" to an older version of xarray, which is getting harder to maintain as part of a wider environment. I wanted to move up to 0.9.0 and free up xarray to iterate to a newer version, but I run into the following problem when I run the threshold routine;

Error

This may be related to how I treat the initial data as I average it into a 1 dimensional time series before I try to run the thresholding. However, I have tried this without the spatial averaging and get different errors related to dask.

My processing is as follows;

SST_data = xr.openmfdataset(glob.glob(os.path.join(os.getcwd(),'OSTIA*.nc'))) SST_spatial_average = SST_data.mean(dim='latitude').mean(dim='longitude') SST_time_series = SST_spatial_average["analysed_sst"].load() - 273.15 clim = threshold(SST_time_series)

Are you able to offer any insight?

Cheers,

Ben

You can find my test data set here (needs to be unzipped):

OSTIA_SST.nc.zip

paolap commented 1 year ago

Hi Ben, I never had this issue before, could you just confirm the following, so I'm sure I'm going to test with the same packages: xarray, pandas version and if the exact same workflow was working with 0.8.0 and an older version of xarray

CapeHobbit commented 1 year ago

My working example has the following versions under Python 3.9:

xarray 2022.3.0 pyhd8ed1ab_0 conda-forge pandas 1.5.3 py39hecff1ad_0 conda-forge xmhw 0.8.0 py_0 coecms

The xarray and xmhw versions are "pinned" in my yaml file. If I let them float freely then my versions are: xarray 2023.2.0 pyhd8ed1ab_0 conda-forge pandas 1.5.3 py39hecff1ad_0 conda-forge xmhw 0.9.0 py_0 coecms

I haven't made any changes to the routine.

florianboergel commented 1 year ago

Hey could you check if you haven a singleton coordinate left that has no use? For example, I iteratate over depth levels and need to drop the depth information.

sst = temperatureData.drop("depth").TEMP
sst = sst.chunk({'time': -1, 'lat': 'auto', 'lon': 'auto'})
sst.data

with that I can avoid this error.

CapeHobbit commented 1 year ago

Will do and report back tomorrow. Thanks!

CapeHobbit commented 1 year ago

Unfortunately, I can confirm that I have no singleton dimensions, and yet the error persists. Below is the xarray dump of the array I submit to threshold;

`<xarray.DataArray 'analysed_sst' (time: 3804)> dask.array<sub, shape=(3804,), dtype=float32, chunksize=(3804,), chunktype=numpy.ndarray> Coordinates:

It has dims of ('time',)

I admit, that final comma in the dims makes me nervous, but it seems to be how xarray displays the dimension data, and not indicative of a singleton dimension.

florianboergel commented 1 year ago

Maybe it is related to the dimension point that is added in the function land_check

    dims = list(temp.dims)
    # Add an extra fake dimensions if array 1-dimensional
    # so a 'cell' dimension can still be created
    dims.remove(tdim)
    if len(dims) == 0:
        temp = temp.expand_dims({"point": [0.0]})
        dims = ["point"]

Since your error is also pointing towards a variable called point. But I cannot really offer help to be honest.

CapeHobbit commented 1 year ago

Thanks @florianboergel - I'll dig a little deeper. I don't suppose you could add a small, known working netCDF file here, could you? It would be good to separate the toolkit from the data source before I start debugging.

paolap commented 1 year ago

It looks like the workaround to fix the coordinates issue with the new xarray dimension doesn't work for this extra dimension. Point is added if you have a "point" time series so stacking the dimensions will still work. I'll try to fix this today.

paolap commented 1 year ago

I made a new release (0.9.2) that should fix this. As it was getting annoying to fix the weird coordinates behaviour, I just introduced a check at the start of each function. If a 1-dimensional series is passed it sets a boolean variable 'point' to True and uses this to skip all the functions that handle multiple dimensions. Before it was trying to add the "point" fake dimension, so there was only one workflow independently from the number of dimensions.

CapeHobbit commented 1 year ago

Yup, works perfectly! Thank you so much.

florianboergel commented 1 year ago

I think the new point variable needs to be initialized as False, to avoid erros if the time series is not 1D. Referring to xmhw.py line 133 @paolap

Something like

    if len(dims) == 1:
        point = True
    else:
        point = False

since later on point is evaluated:

    if point:
        ts = temp
    else:
        ts = land_check(temp, tdim=tdim, anynans=anynans)
paolap commented 1 year ago

No you're right! I forgot as I first introduced it as an extra argument and then decided it was easier to just work it out! I should add a test for this as none of them failed. I'm pushing the changes now, thanks for pointing that out!

paolap commented 1 year ago

Done a new release 0.9.3 with the correction above, thanks!

florianboergel commented 1 year ago

@CapeHobbit I guess the issue can be closed then ;-)

CapeHobbit commented 1 year ago

@florianboergel @paolap - yes! Sorry about that :)