coecms / xmhw

Xarray version of Marine Heatwaves code by Eric Olivier
https://xmhw.readthedocs.io/en/latest/
Apache License 2.0
21 stars 10 forks source link

`unique_dropna` function "Must produce aggregated value" #40

Closed Eisbrenner closed 2 years ago

Eisbrenner commented 2 years ago

Hi, I run into an issue with the unique_dropna function applied in the pandas aggregation in agg_df. In my current pandas version (1.4.0) the return type of unique_dropna (a numpy.ndarray) is not valid for an aggregation.

In an experimental fix I just used the first value of the return array (assuming that it would be one at all times anyway), which seems to work. But I'm not quite sure if this conflicts with some other part or ideas.

I got the, lets say solution, through this question/answer https://stackoverflow.com/a/39842473

So here the quick and dirty fix:

def unique_dropna(s):
    """Apply dropna before calling unique to return only non NaN values"""
    uniques = s.dropna().unique()
    if len(uniques) > 1:  # maybe `!= 1` instead if required at all
        raise ValueError("There can be only one.")
    return uniques[0]

# or in short
def unique_dropna(s):
    """Apply dropna before calling unique to return only non NaN values"""
    return s.dropna().unique()[0]

Best regards and thanks for providing this package :)

full error log below:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/xmhw/xmhw.py", line 340, in detect
    results = dask.compute(mhwls)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/dask/base.py", line 571, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/distributed/client.py", line 2691, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/distributed/client.py", line 1946, in gather
    return self.sync(
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/distributed/utils.py", line 310, in sync
    return sync(
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/distributed/utils.py", line 364, in sync
    raise exc.with_traceback(tb)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/distributed/utils.py", line 349, in f
    result[0] = yield future
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/distributed/client.py", line 1811, in _gather
    raise exception.with_traceback(traceback)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/xmhw/identify.py", line 353, in define_events
    dfmhw = mhw_features(df, len(idxarr)-1)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/xmhw/features.py", line 91, in mhw_features
    df = agg_df(dftime)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/xmhw/features.py", line 121, in agg_df
    dfout = df.groupby('events').agg(
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 869, in aggregate
    result = op.agg()
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/apply.py", line 168, in agg
    return self.agg_dict_like()
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/apply.py", line 475, in agg_dict_like
    results = {
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/apply.py", line 476, in <dictcomp>
    key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 271, in aggregate
    ret = self._aggregate_multiple_funcs(func)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 326, in _aggregate_multiple_funcs
    results[key] = self.aggregate(func)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 287, in aggregate
    return self._python_agg_general(func, *args, **kwargs)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1481, in _python_agg_general
    result = self.grouper.agg_series(obj, f)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 981, in agg_series
    result = self._aggregate_series_pure_python(obj, func)
  File "/media/apate/Kroker/marine-heatwaves/.venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 1010, in _aggregate_series_pure_python
    libreduction.check_result_array(res, group.dtype)
  File "pandas/_libs/reduction.pyx", line 13, in pandas._libs.reduction.check_result_array
    cpdef check_result_array(object obj, object dtype):
  File "pandas/_libs/reduction.pyx", line 21, in pandas._libs.reduction.check_result_array
    raise ValueError("Must produce aggregated value")
ValueError: Must produce aggregated value
paolap commented 2 years ago

Hi, thanks for pointing this out, I have pandas v1.3.5 installed, so I guess that's why this never happen and thanks for proposing a solution. I'll update pandas and look into this later today

paolap commented 2 years ago

Hi, this can actually be simplifies further by simply using

        index_start = ('start', 'first'),
        index_end = ('end', 'first'),

in the agg_df() function (roughly line 118 in features.py)

I didn't realised before this would return the first non-NaN value in the series rather than the first actual value. I'm not sure if this fix would work with pandas 1.4.0 or 1.4.1 as we have pinned our pandas to 1.3.5, as 1.4.0 was breaking other packages. As the gruopby.agg.first is used elsewhere I'm assuming this shouldn't be an issue Thanks again for pointing this out, I might not push the changes yet to the main branch, as there were other parts I was working on, but I pushed them to the 'timestep' branch

paolap commented 2 years ago

This is now resolved in the new release 8.0