aodn / python-aodntools

Repository for templates and code relating to generating standard NetCDF files for the Australian Ocean Data Network
GNU Lesser General Public License v3.0

Pandas TypeErrors in hourly_timeseries #117

Closed mhidas closed 2 years ago

mhidas commented 4 years ago

A couple of similar errors while trying to create the hourly products in the pipeline for some sites.

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
  ...
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodntools/timeseries_products/hourly_timeseries.py", line 405, in PDresample_by_hour
    ds_var_mean = ds_var.resample('1H').apply(function_dict[variable]).astype(np.float32)

for

mhidas commented 4 years ago

While fixing this, we should also apply the workaround for the pandas.Timedelta units issue, as done in the velocity hourly code (https://github.com/aodn/python-aodntools/pull/99#discussion_r391440014).
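The linked discussion is not reproduced in this thread, so what follows is only a sketch of one way to sidestep the keyword problem. It assumes the issue is the `units=` spelling seen in the `pd.Timedelta(30, units='m')` call in the traces below; the documented keyword is `unit`, and the named-argument form `minutes=30` avoids the question entirely. The sample timestamps are made up for illustration.

```python
import pandas as pd

# Unambiguous half-hour offset: no reliance on the `unit` keyword at all.
half_hour = pd.Timedelta(minutes=30)

# Illustrative timestamps only (not taken from any real deployment).
index = pd.DatetimeIndex(["2020-01-01 00:40", "2020-01-01 01:10"])

# Subtracting a Timedelta from a DatetimeIndex shifts every timestamp.
shifted = index - half_hour
print(shifted)
```

The named-argument form also makes the intended unit obvious to a reader, which a bare `30` does not.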

mhidas commented 4 years ago

The full stack traces are:

TypeError: Operation sub between float64 and Timedelta is invalid
Traceback (most recent call last):
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodncore/pipeline/handlerbase.py", line 1052, in run
    self.trigger(transition['trigger'])
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 65, in _get_trigger
    return machine.events[trigger_name].trigger(model, *args, **kwargs)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 405, in trigger
    return self.machine._process(func)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 1073, in _process
    return trigger()
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 423, in _trigger
    return self._process(event_data)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 433, in _process
    if trans.execute(event_data):
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 279, in execute
    machine.callback(func, event_data)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 1031, in callback
    func(*event_data.args, **event_data.kwargs)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodndata/moorings/products_handler.py", line 390, in preprocess
    self._make_hourly_timeseries()
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodndata/moorings/products_handler.py", line 300, in _make_hourly_timeseries
    **self.product_common_kwargs)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodntools/timeseries_products/hourly_timeseries.py", line 507, in hourly_aggregator
    df_temp = PDresample_by_hour(df_temp, function_dict, function_stats)  # do the magic
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodntools/timeseries_products/hourly_timeseries.py", line 399, in PDresample_by_hour
    df.index = df.index - pd.Timedelta(30, units='m')
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 121, in index_arithmetic_method
    return self._evaluate_with_timedelta_like(other, op)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 4980, in _evaluate_with_timedelta_like
    other=type(other).__name__))

and

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Traceback (most recent call last):
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodncore/pipeline/handlerbase.py", line 1052, in run
    self.trigger(transition['trigger'])
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 65, in _get_trigger
    return machine.events[trigger_name].trigger(model, *args, **kwargs)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 405, in trigger
    return self.machine._process(func)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 1073, in _process
    return trigger()
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 423, in _trigger
    return self._process(event_data)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 433, in _process
    if trans.execute(event_data):
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 279, in execute
    machine.callback(func, event_data)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/transitions/core.py", line 1031, in callback
    func(*event_data.args, **event_data.kwargs)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodndata/moorings/products_handler.py", line 390, in preprocess
    self._make_hourly_timeseries()
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodndata/moorings/products_handler.py", line 300, in _make_hourly_timeseries
    **self.product_common_kwargs)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodntools/timeseries_products/hourly_timeseries.py", line 507, in hourly_aggregator
    df_temp = PDresample_by_hour(df_temp, function_dict, function_stats)  # do the magic
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/aodntools/timeseries_products/hourly_timeseries.py", line 405, in PDresample_by_hour
    ds_var_mean = ds_var.resample('1H').apply(function_dict[variable]).astype(np.float32)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/pandas/core/generic.py", line 8155, in resample
    base=base, key=on, level=level)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/pandas/core/resample.py", line 1250, in resample
    return tg._get_resampler(obj, kind=kind)
  File "/mnt/ebs/pipeline/lib/python3.5/site-packages/pandas/core/resample.py", line 1380, in _get_resampler
    "but got an instance of %r" % type(ax).__name__)

mphemming commented 3 years ago

Thought this might be a good place to mention that I ran into an error when creating hourly timeseries products locally. Using the latest code on GitHub, I had to change output_dir=args.output_path on line 579 in 'hourly_timeseries.py' to output_dir=args.output_dir for the code to work. Easy fix but worth mentioning.

I also get warnings for function stringtochar() on lines 298 and 299 of 'aggregated_timeseries.py'. The warning suggests using function tobytes() instead.

mhidas commented 3 years ago

Thanks @mphemming - your feedback is welcome. However, these are unrelated to this thread, so I've moved them to separate issues: #135 #136

mhidas commented 2 years ago

The original errors reported above occur under Python 3.5. In Python 3.8, running the code on the same data gives different errors from different parts of the code, e.g. for site SAM7DS:

test_aodntools/timeseries_products/test_hourly_timeseries.py:125 (TestHourlyTimeseriesDebugging.test_typeerror)
self = <xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7f1ab30a1700>
key = (array([], dtype=int64), slice(None, None, None), slice(None, None, None))

    def _getitem(self, key):
        if self.datastore.is_remote:  # pragma: no cover
            getitem = functools.partial(robust_getitem, catch=RuntimeError)
        else:
            getitem = operator.getitem

        try:
            with self.datastore.lock:
                original_array = self.get_array(needs_lock=False)
>               array = getitem(original_array, key)

../../python-aodntools-py38/lib/python3.8/site-packages/xarray/backends/netCDF4_.py:106: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???

src/netCDF4/_netCDF4.pyx:4383: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

count = array([], shape=(0, 1, 1, 3), dtype=int64)

    def _out_array_shape(count):
        """Return the output array shape given the count array created by getStartCountStride"""

        s = list(count.shape[:-1])
        out = []

        for i, n in enumerate(s):
            if n == 1:
>               c = count[..., i].ravel()[0] # All elements should be identical.
E               IndexError: index 0 is out of bounds for axis 0 with size 0

../../python-aodntools-py38/lib/python3.8/site-packages/netCDF4/utils.py:458: IndexError

But also...

During handling of the above exception, another exception occurred:

self = <test_aodntools.timeseries_products.test_hourly_timeseries.TestHourlyTimeseriesDebugging testMethod=test_typeerror>

    def test_typeerror(self):
>       output_file, bad_files = hourly_aggregator(files_to_aggregate=SAM7_LIST,
                                                   site_code='SAM7DS',
                                                   qcflags=(1, 2),
                                                   input_dir=TEST_ROOT,
                                                   output_dir='/tmp'
                                                   )

test_hourly_timeseries.py:127: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../aodntools/timeseries_products/hourly_timeseries.py:413: in hourly_aggregator
    nc_clean = in_water(nc)  # in water only

../../aodntools/timeseries_products/hourly_timeseries.py:79: in in_water
    return nc.where((TIME >= time_deployment_start) & (TIME <= time_deployment_end), drop=True)

...

IndexError: The indexing operation you are attempting to perform is not valid on netCDF4.Variable object. Try loading your data into memory first by calling .load().

../../python-aodntools-py38/lib/python3.8/site-packages/xarray/backends/netCDF4_.py:116: IndexError

mhidas commented 2 years ago

The first error ("Operation sub between float64 and Timedelta is invalid" in Py3.5) only occurs for files where all the data are flagged as bad, which results in trying to process an empty array. The error in Py3.8 happens for a similar reason: all the data are out-of-water, i.e. outside the range set by time_deployment_start and time_deployment_end.
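Both cases come down to resampling a frame with no usable rows. A minimal sketch of a guard is shown here, not the fix actually applied in the repository. The function name `hourly_mean_or_none`, the variable `TEMP` and the sample values are all hypothetical, and a plain mean stands in for the per-variable functions the real code uses.

```python
import pandas as pd


def hourly_mean_or_none(df):
    """Hypothetical guard: return hourly means, or None if there is nothing to resample."""
    # An empty frame, or one whose index is not datetime-like, cannot be resampled.
    if df.empty or not isinstance(df.index, pd.DatetimeIndex):
        return None
    # Passing a Timedelta as the rule avoids version-specific frequency aliases.
    return df.resample(pd.Timedelta(hours=1)).mean()


# Made-up data: three good samples spanning two hourly bins.
good = pd.DataFrame(
    {"TEMP": [10.0, 12.0, 20.0]},
    index=pd.to_datetime(
        ["2020-01-01 00:10", "2020-01-01 00:50", "2020-01-01 01:30"]
    ),
)

# Case 1: every sample was dropped as bad, leaving no datetime index at all.
all_bad = pd.DataFrame({"TEMP": []})

# Case 2: every sample falls outside the (made-up) deployment period.
out_of_water = good[good.index > pd.Timestamp("2021-01-01")]

hourly = hourly_mean_or_none(good)
skipped_bad = hourly_mean_or_none(all_bad)
skipped_dry = hourly_mean_or_none(out_of_water)
```

With a check like this, the caller could add such a file to its list of rejected inputs instead of letting the TypeError or IndexError propagate.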