biolab / orange3-timeseries

🍊 :chart_with_upwards_trend: Orange add-on for analyzing, visualizing, manipulating, and forecasting time series data.
Other
62 stars 41 forks source link

Unable to interpolate Timeseries with nan-values in first Column #165

Closed HannesLum closed 2 years ago

HannesLum commented 3 years ago
Timeseries version

0.3.10

Orange version

3.27.1

Expected behavior

When using the Interpolate-Widget as can be seen in debugging_owinterpolate.zip with a table like dt_interp.xlsx, the interpolation should work regardless of which column is used in "As Timeseries".

Actual behavior

The Interpolation-Widget will fail when there is a variable with datatype "datetime" in the first column and it contains nan-values. When trying to create a new Table via Timeseries.from_numpy(), it has an expection when it fails to convert the nan-value from float to int, as int has no nan-values. The exception occurs (after removing the try-block which catches all exceptions in owinterpolate.py at line 69) as can be seen in interp_exception.txt.

New code at owinterpolate.py line 69 to reproduce the exception: self.Outputs.interpolated.send(data.interp())

Using "debugging_owinterpolate.ows" with the "dt_interp.xlsx", I came up with the following results: During the interpolation-process when using the colums "Other Timestamp" and "Full Timestamp" the Timeseries.time_delta gets set to None. The reason for this is, because data.copy() is used at line 66: data = data.copy()

My guess is, that the @setter annotations for Timeseries.time_variable are triggered and in turn time_delta is somehow set to None. Resetting the time_variable a second time like so data.time_variable = data.domain[self.data.time_variable.name] immediately after line 66 fixes the bug.

Second Issue I came across, while trying to solve the bug:

there seems to be a problem within owtabletotimeseries or timeseries.from_data_table(). When using "instance order" as time_variable both time_variable and time_delta are None, meaning they will be reset in Timeseries.from_data_table() because of the first "if" inside the function. This should not happen when you want to interpolate. (Are there cases where you want that to happen? If not, I think this can be seen as a model-Problem regarding what qualifies as a Timeseries and what not.)

Fixes: In the case of timeseries being defined by a sequence maybe explicitly using Timeseries.make_timeseries_from_sequence() helps or remove the second and third clause from the first "if" in Timeseries.from_data_table(). After all, the Timeseries with a sequence is defined by not setting the time_variable, so Timeseries.from_data_table() should also see it as such as well.

Third Issue I came across: I could be wrong, but in timeseries.py line 139 table = table[~nans] schould be replaced by ts = ts[~nans] Otherwise the line itself would have no use whatsoever, as table is never used again.

Fourth Issue I came across: In Timeseries.from_data_table() the table gets sorted when Timeseries.make_timeseries_from_continuous_var() is called but not, when a standard-column is used as time-variable. What I want to say is, there should be a sorting algorithm added to the part where the nans are filtered out beginning from line 133 until 138 in timeseries.py, otherwise the behaviour of the program would become arbitrary. An example code might look like this ( Idid not test the code, I mostly copied it from make_timeseries_from_continuous_var()):

values = table.get_column_view(time_variable)[0]
if np.issubdtype(values.dtype, np.number):
    # Filter out NaNs
    nans = np.isnan(values)
    if nans.any():
        ts = ts[~nans]
        values = values[~nans]
    ordered = np.argsort(values)
    if (ordered != np.arange(len(ordered))).any():
        table = table[ordered]
HannesLum commented 3 years ago

Design-Wise it might be a good idea to override data.copy() in Timeseries so that the issue won't resurface at other places where it is used.

janezd commented 2 years ago

Note to whomever handles this: consider eliminating class TimeSeries (consult me or, even better, @markotoplak).

Even if the class remains, refactor interpolation methods that are currently in TimeSeries class into standalone functions or classes.

janezd commented 2 years ago

The third issue is (being) (trivially) fixed in #240.

janezd commented 2 years ago

@HannesLum, this is an exemplary report that seemed to have received no reaction. I'm sorry for that. The problem is that this add-on did not (and still does not) have an active maintainer.

As your report indicates, Timeseries.from_data_table is a mess and a source of this and other problems. This is basically because of two commits. 9cc0ba4707123c839e21c36c9565136d5efb913a, and 464ea9e30b33bb8cac0af7624733cedeca9f1691 adds to that by modifying the table it got as argument(!).

9cc0ba4707123c839e21c36c9565136d5efb913a 's intentions was, I believe, to allow passing time_variable as argument to all named constructors. It added a bug (not actually removing missing values) and a feature - sorting by values of variable, but only if variable is not a time variable. Both are trivial to fix by simply forwarding the call to make_timeseries_from_continuous_var, as you essentially did by copying the code.

464ea9e30b33bb8cac0af7624733cedeca9f1691 patched a problem when a table constructed from timeseries by selecting a subset of columns that do not include time_variable. The patch modifies the table it received as an argument(!!!), which, huh, appears to work in the context of the particular bug, but may do a lot of collateral damage. The proper fix needs to be done in (at least) __getitem__. You also wrote we need to reimplement copy. God knows how many other methods, too.

In #240 I changed Timeseries.from_data_table to

    @classmethod
    def from_data_table(cls, table, time_attr=None):
        if isinstance(table, Timeseries) and time_attr is table.time_variable:
            return table

        if time_attr is not None:
            if time_attr not in table.domain:
                raise Exception(time_attr.name + ' is not in the domain.')
            if not time_attr.is_continuous:
                raise Exception(time_attr.name + ' must be continuous.')
        else:
            for time_attr in chain(table.domain.attributes, table.domain.metas):
                if time_attr.is_time:
                    break
            else:
                return super(Timeseries, cls).from_table(table.domain, table)

        return cls.make_timeseries_from_continuous_var(table, time_attr)

I believe this does all that the previous function intended to (but properly).

(Locally, ) unit tests pass, though unit tests for this add-on also leave much to be desired.

But some day the entire class Timeseries needs to be rethought.

HannesLum commented 2 years ago

@janezd Thank you for your work. I was able to make do with my own fixes for the time being, so there was no need to hurry anyways, so don't worry about it.

And well, as you said Timeseries does feel a little messy and unfinished when you go through the code, so I wish you all the best for when you ever get around to reworking it. Hopefully, it will not require the entire add-on to be reworked, but it's not like i can decide something like that.