aeon-toolkit / aeon

A toolkit for machine learning from time series
https://aeon-toolkit.org/
BSD 3-Clause "New" or "Revised" License
909 stars 96 forks

[ENH] ShapeletTransform: binary ig calculation problem #1322

Open zjeqw opened 4 months ago

zjeqw commented 4 months ago

Describe the bug

The current _calc_binary_ig() evaluates split points that fall between data points with the same feature value but different labels. No threshold on the feature can produce such a split, so the result can be misleading for datasets that contain many such points.

Steps/Code to reproduce the bug

    from aeon.transformations.collection.shapelet_based._shapelet_transform import _calc_binary_ig

    orderline = [(2, -1), (2, -1), (2, 1), (3, 1), (3, 1)]
    c1, c2 = 3, 2
    _calc_binary_ig(orderline, c1, c2)

Expected results

0.42

Actual results

0.97
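
Both figures can be reproduced by hand. The sketch below is plain Python, not the aeon function; the helper name binary_entropy is an assumption introduced for illustration. It computes the information gain for the split inside the run of tied values (distance 2) and for the split between distances 2 and 3.

```python
from math import log2


def binary_entropy(a, b):
    """Shannon entropy in bits of a two-class count pair (hypothetical helper)."""
    total = a + b
    ent = 0.0
    for count in (a, b):
        if count:
            p = count / total
            ent -= p * log2(p)
    return ent


# Parent node: 3 items of the target class (+1), 2 of the other class (-1).
parent = binary_entropy(3, 2)

# Split inside the tied run: left = two -1 items, right = three +1 items.
ig_tied = parent - (2 / 5) * binary_entropy(0, 2) - (3 / 5) * binary_entropy(3, 0)

# Split between distances 2 and 3: left = one +1 and two -1, right = two +1.
ig_boundary = parent - (3 / 5) * binary_entropy(1, 2) - (2 / 5) * binary_entropy(2, 0)

print(round(ig_tied, 2), round(ig_boundary, 2))
```

The first value corresponds to the actual result (0.97) and the second to the expected result (0.42).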

Versions

System:
    python: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
    executable: c:\xxx\python.exe
    machine: Windows-10-10.0.19041-SP0

Python dependencies:
    pip: 22.3.1
    setuptools: 57.4.0
    scikit-learn: 1.4.0
    aeon: 0.7.1
    statsmodels: None
    numpy: 1.24.0
    scipy: 1.10.1
    pandas: 2.0.3
    matplotlib: 3.5.0
    joblib: 1.3.2
    numba: 0.58.1
    pmdarima: None
    tsfresh: None

TonyBagnall commented 4 months ago

Thanks for this, we will take a look next week.

TonyBagnall commented 2 months ago

Next week became next month, sorry about that.

I don't think this really constitutes a bug; it's true to the algorithm.

I guess for the above you are recommending ignoring splits such as [(2,-1),(2,-1)] [(2,1),(3,1),(3,1)], so we would then:

evaluate (default split): [ ] [(2,-1),(2,-1),(2,1),(3,1),(3,1)]
skip: [(2,-1)] [(2,-1),(2,1),(3,1),(3,1)] (split == 0, I think, by the logic)
skip: [(2,-1),(2,-1)] [(2,1),(3,1),(3,1)] (split == 1)

then continue with [(2,-1),(2,-1),(2,1)] [(3,1),(3,1)] (split == 2)
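
The enumeration above can be checked mechanically. This is a small sketch, assuming that a split after position `split` puts items 0..split on the left and that a split is skipped when the last distance on the left equals the first distance on the right. The variable name valid_splits is introduced here for illustration.

```python
orderline = [(2, -1), (2, -1), (2, 1), (3, 1), (3, 1)]

valid_splits = []
for split in range(len(orderline) - 1):
    left = orderline[: split + 1]
    right = orderline[split + 1 :]
    # A split is realisable only if the distances on either side differ.
    tied = left[-1][0] == right[0][0]
    if not tied:
        valid_splits.append(split)
    print(split, "skip" if tied else "evaluate", left, right)

print(valid_splits)
```

Under these assumptions only split == 2 separates two different distances in this orderline.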

I can enforce this:

    # evaluate each split point
    for split in range(len(orderline)):
        next_class = orderline[split][1]  # +1 if this class, -1 if other
        # Check here that the distance is different to the next one
        if split == 0 and orderline[split][0] == orderline[split+1][0]:
            continue
        elif orderline[split][0] == orderline[split-1][0]:
            continue

I need to double-check the logic, as the handling of the first item is a bit confusing, but this gives me an IG of 0.770950, not 0.42.
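
A minimal sketch of one way the tie check could be arranged so that the example returns 0.42. It rests on two assumptions: the skip test has to come after the class counts are updated, and it has to compare the current distance with the next one, since a split after position `split` separates it from position `split + 1`. The names binary_entropy and calc_binary_ig_skip_ties are hypothetical, and this is plain Python rather than the aeon function itself.

```python
from math import log2


def binary_entropy(c1, c2):
    """Shannon entropy in bits of a two-class count pair (hypothetical helper)."""
    total = c1 + c2
    ent = 0.0
    for count in (c1, c2):
        if count:
            p = count / total
            ent -= p * log2(p)
    return ent


def calc_binary_ig_skip_ties(orderline, c1, c2):
    """Best binary information gain, ignoring splits between tied distances.

    Illustrative variant only; `orderline` is assumed sorted by distance and
    holds (distance, label) pairs with label +1 for this class, -1 otherwise.
    """
    initial_ent = binary_entropy(c1, c2)
    total_all = c1 + c2
    bsf_ig = 0.0
    c1_count = 0
    c2_count = 0

    for split in range(len(orderline)):
        # Always update the counts, even if this split point is skipped.
        if orderline[split][1] > 0:
            c1_count += 1
        else:
            c2_count += 1

        # Skip when the next item has the same distance: no threshold can
        # place the two items on different sides.
        is_last = split + 1 == len(orderline)
        if not is_last and orderline[split][0] == orderline[split + 1][0]:
            continue

        left_prop = (split + 1) / total_all
        ent_left = binary_entropy(c1_count, c2_count)
        ent_right = binary_entropy(c1 - c1_count, c2 - c2_count)
        ig = initial_ent - left_prop * ent_left - (1 - left_prop) * ent_right
        bsf_ig = max(bsf_ig, ig)

    return bsf_ig


orderline = [(2, -1), (2, -1), (2, 1), (3, 1), (3, 1)]
print(round(calc_binary_ig_skip_ties(orderline, 3, 2), 2))
```

On the orderline from the report, this evaluates only the boundary between distances 2 and 3 (plus the trivial final split) and returns about 0.42. The 0.770950 figure appears consistent with skipping before the counts are updated and comparing against the previous item. In that case only split == 3 is evaluated, with one positive counted on the left and a 2/2 mix on the right, giving roughly 0.971 - 0.2 = 0.771.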