cjekel / piecewise_linear_fit_py

fit piecewise linear data for a specified number of line segments

Number of Line Segments Needs to Be Specified #17

Open htcml opened 5 years ago

htcml commented 5 years ago

It looks like this library requires the user to specify the number of line segments before fitting a model. This means the library can't be used for automatic trend detection (which requires the program to figure out the right number of line segments itself)? It is not very useful if it requires a user to visually inspect the chart to figure out how many line segments there are.

cjekel commented 5 years ago

While this is not the focus of this library, it is possible to use pwlf to find an optimum number of line segments.

The problem with automatically detecting the number of line segments is that the objective function is no longer as well defined. A number of questions need to be answered: do you introduce a regularizer? Do you introduce a penalty? If you do not have much noise, then without some form of regularization or penalty function you may end up with as many line segments as you have data points. The problem is no longer simply minimizing an L2 norm.

Do you have any suggestions on how to find the optimum number of line segments?


Let me give you an example of how you could use pwlf to find the best number of line segments. This example just uses a simple penalty parameter to penalize the model complexity, i.e. the number of line segments. Essentially we want to find x such that my_obj is minimized. I then go on to minimize this function using Bayesian optimization (or EGO). To see how this is done, check out the newly added example.

# define your objective function
# x is a 2d array of candidate segment counts (one candidate per row),
# as supplied by the optimizer; my_pwlf and y are assumed to exist
def my_obj(x):
    # define some penalty parameter l
    # you'll have to arbitrarily pick this
    # it depends upon the noise in your data,
    # and the value of your sum of squares of residuals
    l = y.mean() * 0.001
    f = np.zeros(x.shape[0])
    for i, j in enumerate(x):
        my_pwlf.fit(j[0])                 # fit with j[0] line segments
        f[i] = my_pwlf.ssr + l * j[0]     # penalized sum of squares
    return f
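
For the minimization step, GPyOpt's Bayesian optimization accepts an objective of exactly this form (a 2d array in, an array of objective values out). A minimal sketch along those lines; the bounds and options here are placeholders and may differ from the linked example:

import numpy as np
import GPyOpt

# search over a discrete range of candidate segment counts
bounds = [{'name': 'n_segments', 'type': 'discrete',
           'domain': np.arange(2, 40)}]
optimizer = GPyOpt.methods.BayesianOptimization(my_obj, domain=bounds,
                                                initial_design_numdata=10)
optimizer.run_optimization(max_iter=30)
print('best number of segments:', optimizer.x_opt)
print('penalized objective value:', optimizer.fx_opt)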
cjekel commented 5 years ago

Would it make sense in your application that breakpoint locations can only occur at a data point? That could greatly reduce the time of the inner optimization.

You could also try using the fitfast() function instead of the fit() function within my_obj.
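
As a rough illustration of the data-point idea, here is a brute-force sketch using pwlf's fit_with_breaks, which does a least squares fit for fixed breakpoint locations. It assumes x (sorted), y, and n_segments are already defined, and the combinatorial search is only practical when both are small:

from itertools import combinations
import numpy as np
import pwlf

my_pwlf = pwlf.PiecewiseLinFit(x, y)
best_ssr, best_breaks = np.inf, None
# interior breakpoints may only fall on existing data locations
for interior in combinations(x[1:-1], n_segments - 1):
    breaks = np.concatenate(([x.min()], interior, [x.max()]))
    my_pwlf.fit_with_breaks(breaks)  # least squares with fixed breaks
    if my_pwlf.ssr < best_ssr:
        best_ssr, best_breaks = my_pwlf.ssr, breaks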

cjekel commented 5 years ago

I want to keep this open, because depending upon your problem there may be easier ways to find the best number of line segments.

If you don't have noise in your data, it may be possible to detect the number of line segments from a simple least squares fit. I try to show that in the first example here (this requires the 0.4.0 version in development).

hyperactivetimes commented 3 years ago

> While this is not the focus of this library, it is possible to use pwlf to find an optimum number of line segments. […]
>
> Let me give you an example of how you could use pwlf to find the best number of line segments. […]

Hello there! I'm currently working on a problem that requires this package, and determining the optimal number of line segments in advance is a must. While searching the net I stumbled upon this piece of code, whose underlying idea I assume to be a tradeoff between the SSR and the number of line segments. However, j is a scalar and cannot be indexed, so I got an error. I was wondering if I could replace the j[0] with i? On top of that, for i = 1 I receive another error.

cjekel commented 3 years ago

The short answer is that x in the above case was expected to be a 2d array.

Just do:

def my_obj(x):
    # x is integer, returns float
    l = y.mean()*0.001
    my_pwlf.fit(x)
    f = my_pwlf.ssr + (l*x)
    return f
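
With this scalar version, an exhaustive search is often enough when the candidate range is small (illustration only; the range here is arbitrary):

# evaluate the penalized objective for every candidate segment count
best_n = min(range(2, 11), key=my_obj)
print('best number of segments:', best_n)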

For a longer answer: I've been working on a better way to do this.

How many data points do you have in your application? What is the upper limit of line segments you are searching for?

hyperactivetimes commented 3 years ago

> The short answer is that x in the above case was expected to be a 2d array. […]

Thanks for the fast response. The short version solved the issue. I have 25 data points sampled from an a/x curve, in which a is a constant and x is the variable. The fewer line segments I have, the better for my subsequent intentions.

cjekel commented 3 years ago

@hyperactivetimes Another approach I have been playing around with, which would work well in your case, is to compare the model's prediction variance to an approximate variance estimated directly from the data. Ideally this ratio will be 1.0.

This is a proof of concept that manually tries 2 segments, then 3, then 4, and so on, until the ratio exceeds 1.0.

import numpy as np
import matplotlib.pyplot as plt
import pwlf

def calcGSJ(data):
    """
    Estimate the variance in data using the GSJ method
    """
    x = data[:, 0]
    y = data[:, 1]
    n = len(x)
    eTilde = np.zeros(n - 2)
    ci2 = np.zeros(n - 2)
    # GSJ pseudo-residuals and weights at each interior point
    for i in range(1, n-1):
        di = x[i+1] - x[i-1]  # spacing across the two neighbors
        if di == 0:
            ai = 0.5
        else:
            ai = (x[i+1] - x[i]) / di
        bi = 1.0 - ai
        ci2[i-1] = 1.0 / (ai**2 + bi**2 + 1.0)
        eTilde[i-1] = ai*y[i-1] + bi*y[i+1] - y[i]
    sigmaHat2 = (1.0/(n-2))*np.dot(ci2.T, eTilde**2)
    print('GSJ: ', sigmaHat2)
    return sigmaHat2

np.random.seed(1231)
# generate sin wave data
x = np.linspace(0, 10, num=100)
y = np.sin(x * np.pi / 2)
# add noise to the data
y = np.random.normal(0, 0.05, 100) + y

data = np.vstack((x, y)).T

sigma_hat = calcGSJ(data)

# initialize piecewise linear fit with your x and y data
my_pwlf = pwlf.PiecewiseLinFit(x, y)

keep_going = True
n_segment = 2
results = {}
while keep_going:
    _ = my_pwlf.fitfast(n_segment, pop=10)
    # number of model parameters: betas plus interior breakpoints
    nb = my_pwlf.beta.size + my_pwlf.fit_breaks.size - 2
    # unbiased estimate of the model variance
    sigma = my_pwlf.ssr / (my_pwlf.n_data - nb)
    F = sigma_hat / sigma
    results[n_segment] = F
    if F > 1.0:
        keep_going = False
        print(f'best result found with n_segment={n_segment - 1}')
        n_segment -= 1
    else:
        n_segment += 1

# perform the fit again
_ = my_pwlf.fitfast(n_segment, pop=10)
# predict for the determined points
xHat = np.linspace(min(x), max(x), num=10000)
yHat = my_pwlf.predict(xHat)

# plot the results
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xHat, yHat, '-', label=f'pwlf {n_segment}')
_ = my_pwlf.fitfast(n_segment+1, pop=10)
yHat = my_pwlf.predict(xHat)
plt.plot(xHat, yHat, '-', label=f'pwlf {n_segment+1}')
plt.legend()
plt.show()

I would love to hear if this works better or worse in your application. There could be better ways to do the search for the magic ratio (binary search, golden-section search, quadratic fits?).
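
For what it's worth, a bisection over the segment count might look like the sketch below, reusing my_pwlf and sigma_hat from the script above. It assumes F is roughly monotone in n_segment; since fitfast is stochastic, the measured ratio is noisy, so treat this as an idea rather than a drop-in replacement:

def variance_ratio(n_segment):
    """F = sigma_hat / sigma for a given number of segments."""
    my_pwlf.fitfast(n_segment, pop=10)
    nb = my_pwlf.beta.size + my_pwlf.fit_breaks.size - 2
    sigma = my_pwlf.ssr / (my_pwlf.n_data - nb)
    return sigma_hat / sigma

# find the largest segment count whose ratio is still <= 1.0
lo, hi = 2, 20
while lo < hi:
    mid = (lo + hi + 1) // 2
    if variance_ratio(mid) <= 1.0:
        lo = mid
    else:
        hi = mid - 1
print(f'best result found with n_segment={lo}')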

thunderbug1 commented 2 years ago

A veeery simplistic approach to this problem is to fit more and more lines to the data until adding another line does not give any meaningful improvement. Maybe not the fastest approach, but it works well enough for my application.

import numpy as np
import pwlf

# series is assumed to be a pandas Series holding the signal
x = series.index.to_numpy()
y = series.to_numpy()
my_pwlf = pwlf.PiecewiseLinFit(x, y)
max_line_count = 10
min_percent_improvement_to_add_line = 80

last_ssr = None
best_line_count = None
for i in range(2, max_line_count):
    # res = my_pwlf.fit(i)
    res = my_pwlf.fitfast(i, pop=3)  # runs faster than fit but does not guarantee optimal solution
    ssr = my_pwlf.ssr
    if last_ssr is not None:
        improvement_percent = (last_ssr - ssr) / last_ssr * 100
        if improvement_percent < min_percent_improvement_to_add_line:
            break
    last_ssr = ssr
    best_line_count = i
print(f"best line_count: {best_line_count}")
# refit with the best count, since the loop exits with my_pwlf
# holding the last (rejected) fit
my_pwlf.fitfast(best_line_count, pop=3)
xhat = np.linspace(x.min(), x.max(), len(x))
yhat = my_pwlf.predict(xhat)
cjekel commented 2 years ago

@thunderbug1 Thanks for sharing this!

One thing to note: as the number of line segments approaches the number of data points, the SSR will approach zero.
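
One common guard against that is an information criterion that grows with the parameter count. A sketch of a BIC-style score built from pwlf's fitted attributes (this is not part of pwlf itself, and it assumes ssr > 0):

import numpy as np

def bic(my_pwlf):
    """Gaussian BIC: n*ln(SSR/n) + k*ln(n); lower is better."""
    n = my_pwlf.n_data
    # count betas plus the interior breakpoint locations as parameters
    k = my_pwlf.beta.size + my_pwlf.fit_breaks.size - 2
    return n * np.log(my_pwlf.ssr / n) + k * np.log(n)

Choosing the segment count that minimizes such a score keeps the SSR-goes-to-zero limit from winning automatically.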

thunderbug1 commented 2 years ago

@cjekel That's very true. From what I have seen so far, the code stops at a reasonable number of lines. One could even set max_line_count dynamically if the signal length/type varies, to avoid overfitting in edge cases.
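
For example, a hypothetical length-based cap (the constants here are arbitrary):

# scale the allowed number of lines with the signal length
max_line_count = min(15, max(3, len(x) // 10))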