htcml opened this issue 5 years ago
While this is not the focus of this library, it is possible to use pwlf to find an optimum number of line segments.
The problem with automatically detecting the number of line segments is that the objective function is not as well defined. Many questions need to be answered: do you introduce a regularizer? Do you introduce a penalty? If you do not have a lot of noise, it is likely that without some form of regularization or penalty function you will end up with as many line segments as you have data points. The problem is no longer simply minimizing an L2 norm.
Do you have any suggestions on how to find the optimum number of line segments?
Let me give you an example of how you could use pwlf to find the best number of line segments. This example just uses a simple penalty parameter to penalize the model complexity, i.e. the number of line segments. Essentially we want to find the x that minimizes my_obj. I then go on to minimize this function using Bayesian optimization (or EGO). To see how this is done, check out the newly added example.
```python
# define your objective function
def my_obj(x):
    # define some penalty parameter l
    # you'll have to arbitrarily pick this
    # it depends upon the noise in your data,
    # and the value of your sum of squares of residuals
    l = y.mean() * 0.001
    f = np.zeros(x.shape[0])
    for i, j in enumerate(x):
        my_pwlf.fit(j[0])
        f[i] = my_pwlf.ssr + (l * j[0])
    return f
```
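To make the penalized-objective idea concrete without depending on the EGO example, here is a self-contained numpy sketch. It is not pwlf: the helper `pwl_ssr` fits a continuous piecewise-linear model with evenly spaced, *fixed* breakpoints via a hinge basis (pwlf also optimizes the breakpoint locations), but the selection loop is the same SSR + l * n tradeoff as my_obj above.

```python
import numpy as np

def pwl_ssr(x, y, n_segments):
    """Sum of squared residuals of a least-squares continuous
    piecewise-linear fit with evenly spaced, fixed breakpoints.
    (pwlf additionally optimizes the breakpoint locations.)"""
    breaks = np.linspace(x.min(), x.max(), n_segments + 1)
    # hinge basis: intercept, slope, plus a slope change at each interior break
    A = np.column_stack([np.ones_like(x), x]
                        + [np.maximum(0.0, x - b) for b in breaks[1:-1]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = np.abs(x - 4.0) + rng.normal(0.0, 0.1, x.size)  # true model has 2 slopes

# penalized objective: SSR plus l times the number of segments
l = y.mean() * 0.001  # arbitrary penalty weight, as in the thread
scores = {n: pwl_ssr(x, y, n) + l * n for n in range(1, 9)}
best = min(scores, key=scores.get)
```

Because the breaks here are fixed, the true kink at x = 4 is only captured when a break happens to land on it; with pwlf the inner fit would move a break there itself. The penalty weight l has to be tuned to your noise level, exactly as the comments in my_obj warn.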
Would it make sense in your application that breakpoint locations can only occur at a data point? That could greatly reduce the time of the inner optimization.
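A hedged sketch of that restriction: if breakpoints may only sit on data points, the inner continuous optimization collapses to a finite search over combinations of data locations. The hinge-basis least-squares helper below is a self-contained stand-in for a pwlf fit with prescribed breaks, not pwlf itself.

```python
import numpy as np
from itertools import combinations

def ssr_with_breaks(x, y, breaks):
    """SSR of a continuous piecewise-linear least-squares fit
    with the given interior breakpoint locations (hinge basis)."""
    A = np.column_stack([np.ones_like(x), x]
                        + [np.maximum(0.0, x - b) for b in breaks])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

x = np.linspace(0.0, 10.0, 21)   # data points every 0.5
y = np.abs(x - 4.0)              # one kink, exactly on a data point

# candidate breaks restricted to interior data points: 19 small fits
# instead of a continuous search over breakpoint locations
best = min(combinations(x[1:-1], 1),
           key=lambda b: ssr_with_breaks(x, y, b))
```

Here the search recovers the kink at x = 4.0 exactly because it coincides with a data point; for more segments the number of combinations grows quickly, so this mainly pays off on small problems.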
You could also try using the fitfast() function instead of the fit() function within my_obj.
I want to open this up, because depending upon your problem there may be easier ways to find the best number of line segments.
If you don't have noise in your data, it may be possible to detect the number of line segments from a simple least squares fit. I try to show that in the first example here (this requires the 0.4.0 version in development).
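As a hedged illustration of that noise-free idea (not the linked example itself): on a uniform grid, a noise-free piecewise-linear signal has zero second differences everywhere except at the interior breakpoints, so counting the nonzero entries counts the segments.

```python
import numpy as np

# noise-free piecewise-linear data on a uniform grid: three segments
x = np.linspace(0.0, 10.0, 101)
y = np.piecewise(x,
                 [x < 3, (x >= 3) & (x < 7), x >= 7],
                 [lambda t: t,                 # slope 1
                  lambda t: 3 + 2 * (t - 3),   # slope 2
                  lambda t: 11 - (t - 7)])     # slope -1

# second differences vanish on straight stretches and spike at the kinks
d2 = np.diff(y, 2)
kink_idx = np.flatnonzero(np.abs(d2) > 1e-8) + 1
n_segments = len(kink_idx) + 1
print(n_segments)  # 3 (kinks at x = 3 and x = 7)
```

With noise this breaks down immediately, which is why the noisy case needs a variance estimate or a penalty as discussed above.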
hello there! I'm currently working on a problem that requires this package, and determining the optimal number of line segments in advance is a must. While searching the net I stumbled upon this piece of code, whose underlying idea I assume to be a tradeoff between the SSR and the number of line segments. However, j is a scalar and cannot be indexed, so I got an error. I was wondering if I could replace j[0] with i? On top of that, for i = 1 I receive another error.
The short answer is that x in the above case was expected to be a 2D array. Just do:
```python
def my_obj(x):
    # x is an integer, returns a float
    l = y.mean() * 0.001
    my_pwlf.fit(x)
    f = my_pwlf.ssr + (l * x)
    return f
```
For a longer answer: I've been working on a better way to do this.
How many data points do you have in your application? What is the upper limit of line segments you are searching for?
thanks for the fast response. The short version solved the issue. I have 25 data points sampled from an a/x curve, in which a is a constant and x is the variable. The fewer line segments I have, the better for my following intentions.
@hyperactivetimes Another approach I have been playing around with, which would work well in your case, is to compare the model's prediction variance to an approximate variance estimated from the data. Ideally this ratio will be 1.0.
This is a proof of concept that manually tries 2 segments, then 3, then 4, until the ratio is broken.
```python
import numpy as np
import matplotlib.pyplot as plt

import pwlf


def calcGSJ(data):
    """
    Estimate the variance in data using the GSJ method
    """
    x = data[:, 0]
    y = data[:, 1]
    n = len(x)
    eTilde = np.zeros(n - 2)
    ci2 = np.zeros(n - 2)
    # GSJ method: weights come from the spacing of each interior
    # point's neighbors (note di spans x[i-1] to x[i+1])
    for i in range(1, n - 1):
        di = x[i+1] - x[i-1]
        if di == 0:
            ai = 0.5
        else:
            ai = (x[i+1] - x[i]) / di
        bi = 1.0 - ai
        ci2[i-1] = 1.0 / (ai**2 + bi**2 + 1.0)
        eTilde[i-1] = ai*y[i-1] + bi*y[i+1] - y[i]
    sigmaHat2 = (1.0/(n-2)) * np.dot(ci2.T, eTilde**2)
    print('GSJ: ', sigmaHat2)
    return sigmaHat2


np.random.seed(1231)
# generate sin wave data
x = np.linspace(0, 10, num=100)
y = np.sin(x * np.pi / 2)
# add noise to the data
y = np.random.normal(0, 0.05, 100) + y
data = np.vstack((x, y)).T
sigma_hat = calcGSJ(data)

# initialize piecewise linear fit with your x and y data
my_pwlf = pwlf.PiecewiseLinFit(x, y)

keep_going = True
n_segment = 2
results = {}
while keep_going:
    _ = my_pwlf.fitfast(n_segment, pop=10)
    nb = my_pwlf.beta.size + my_pwlf.fit_breaks.size - 2
    sigma = my_pwlf.ssr / (my_pwlf.n_data - nb)
    F = sigma_hat / sigma
    results[n_segment] = F
    if F > 1.0:
        keep_going = False
        print(f'best result found with n_segment={n_segment - 1}')
        n_segment -= 1
    else:
        n_segment += 1

# perform the fit again
_ = my_pwlf.fitfast(n_segment, pop=10)

# predict for the determined points
xHat = np.linspace(min(x), max(x), num=10000)
yHat = my_pwlf.predict(xHat)

# plot the results
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xHat, yHat, '-', label=f'pwlf {n_segment}')
_ = my_pwlf.fitfast(n_segment + 1, pop=10)
yHat = my_pwlf.predict(xHat)
plt.plot(xHat, yHat, '-', label=f'pwlf {n_segment + 1}')
plt.legend()
plt.show()
```
I would love to hear whether this works better or worse in your application. There could be better ways to search for the magic ratio (binary search, golden section, quadratic fits?).
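For instance, if the ratio F grows roughly monotonically with the number of segments, the linear while-loop above could be swapped for a binary search over segment counts. A sketch, where `ratio(n)` is assumed to wrap the fitfast + GSJ computation from the script above:

```python
def smallest_passing_n(ratio, lo=2, hi=64):
    """Binary search for the smallest segment count in [lo, hi] whose
    variance ratio exceeds 1.0, assuming ratio() is roughly increasing.
    Returns hi if no count in the range passes."""
    while lo < hi:
        mid = (lo + hi) // 2
        if ratio(mid) > 1.0:
            hi = mid  # mid passes; a smaller passing count may exist
        else:
            lo = mid + 1  # mid fails; the answer is above mid
    return lo
```

This needs O(log n) fits instead of O(n), but the monotonicity assumption can fail when fitfast lands in a local optimum, so re-checking the counts adjacent to the returned n is cheap insurance.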
A veeery simplistic approach to this problem is to fit more and more lines to the data until adding more lines does not give any meaningful improvement. Maybe not the fastest approach, but it works well enough for my application.
```python
x = series.index.to_numpy()
y = series.to_numpy()

my_pwlf = pwlf.PiecewiseLinFit(x, y)

max_line_count = 10
min_percent_improvement_to_add_line = 80

last_ssr = None
best_line_count = 2
for i in range(2, max_line_count):
    # res = my_pwlf.fit(i)
    res = my_pwlf.fitfast(i, pop=3)  # faster than fit() but does not guarantee an optimal solution
    ssr = my_pwlf.ssr
    if last_ssr is not None:
        improvement_percent = (last_ssr - ssr) / last_ssr * 100
        if improvement_percent < min_percent_improvement_to_add_line:
            break
    last_ssr = ssr
    best_line_count = i

print(f"best line_count: {best_line_count}")
# refit with the best count: my_pwlf only keeps its most recent fit,
# so holding a reference to it inside the loop would alias the
# (over-fit by one segment) final state
my_pwlf.fitfast(best_line_count, pop=3)
xhat = np.linspace(x.min(), x.max(), len(x))
yhat = my_pwlf.predict(xhat)
```
@thunderbug1 Thanks for sharing this!
One thing to note: as the number of line segments approaches the number of data points, the SSR will approach zero.
@cjekel That's very true. From what I have seen so far, the code stops at a reasonable number of lines. One could even set max_line_count dynamically if the signal length/type varies, to avoid overfitting in edge cases.
It looks like this library requires the user to specify the number of line segments before fitting a model. Does this mean it can't be used for automatic trend detection (which requires the program to figure out the right number of line segments itself)? It is not very useful if it requires a user to visually inspect the chart to figure out how many line segments there are.