Problem in seperateData when the breaks are spot on dataX

cjekel / piecewise_linear_fit_py

fit piecewise linear data for a specified number of line segments

MIT License

301 stars 60 forks source link

Problem in seperateData when the breaks are spot on dataX #1

Closed half1red closed 6 years ago

half1red commented 6 years ago

Hi,

Great lib !

I think I have spotted an issue with it however. It seems that when the breaks are spot on the xdata the break point belongs at the same time in two arrays of sepData. Thus it increases the size of yHat in fitWithBreaks and an error is raised when the function calculate the sum of the square of residuals e = self.yData-yHat

My guess would be to change either the bTest or aTest in separateData to a strict inequality.

I am not sure I am being very clear, my english is a little bit rusty.

Have a nice day,

cjekel commented 6 years ago

I'm glad you like the library and thanks for spotting that something is wrong.

If you can give me a short code example so that I can reproduce the error, it would make my troubleshooting a bit easier. I'll see if I can fix this.

Thanks, CJ

cjekel commented 6 years ago

I think I can replicate the error with the following code

import numpy as np
import pwlf

x = np.array((0.0,1.0,2.0))
y = np.array((0.0,1.0,1.5))

my_fit = pwlf.piecewise_lin_fit(x,y)
x0 = x.copy()
x0[1] = 1.
ssr = my_fit.fitWithBreaks(x0)

Error:

     90 
     91         #   calculate the sum of the square of residuals
---> 92         e = self.yData-yHat
     93         SSr = np.dot(e.T,e)
     94 
ValueError: operands could not be broadcast together with shapes (3,) (4,)

If you change x0[1] to something other than 1.0 like 1.001 it works...

Nice find I'll try to fix

half1red commented 6 years ago

Looks like the error I found.

I try to add a piece of code in fitWithBreaks, just before e = self.yData-yHat in order to get rid of the duplicates. It probably could be more synthetic. It seems to work in my situation.

    double_ind = []
    last_x = [arr[-1] for arr in sepDataX[:-1]]
    first_x = [arr[0] for arr in sepDataX[1:]]
    indices = []
    count = -1
    for arr in sepDataX:
        count += arr.size
        indices.append(count)
    for i, (x1, x2) in enumerate(zip(first_x, last_x)):
        if x1 == x2:
            double_ind.append(indices[i])
    yHat = np.delete(yHat, double_ind)

cjekel commented 6 years ago

Do the changes in https://github.com/cjekel/piecewise_linear_fit_py/commit/9d2cd3ac656c64ecf47b11ffe74097ab149d6f4b fix your issue?

I think you were correct in your original post about aTest, bTest should be a strict greater than or less than.

I'd rather not search for duplicates manually as this may be computationally expensive, but your code should work.

half1red commented 6 years ago

Yes, that seems to correct my issue. I have seen that you also corrected seperateDataX. That's nice, because it was causing trouble to predict in the same situation with breaks on data points.

Thank you very much for your quick answers!

cjekel commented 6 years ago

Thanks for pointing this out! If you run into further issues please let me know.