Breakpoints found in DataFrame but not in NumpyArray with same data

deepcharles / ruptures

ruptures: change point detection in Python

BSD 2-Clause "Simplified" License


Breakpoints found in DataFrame but not in NumpyArray with same data #318

Closed jdkworld closed 6 months ago

jdkworld commented 6 months ago

I have this signal. When I pass it to Binseg as a pandas DataFrame, I get the correct breakpoints, but when I pass it as a NumPy array, no breakpoints are found. Am I missing something? Why is the behaviour different? Could it be due to how NaNs are handled in each case? Also, when my DataFrame has two identical columns, no breakpoints are found.

signal.csv

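import matplotlib.pyplot as plt
import pandas as pd
import ruptures as rpt

# single-column signal, no header row
dataframe = pd.read_csv('signal.csv', header=None)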

# as dataframe
signal = dataframe
algo = rpt.Binseg(model="normal", min_size=12*24*7, jump=12*24).fit(signal)
result = algo.predict(pen=100)  
rpt.display(signal, result)
plt.show()
# WORKS

# as numpy array
signal = dataframe.values
algo = rpt.Binseg(model="normal", min_size=12*24*7, jump=12*24).fit(signal)
result = algo.predict(pen=100)  
rpt.display(signal, result)
plt.show()
# DOES NOT WORK

# as dataframe with 2 columns with exactly the same data
signal = dataframe
signal['1'] = signal[0]
algo = rpt.Binseg(model="normal", min_size=12*24*7, jump=12*24).fit(signal)
result = algo.predict(pen=100)  
rpt.display(signal, result)
plt.show()
# DOES NOT WORK

oboulant commented 6 months ago

Hi @jdkworld ,

Thanks for your interest in ruptures.

This is because of the NaN values in your series! ruptures expects users to handle missing data themselves. If the input series contains missing data, the behaviour can be unexpected.

If you remove the missing data, the output "looks" fine.
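A quick way to confirm whether a signal contains missing values before fitting is sketched below. This is only an illustration: `count_missing` is a hypothetical helper (not part of ruptures), and the toy DataFrame stands in for the real signal.

```python
import numpy as np
import pandas as pd


def count_missing(signal):
    """Return the number of NaN entries in a DataFrame or array-like signal."""
    values = np.asarray(signal, dtype=float)
    return int(np.isnan(values).sum())


# Toy stand-in for the real signal (hypothetical data, not signal.csv).
toy = pd.DataFrame([1.0, np.nan, 3.0, 4.0, np.nan])

n_missing = count_missing(toy)
if n_missing:
    print(f"{n_missing} missing values: remove or fill them before calling fit()")
```

The same check applies whether the signal is passed as a DataFrame or as `dataframe.values`.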

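import numpy as np

# convert to a float array, then drop the NaN entries (this flattens the array to 1-D)
series = dataframe.to_numpy(dtype='float', na_value=np.nan)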
print(f"Raw data : shape is {series.shape}")
series = series[~np.isnan(series)]
print(f"After removing the nans : shape is {series.shape}")
algo = rpt.Binseg(model="normal", min_size=12*24*7, jump=12*24).fit(series)
result = algo.predict(pen=100)  
rpt.display(series, result)
plt.show()

which outputs

I hope this helps! Let us know!

Olivier

jdkworld commented 6 months ago

Hi Olivier,

Thanks a lot for your answer. The signal is a time series, and I still want min_size and jump to correspond to the correct time period, so simply removing the NaNs is not an option. If I understand correctly, I should therefore fill in all missing data so that no NaN values are left and the time interval between steps is constant?

Josien

oboulant commented 6 months ago

If you want to keep the time series' structure along the time axis, then yes, you have to fill the missing values with something.

There are many strategies for this (0.0, the last known value, random draws from the series, the mean or median over a particular time window, etc.). Which one is appropriate depends on your use case; it is a decision you have to make according to the underlying goal of the task you are trying to solve!
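The strategies listed above could be sketched with pandas roughly as follows. This is a minimal illustration on a toy series (hypothetical data, not the signal from this issue); the window size, the random seed and all variable names are assumptions chosen for the example, not recommendations.

```python
import numpy as np
import pandas as pd

# Toy series with gaps (hypothetical data standing in for the real signal).
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Constant value.
filled_zero = s.fillna(0.0)

# Last known value (forward fill).
filled_ffill = s.ffill()

# Mean of the observed values.
filled_mean = s.fillna(s.mean())

# Median over a centered rolling window of 3 samples (NaNs are skipped
# inside each window; a window with no observed value stays NaN).
rolling_median = s.rolling(window=3, center=True, min_periods=1).median()
filled_window = s.fillna(rolling_median)

# Random draws from the observed values (seeded for reproducibility).
rng = np.random.default_rng(0)
observed = s.dropna().to_numpy()
mask = s.isna()
filled_random = s.copy()
filled_random[mask] = rng.choice(observed, size=int(mask.sum()))
```

Each variant keeps the original length, so parameters expressed in numbers of samples, such as min_size and jump, still cover the same time span. The filled result can then be passed to `fit` (for example via `filled_ffill.to_numpy()`).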

Hope it helps!

Olivier