deepcharles / ruptures

ruptures: change point detection in Python
BSD 2-Clause "Simplified" License

Return a group index with the same shape as the signal #232

Closed Illviljan closed 2 years ago

Illviljan commented 2 years ago

Once you have found the change points, you usually want to apply some operation (like mean()) to each segment of data between consecutive change points.

It's quite common to use pandas for that (df.groupby("group_idx")), but a dataframe requires group_idx to have the same length as the signal, whereas ruptures returns only a list of change point indexes.
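To illustrate why the label array has to match the signal, here is a minimal sketch with made-up data. The signal values and the hand-written `group_idx` are assumptions for illustration only; nothing here calls ruptures.

```python
import numpy as np
import pandas as pd

# Toy signal with three flat segments (illustrative values only).
signal = pd.Series([1.0, 1.0, 4.0, 4.0, 4.0, 9.0, 9.0])

# One label per sample: same length as the signal.
group_idx = np.array([0, 0, 2, 2, 2, 5, 5])

# groupby accepts an array of labels aligned with the data.
segment_means = signal.groupby(group_idx).mean()
print(segment_means.index.tolist())  # [0, 2, 5]
print(segment_means.tolist())        # [1.0, 4.0, 9.0]
```

Passing the bare list of change points would not work here, because pandas needs exactly one label per row.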

The suggestion is therefore, instead of returning a list of indexes, to fill a group_idx array in which every sample is labelled with the change point that starts its segment, and to return that array. For example, for a signal of length 7: [2, 5] -> [0, 0, 2, 2, 2, 5, 5]
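The conversion can already be done by hand today. The sketch below pre-allocates the array and fills it one segment at a time. `fill_group_idx` is a hypothetical helper name, not a ruptures function, and it assumes the breakpoints are sorted. It also tolerates a trailing breakpoint equal to the signal length, which ruptures includes in its output.

```python
import numpy as np

def fill_group_idx(bkps, n_samples):
    """Label every sample with the start index of its segment.

    Hypothetical helper, not part of ruptures. Assumes `bkps` is sorted.
    """
    # Allocate the full array once instead of growing it.
    group_idx = np.empty(n_samples, dtype=np.intp)
    start = 0
    # Appending n_samples closes the last segment; if bkps already ends
    # with n_samples, the extra slice is empty and changes nothing.
    for end in [*bkps, n_samples]:
        group_idx[start:end] = start
        start = end
    return group_idx

print(fill_group_idx([2, 5], n_samples=7).tolist())     # [0, 0, 2, 2, 2, 5, 5]
print(fill_group_idx([2, 5, 7], n_samples=7).tolist())  # [0, 0, 2, 2, 2, 5, 5]
```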

This could also improve performance, since it's often faster to pre-allocate an array of the correct size and fill in values than to repeatedly grow the array (in my benchmarks, .append has been one of the bottlenecks). MATLAB has a nice article explaining it better than I could.

This is quite a breaking change so it probably has to be optional for a while.

Examples of packages using group_idx style: pandas, dask dataframes, numpy_groupies, flox

deepcharles commented 2 years ago

Hi, thanks for the nice suggestion. Indeed, this would be a breaking change. Nevertheless, we could add an optional argument to change the output format.
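As a rough sketch of what such an opt-in switch might look like, assuming it is applied as a post-processing step on the breakpoint list: `format_breakpoints` and its `as_group_idx` flag are hypothetical names and do not exist in ruptures.

```python
import numpy as np

def format_breakpoints(bkps, n_samples, as_group_idx=False):
    """Hypothetical output switch; not part of the ruptures API."""
    if not as_group_idx:
        # Default: keep the current list-of-breakpoints output unchanged.
        return list(bkps)
    # Segment edges: 0, the interior breakpoints, then the signal length.
    inner = [b for b in bkps if 0 < b < n_samples]
    edges = np.array([0, *inner, n_samples])
    # Repeat each segment's start index once per sample in that segment.
    return np.repeat(edges[:-1], np.diff(edges))

print(format_breakpoints([2, 5, 7], n_samples=7))  # [2, 5, 7]
print(format_breakpoints([2, 5, 7], n_samples=7, as_group_idx=True).tolist())
# [0, 0, 2, 2, 2, 5, 5]
```

Keeping the flag off by default would leave existing code untouched, which fits the concern about breaking changes.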