fpetitjean / DBA

DBA: Averaging for Dynamic Time Warping
http://www.francois-petitjean.com/Research/
GNU General Public License v3.0

Time Series Multivariate Support in Python #3

Closed cmuell89 closed 6 years ago

cmuell89 commented 6 years ago

Does your implementation (Python) support DBA for time series of varying lengths, or does it assume all sequences have the same length T (as mentioned in your paper)?

Does your implementation support multivariate time series data?

e.g. series.shape

(number_of_time_series, length_time_series, dimensionality)

fpetitjean commented 6 years ago

It currently assumes all series are the same length, but nothing in the method depends on that, so I could make it work for variable lengths if you'd like. The only thing that gets a bit tricky when the lengths differ is that you can no longer sensibly use the warping window (which makes calculations much faster). I also have a Cython implementation if you would like me to post it. Let me know. Cheers, F
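For reference, the warping-window speed-up mentioned above (a Sakoe-Chiba band) can be sketched as follows. This is an illustrative pure-Python sketch, not the repository's code; the function name and toy series are made up:

```python
import math

def dtw(a, b, window=None):
    """DTW distance between two 1-D sequences.

    With similar lengths, restricting |i - j| <= window (a Sakoe-Chiba
    band) skips most of the cost matrix.  With very different lengths
    a tight band can no longer be applied sensibly, which is the
    complication mentioned above.
    """
    n, m = len(a), len(b)
    # The band must be at least as wide as the length difference,
    # otherwise no valid path exists.
    w = max(window, abs(n - m)) if window is not None else max(n, m)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return math.sqrt(D[n][m])

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4], window=1))  # 0.0 (perfect warp)
```

With `window=None` the full cost matrix is evaluated; a small `window` turns the O(n·m) computation into roughly O(n·w).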

cmuell89 commented 6 years ago

I will be generating the DBA result sequences offline, so absolute speed is not all that important with regard to losing the constraint-window speed-up. However, releasing support for variable-length sequences would be very helpful.

Does your software support multivariate sequences? My field is robotics, and our sequence (trajectory) data often has many dimensions (joints, end-effector position, etc.).

Thanks for the quick response! C

fpetitjean commented 6 years ago

The software doesn't yet, but the theory is exactly the same. I'll try to create those versions for you tomorrow. If using multivariate data, you'll have to be a bit careful with the scale of the values in each dimension, because the Euclidean distance is used to compare the (multi-dimensional) elements of the series. Just checking: you do have the same length in all dimensions for every single series, yes?

fpetitjean commented 6 years ago

OK, variable lengths now work - let me know if anything looks weird. Next up: the multivariate version.

cmuell89 commented 6 years ago

Thank you for doing this so quickly! You're saving me loads of time, especially with some deadlines looming!

Yes, I will be sure to range-normalize my sequences so that no single feature dominates the L2-norm distance.
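That range normalization could look like the following, scaling each dimension to [0, 1] using the global minimum and maximum over all series. A pure-Python sketch; the function name and data layout (each series a list of points, each point a list of channel values) are illustrative:

```python
def min_max_per_dim(series_list):
    """Scale each dimension to [0, 1] using the global min/max over
    all series, so no single dimension dominates the Euclidean
    distance inside DTW.  Each series is a (length, n_channels)
    nested list."""
    n_dims = len(series_list[0][0])
    lo = [min(p[d] for s in series_list for p in s) for d in range(n_dims)]
    hi = [max(p[d] for s in series_list for p in s) for d in range(n_dims)]
    span = [(h - l) or 1.0 for l, h in zip(lo, hi)]  # avoid /0 on flat dims
    return [[[(p[d] - lo[d]) / span[d] for d in range(n_dims)] for p in s]
            for s in series_list]

# Two toy 2-channel series with very different scales per channel:
data = [[[0.0, 100.0], [1.0, 200.0]], [[2.0, 300.0]]]
print(min_max_per_dim(data))  # [[[0.0, 0.0], [0.5, 0.5]], [[1.0, 1.0]]]
```

Computing the min/max over all series at once (rather than per series) keeps the relative scale between series intact.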

cmuell89 commented 6 years ago

With regard to being unable to use the windowing constraint: is this only used during the DTW alignment between each sequence and the current average, i.e. while finding the points in each sequence associated with the coordinates of the average sequence?
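For context, one DBA update does exactly that alignment: each series is aligned to the current average with DTW, and every coordinate of the average then becomes the barycenter (mean) of the series points associated with it. A minimal univariate sketch of the idea (not the repository's implementation; names are illustrative):

```python
def dtw_path(a, b):
    """Full DTW between two 1-D sequences; returns the alignment
    path as a list of (i, j) index pairs."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack the cheapest predecessor from the corner.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return path[::-1]

def dba_update(average, series_list):
    """One barycenter-averaging pass: coordinate i of the average
    becomes the mean of every series point DTW associates with it."""
    buckets = [[] for _ in average]
    for s in series_list:
        for i, j in dtw_path(average, s):
            buckets[i].append(s[j])
    return [sum(b) / len(b) if b else average[i]
            for i, b in enumerate(buckets)]

print(dba_update([0.0, 0.0, 0.0], [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]))
# [2.0, 2.0, 2.0]
```

So yes, the alignment (and hence any windowing) only appears in the `dtw_path` step; the barycenter update itself is length-agnostic.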

Might you gain some speed by employing the FastDTW algorithm? See FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space


cmuell89 commented 6 years ago

Another concern with variable-length sequences: if the initial average sequence is much shorter than some of the sequences, there might be a number of large singularities, where many coordinates of the current sequence map to a single coordinate of the average sequence. Would shorter sequences too strongly influence the barycenters of the average sequence's coordinates?

fpetitjean commented 6 years ago

Hi Carl, one could indeed use any DTW speed-up for DBA, because all DBA relies on is having an alignment path in the matrix. FastDTW would do for an approximation; PrunedDTW would too if you want the exact alignment. I just don't have the time to add them to the code - do you? :) Also, for most people, setting a warping window is a reasonable thing to do, and then everything gets much faster. Note also that the Python implementation is about 50x slower than the Java one (Python doesn't like iterating through all i/j).

Regarding the length of the average sequence, you're now touching on a sore spot for DTW. The issue is actually not with DBA but with DTW directly - DBA only minimizes the sum of squared DTW distances, but it is true that two pairs of sequences will generally not have distances on the same scale if their lengths differ a lot. This is why most people first uniformly stretch/contract the series to a given length, and then apply DTW. Maybe this is what you want to do? A bit more info on your application? Cheers, F
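The uniform stretch/contract mentioned above is just linear resampling onto a common length. A pure-Python sketch (illustrative; assumes `target_len >= 2`):

```python
def resample(series, target_len):
    """Linearly interpolate a 1-D series onto target_len evenly
    spaced positions, uniformly stretching or contracting it.
    Assumes target_len >= 2."""
    n = len(series)
    if n == 1:
        return [series[0]] * target_len
    out = []
    for k in range(target_len):
        # Fractional position of output sample k in the input index space.
        pos = k * (n - 1) / (target_len - 1)
        i = min(int(pos), n - 2)
        frac = pos - i
        out.append(series[i] * (1 - frac) + series[i + 1] * frac)
    return out

print(resample([0.0, 1.0, 2.0], 5))  # [0.0, 0.5, 1.0, 1.5, 2.0]
```

Resampling every series (and the initial average) to one length also restores the ability to use a warping window, since all series then live on the same time axis.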

Will try to get to the multivariate version soon.

fpetitjean commented 6 years ago

Note, I have split the multi-variate and multiple lengths in 2 different issues

cmuell89 commented 6 years ago

I think providing an interface into the alignment aspect of DBA would be best, considering how many variations of DTW exist (FastDTW, window constraints, derivative DTW, shapeDTW, etc.), rather than implementing concrete variations.

The goal of my current research project is to take human gestures and robotic gestures and classify them in real time. The challenge is that these gestures can experience significant time warping (e.g. someone waving slowly versus quickly) and that there might be anomalies in the performance of a gesture (e.g. temporary pauses). Generating consensus gestures that best represent each gesture class is my first goal. I want to experiment with your DBA method first and then move on to more complex approaches if DBA does not capture enough information for adequate differentiation. Another approach is to search for template gestures: Prototype Optimization for Temporarily and Spatially Distorted Time Series

As for stretching my gesture trajectories to a uniform length: this may or may not be necessary for my application; I am not sure. So long as the consensus gestures are representative enough, they should serve as good templates for differentiating test gesture classes.

fpetitjean commented 6 years ago

Hi Carl, I have finished adding a multivariate version of DBA. Regarding your comment - this is actually extremely relevant and I'm working on it, but there are some peculiarities of DTW-type spaces that make things not as simple as just providing an interface into the alignment. For instance, if you use an L1 norm inside DTW, then the solution for the average element changes: it is no longer the arithmetic mean of the associated points. The same thing happens if you have two points A and B in a Euclidean space and you want to find the point C that minimizes d(A,C)+d(B,C) (as opposed to d(A,C)² + d(B,C)²): every point on the segment [A,B] minimizes this sum, while only the midpoint minimizes the sum of squares. Anyway, let me know if it works for you - always happy to hear about what people have done with DBA!

NB: the data should be in channels-first format, i.e. (n_series, n_channels, length). To allow for variable lengths, I have made the data a list of (n_channels, length) matrices. Cheers, F
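Concretely, the channels-first, variable-length format described above (a list of (n_channels, length) matrices) could be built like this. The numbers are a toy illustration, shown here as plain nested lists:

```python
# Two 3-channel series with lengths 4 and 6, channels-first:
# each series is an (n_channels, length) matrix, one row per channel.
t1 = [[0.1, 0.2, 0.3, 0.4],   # channel 1, length 4
      [1.1, 1.2, 1.3, 1.4],   # channel 2
      [2.1, 2.2, 2.3, 2.4]]   # channel 3
t2 = [[0.0] * 6,              # channel 1, length 6
      [1.0] * 6,              # channel 2
      [2.0] * 6]              # channel 3

data = [t1, t2]  # a plain list, not an ndarray, because lengths differ

n_channels = len(data[0])
lengths = [len(s[0]) for s in data]
print(n_channels, lengths)  # 3 [4, 6]
```

Each element of `data` can still be converted to a regular 2-D array on its own; only the outer container has to stay a list when lengths differ.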

cmuell89 commented 6 years ago

Thank you very much! I'll test it out today.

C

cmuell89 commented 6 years ago

With regard to data formatting, here is a toy example to make sure I've understood you correctly. Three variables/channels; time series 1 has length 4, time series 2 has length 6:

t1 = [[c1, c2, c3], [c1, c2, c3], [c1, c2, c3], [c1, c2, c3]]

t2 = [[c1, c2, c3], [c1, c2, c3], [c1, c2, c3], [c1, c2, c3], [c1, c2, c3], [c1, c2, c3]]

data = [t1, t2]

data.shape = (n_series=2, n_channels=3, length=(4,6))

In this case the length is variable, so casting into an ndarray might not work, but this is just for conceptualization.

cmuell89 commented 6 years ago

I think I've got it: each time series should be the transpose of the examples in my previous comment.

fpetitjean commented 6 years ago

Yes - if you look at the example in the main script, I create a list of time series, and for each time series a matrix of shape (n_channels, length). All good?