TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/

Add Tutorial(s) that Reproduces “Matrix Profile Top Ten” Paper #85

Open seanlaw opened 4 years ago

seanlaw commented 4 years ago

It would be nice to add tutorial(s) that reproduce the Matrix Profile Top Ten paper. The accompanying data can be found at their Google Sites page here.

It might be best to make the individual top-ten sections separate items (i.e., a sub-list) that roll up under one tutorial line item in RTD and can expand into ten sub-items.


seanlaw commented 4 years ago

Here is the Shapelet Discovery data. Note that there is already a train/test split in each dataset. For the paper, we are interested in gun_train, where the response is in the first column, with 0 being "gun" (24 rows) and 1 being "nogun" (26 rows).

The data can be read with:

import pandas as pd

# Whitespace-delimited file with no header row
df = pd.read_csv("gun_train", sep=r"\s+", header=None)
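
To separate the response column from the series, here is a small sketch based on the description above (the variable names are illustrative):

# Split off the response column (the first column) from the time series values
y = df.iloc[:, 0]   # 0 = "gun", 1 = "nogun"
X = df.iloc[:, 1:]  # each row is one time series
print(y.value_counts())  # sanity check: expect 24 zeros and 26 ones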

A tutorial that reproduces the Shapelet Discovery paper can be found here.

Note that there is a post-processing step that the user will need to perform as the Shapelet computation will often result in subtracting one np.inf from another np.inf. This will be explicitly covered in the tutorial.
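
For context, here is a minimal sketch of the np.inf issue being described; the arrays and the masking strategy are illustrative, not the tutorial's actual code:

import numpy as np

# Hypothetical distance profiles; in practice these come out of the matrix
# profile computations and can contain np.inf entries
P_a = np.array([0.5, np.inf, 1.2])
P_b = np.array([0.7, np.inf, 0.9])

# P_a - P_b yields [-0.2, nan, 0.3] because np.inf - np.inf is np.nan

# One possible post-processing step: only difference the finite entries
mask = np.isfinite(P_a) & np.isfinite(P_b)
diff = np.zeros_like(P_a)
diff[mask] = P_a[mask] - P_b[mask]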

Attol8 commented 4 years ago

Hi @seanlaw, I have recently bumped into STUMPY and I am a big fan of your work! Do you still need help with this issue? If so, I will start drafting a Jupyter notebook with each of the 10 use cases described in the paper. Also, how would you title the tutorial line item in RTD?

seanlaw commented 4 years ago

@Attol8 Thank you for your kind words and, absolutely yes, we are looking for and welcome help on this issue!

If you don't mind, the approach that I'd like to take with this is to split up each of the 10 use cases into its own individual notebook (see this incomplete Shapelets example use case) and place them into the docs/ directory. This way, the scope for each notebook is small, achievable, and anybody can tackle a different use case.

As you work on one use case, it would be best to create a new issue for that one use case so its progress can be tracked (see this incomplete Shapelets Tutorial Issue).

Once we get a few (3?) use cases completed, then on my end I can handle the best way to display it in RTD (or maybe you already have some ideas and I'd be happy to discuss). This way, the writing of the notebooks will be separated from how/when it is presented to the user in our documentation (which may take a bit more thinking/planning).

For each use case, I'd recommend the following rough checklist:

I am flexible and open to your thoughts/ideas too. How does that sound?

Attol8 commented 4 years ago

Thanks for the detailed answer @seanlaw! As you suggested, I agree that the best way to proceed is: notebooks first, reviews second, implementation of the docs third. Following your checklist, I will now get to work and post updates as soon as I have something ready.

seanlaw commented 4 years ago

Awesome and thanks again for contributing to our community, @Attol8! Please choose a use case and create an issue for it so that I know which one you are working on and then I can better assist you. Let's continue the conversations on the corresponding issue(s). I am here to help!

ken-maeda commented 2 years ago

To reconfirm the current situation: the 11 use cases in the Matrix Profile Top Ten paper are planned to be split into separate tutorials. The tutorials from the paper are as follows:

3.7 is Done.

I intend to work from the top.

I couldn't figure out whether the datasets for these are available or not:
[3.1]

[3.2]

[3.3]

Action plan:

1. Search for the datasets further
2. Define replacements for the datasets

In my opinion, these 3 examples are really similar: each is an MP calculation with simple preprocessing like [stretch, flip, invert] as the features. So they could be combined into one tutorial with an alternative dataset (see the sketch below).
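
For reference, a minimal numpy sketch of those three transformations on a placeholder series (the names and lengths here are illustrative):

import numpy as np

T = np.random.rand(1000)  # placeholder series

flipped = T[::-1]  # flip: reverse the time axis
inverted = -T      # invert: negate the values

# stretch: resample to twice the length via linear interpolation
new_len = 2 * len(T)
stretched = np.interp(np.linspace(0, len(T) - 1, new_len), np.arange(len(T)), T)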

Please share your ideas. I'll dig up the [3.3] dataset first.

seanlaw commented 2 years ago

In my opinion, these 3 examples are really similar: each is an MP calculation with simple preprocessing like [stretch, flip, invert] as the features. So they could be combined into one tutorial with an alternative dataset. Please share your ideas. I'll dig up the [3.3] dataset first.

I think the most important thing is to find the data. If it is not available, or if a reasonable replacement cannot be found, then we should skip it. For now, let's not even worry about the tutorial and instead focus on whether we can reproduce the work. If not, then it may not even make sense to make a tutorial for any of this.

@ken-maeda How does that sound? Does that make sense?

ken-maeda commented 2 years ago

Thanks, @seanlaw. I'll focus on finding the data or a reasonable replacement.

ken-maeda commented 2 years ago

I couldn't find the dataset. As an alternative, I created a replacement by combining the original datasets; however, I'm not sure it follows the context of the paper. As a third option, I tried duplicating the series from the paper image.

Paper sample (image): 31_paper_fig3a

Duplicated series (image), generated with:

import matplotlib.pyplot as plt

# x_dummy2, y_dummy, and the ind* bounds are defined elsewhere in the
# notebook; the two thick lines highlight the paper's motif regions
plt.plot(x_dummy2[ind10:ind11], y_dummy[ind10:ind11],
         color="lime", linewidth=8)
plt.plot(x_dummy2[ind20:ind21], y_dummy[ind20:ind21],
         color="aqua", linewidth=8)
plt.plot(x_dummy2, y_dummy, color="grey")  # full series on the same x-axis

If it is acceptable, then finding the original dataset may not be necessary. @seanlaw What do you think?

seanlaw commented 2 years ago

@ken-maeda The "duplicate series" (created manually by you) looks identical to the "paper sample". Can you share the code for generating the duplicate series? Is this for 3.3?

It seems that this dataset is quite trivial. Is there any preprocessing needed (e.g., flipping, stretching, inverting, etc.)?

ken-maeda commented 2 years ago

This dataset is for the first example, Fig. 3 of 3.1. I chose this dataset simply because it is the first example in the paper, but if it works well, the same approach can be applied to the other examples.


import numpy as np
import cv2
import matplotlib.pyplot as plt
from scipy import interpolate

# Load the series image and convert it to grayscale
img = cv2.imread('image/31_paper_fig3.png')
img_line = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Remove the highlighted areas (green and blue)
value_gray = 100  # threshold of 100 identified from the histogram
img_line[np.where(img_line > value_gray)] = 255
point = np.where(img_line <= value_gray)
x, y = point[1], -point[0]

# Extract the plotting area and normalize both axes to [0, 1]
# (time series data should have one point per time id)
x = (x - np.min(x)) / (np.max(x) - np.min(x))
y = (y - np.max(y)) / (np.max(y) - np.min(y)) + 1
x_unique = np.unique(x)

y_unique = []
for i in x_unique:
    y_cand = y[np.where(x == i)[0]]
    y_unique.append(np.mean(y_cand))  # several pixels can land on the same x position

y_unique = np.array(y_unique)

# Stretch the dataset to the target length
target_length = 10048  # length used in the paper
x_dummy = np.arange(0, 1, 1 / target_length, dtype=float)

fitted_curve = interpolate.PchipInterpolator(x_unique, y_unique)
y_dummy = fitted_curve(x_dummy)

# Plot the result
plt.figure(figsize=(15, 2))
plt.plot(y_dummy)

The loaded image is 31_paper_fig3.

If this dataset is acceptable, I can apply the preprocessing (flipping, etc.) by following the paper.
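
Assuming the reconstructed y_dummy is acceptable, the downstream motif discovery would presumably look something like the following sketch (the window size m is an illustrative guess, not the paper's value):

import numpy as np
import stumpy

m = 640  # illustrative window size
mp = stumpy.stump(y_dummy, m)  # y_dummy from the reconstruction script above

# The best motif pair sits at the minimum of the matrix profile (column 0);
# column 1 holds the index of its nearest neighbor
motif_idx = np.argmin(mp[:, 0])
neighbor_idx = mp[motif_idx, 1]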

seanlaw commented 2 years ago

@ken-maeda Please go ahead and try it out!

ken-maeda commented 2 years ago

I'm ready with a first draft of 3.1 and would like you to check the flow of the tutorial. Should I upload the dataset somewhere at this stage, or just open a pull request with the csv/pkl files?

seanlaw commented 2 years ago

I'm ready with a first draft of 3.1 and would like you to check the flow of the tutorial. Should I upload the dataset somewhere at this stage, or just open a pull request with the csv/pkl files?

@ken-maeda Let's not worry about uploading the data yet. Can you please:

  1. Add a new Jupyter notebook in docs/Tutorial_Matrix_Profile_Top_Ten.ipynb
  2. Add your tutorial code/contents in this notebook
  3. Create a new pull request (PR)

And then I can provide feedback.

If we need the data then we can post it as a file attachment directly in the PR.

ken-maeda commented 2 years ago

I'm almost ready with the 3.3 notebook. As for 3.2, the original dataset is missing and I'm not familiar with the audio conversion techniques involved, so I skipped it. Btw, the 3.3 notebook is really short : )