Add Tutorial(s) that Reproduces “Matrix Profile Top Ten” Paper

seanlaw commented 4 years ago

It would be nice to add a tutorial(s) that reproduces the Matrix Profile Top Ten paper. The accompanying data at their Google sites page can be found here.

It might be best to make the individual top ten sections as separate items (i.e., sub-list) that rolls under one tutorial line item in RTD and that can expand to ten sub items

_{Sent with GitHawk}

seanlaw commented 4 years ago

Here is the Shapelet Discovery data . Note that there is already a train/test split and in each dataset. For the paper, we are interested in the gun_train and where the response column is in the first column, with 0 being "gun" (24 rows) and 1 being "nogun" (26 rows).

The data can be read with:

import pandas as pd

df = pd.read_csv("gun_train", sep="\s+", header=None)

A Tutorial that reproduces the Shapelet Discovery paper can be found here

Note that there is a post-processing step that the user will need to perform as the Shapelet computation will often result in subtracting one np.inf from another np.inf. This will be explicitly covered in the tutorial.

Attol8 commented 4 years ago

Hi @seanlaw, I have recently bumped into STUMPY and I am a big fan of your work! Do you still need help with this issue? If so, I will start drafting a Jupyter notebook with each of the 10 use-cases described in the paper. Also, How would you title the tutorial line item in RTD?

seanlaw commented 4 years ago

@Attol8 Thank you for your kind words and, absolutely yes, we are looking for and welcome help on this issue!

If you don't mind, the approach that I'd like to take with this is to split up each of the 10 use cases into its own individual notebook (see this incomplete Shapelets example use case) and place them into the docs/ directory. This way, the scope for each notebook is small, achievable, and anybody can tackle a different use case. As you work on one use case, it would be best to create a new issue for that one use case so its progress can be tracked (see this incomplete Shapelets Tutorial Issue). Once we get a few (3?) use cases completed, then on my end I can handle the best way to display it in RTD (or maybe you already have some ideas and I'd be happy to discuss). This way, the writing of the notebooks will be separated from how/when it is presented to the user in our documentation (which may take a bit more thinking/planning).

For each use case, I'd recommend the following rough checklist

[ ] Create a new issue for the use case
[ ] Clone the repository and create a new branch for the use case
[ ] Find and gather the original data set (and reference it in the issue)
[ ] Create a new notebook tutorial in docs/
[ ] Write the Python code to reproduce the example from the "Matrix Profile Top 10" paper (note that this is strictly code)
[ ] Add written text to complement and elaborate on the code as well as any useful interpretations of the results
[ ] Submit "work-in-progress" pull requests to get help and for any reviews that you may want

I am flexible and open to your thoughts/ideas too. How does that sound?

Attol8 commented 4 years ago

Thanks for the detailed answer @seanlaw ! As you suggested, I agree that the best way to proceed is: notebooks first, reviews seconds, implementation of the docs third. Following your checklist, I will now get to work and post updates as soon as I have something ready

seanlaw commented 4 years ago

Awesome and thanks again for contributing to our community, @Attol8! Please choose a use case and create an issue for it so that I know which one you are working on and then I can better assist you. Let's continue the conversations on the corresponding issue(s). I am here to help!

ken-maeda commented 2 years ago

For reconfirmation of current situation, 11 use cases on Matrix Profile Top Ten paper is planed to be splitted tutorials. Tutorials on the paper is followings

[ ] 3.1 Discovering motifs under uniform scaling
[ ] 3.2 Discovering time series semordnilaps
[ ] 3.3 Discovering time series reverse complements
[ ] 3.4 Segmenting repetitive exercises
[ ] 3.5 Robust distance functions
[ ] 3.6 Meter‑swapping detection
[x] 3.7 Shapelet discovery
[ ] 3.8 Detecting and locating low frequency earthquakes
[ ] 3.9 Automatically clustering time series motifs
[ ] 3.10 Quantifying Parkinson disease
[ ] 3.11 ~~Scalability~~

3.7 is Done.

I intend to work from the top.

I couldn't figure out whether dataset is available or not.
[3.1]

Fig3.MALLAT dataset (Data is not clear)
Description: We took two exemplars from the same class from the MALLAT dataset (Chen et al. 2015)
=>(Chen et al. 2015) is ucr and MALLAT is available.
But Time series length is far different in comparing Fig3(10000) with ucr(1000).

Where is dataset?
Appropriate dataset? (Data is not clear)
Description: We deliberately chose this dataset, from the 85 in the UCR archive.
This is because complex time series (see Batista et al. 2014)
=> the 85 in the UCR archive, somewhere in ucr?.
=> (see Batista et al. 2014) is paper of complexity-invariant distance.
So dataset should be the kind of these. But No code example, no figure.

Which is the 85 in the UCR archive?
Fig4.ElectriSense dataset (Data is not clear)
Description: Two non-contagious snippets from the ElectriSense dataset (Gupta et al. 2010).
=>Gupta et al. 2010

Where is dataset?

[3.2]

Performance by the Tafelmusik Orchestra (Converting parameter is unknown)
Description: directed by Bruno Weil in 1993 (Music Performance 2017)
we converted it to Mel-frequency cepstral coefficients (MFCC) using windows with 0.5 s and 50% of overlap
=>Music Performance 2017. DELETED
Joseph Haydn / Symphony No. 47 in G major "Palindrome" (Weil):similar song
=>coefficients (MFCC) using windows with 0.5 s and 50% of overlap
mp3 data have to be converted by MFCC. But it seems more parameters have to be set.

[3.3]

hip-worn accelerometer (Example series is unknown)
Description: It shows the y-axis from a hip-worn accelerometer from the USC-HAD Database
Approximately 2 min from a dataset from a hip-worn accelerometer of quotidian activity.
=>USC-HAD Database
840 files and each files 6 series(5040series in total)
=> Approximately 2 min
According to dataset readme, sampling rate is 100Hz.
With this rate, only around 10 series can be candidate, however they aren't target data.
I think "2min" is wrong info.

Action plan

1.Search dataset more
2.Define replacement of dataset

In my opinion, these 3 examples are really similar.
MP calculation with simple preprocess like [stretch, flip, invert] is features.
So those can be combined tutorials with alternative dataset.

Please share your idea. I'll dig up [3.3] dataset at the first.

seanlaw commented 2 years ago

In my opinion, these 3 examples are really similar. MP calculation with simple preprocess like [stretch, flip, invert] is features. So those can be combined tutorials with alternative dataset. Please share your idea. I'll dig up [3.3] dataset at the first.

I think the most important thing is to find the data. If it is not available or if a reasonable replacement is not available then we should skip it. For now, let's not even worry about the tutorial and instead focus on whether we can reproduce the work. If not, then it may not even make sense to make a tutorial for any of this.

@ken-maeda How does that sound? Does that make sense?

ken-maeda commented 2 years ago

Thanks, @seanlaw I'll focus on finding the data and reasonable replacement.

ken-maeda commented 2 years ago

I couldn't find the dataset. On the other hand I created replacement by combining those original dataset. However I'm not sure context of paper is followed.
As 3rd option, I tried duplication from paper image.

paper sample 31_paper_fig3a

duplication series

plt.plot(x_dummy2[ind10:ind11], y_dummy[ind10:ind11],
color="lime", linewidth=8)
plt.plot(x_dummy2[ind20:ind21], y_dummy[ind20:ind21],
color="aqua", linewidth=8)
plt.plot(y_dummy, color="grey")

If it is acceptable, it can't be necessary that finding the dataset. @seanlaw How do you think?

seanlaw commented 2 years ago

@ken-maeda The "duplicate series" (created by manually by you) looks identical to the "paper sample". Can you share the code for generating the duplicate series? Is this for 3.3?

It seems that this dataset is quite trivial. Is there any preprocessing needed (i.e., flipping, stretching, inverting, etc)?

ken-maeda commented 2 years ago

This dataset is for first example Fig.3 of 3.1. The reason why i choose this dataset is just first example on paper. But if this works well, it can be applied to other example.


import numpy as np
import cv2
import matplotlib.pyplot as plt
from scipy import interpolate

# load series image
img = cv2.imread('image/31_paper_fig3.png')
img_line = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# remove hilighted area(green and blue)
value_gray = 100  # 100 is identified by hist
img_line[np.where(img_line > value_gray)] = 255
point = np.where(img_line <= value_gray)
x, y = point[1], -point[0]

# extract ploting area
# time seris data should be 1 point/1 time_id
x = (x - np.min(x))/(np.max(x) - np.min(x))
y = (y - np.max(y))/(np.max(y) - np.min(y)) + 1
x_unique = np.unique(x)

y_unique = []
for i in x_unique:
    y_cand = y[np.where(x == i)[0]]
    y_unique.append(np.mean(y_cand))  # pixel can be in conflicted pos

y_uniuqe = np.array(y_unique)

# stretch dataset length
target_length = 10048  # on paper
x_dummy = np.arange(0, 1, 1/10048, dtype=float)

fitted_curve = interpolate.PchipInterpolator(x_unique, y_uniuqe)
y_dummy = fitted_curve(x_dummy)

# result
plt.figure(figsize=(15, 2))
plt.plot(y_dummy)

Loaded image is 31_paper_fig3

If this dataset is acceptable, I can preprocess flipping, ..... by following the paper.

seanlaw commented 2 years ago

@ken-maeda Please go ahead and try it out!

ken-maeda commented 2 years ago

I'm ready for first draft of 3.1. I want you to check flow of the tutorial. Should I upload dataset to somewhere at this stage? or just pull request with csv, pkl file?

seanlaw commented 2 years ago

I'm ready for first draft of 3.1. I want you to check flow of the tutorial. Should I upload dataset to somewhere at this stage? or just pull request with csv, pkl file?

@ken-maeda Let's not worry about uploading the data yet. Can you please:

Add a new Jupyter notebook in docs/Tutorial_Matrix_Profile_Top_Ten.ipynb
Add your tutorial code/contents in this notebook
Create a new pull request (PR)

And then I can provide feedback.

If we need the data then we can post it as a comment directly in the PR:

file

ken-maeda commented 2 years ago

I'm almost ready for 3.3 notebook. About 3.2, original dataset is missing and I'm not familiar with those sound converting tech. So I skipped it. Btw 3.3 notebook is really short : )

TDAmeritrade / stumpy

Add Tutorial(s) that Reproduces “Matrix Profile Top Ten” Paper #85

Where is dataset?

Which is the 85 in the UCR archive?

Where is dataset?

Action plan