seanlaw opened 4 years ago
Here is the Shapelet Discovery data. Note that there is already a train/test split in each dataset. For the paper, we are interested in gun_train, where the response column is the first column, with 0 being "gun" (24 rows) and 1 being "nogun" (26 rows).
The data can be read with:
import pandas as pd
df = pd.read_csv("gun_train", sep=r"\s+", header=None)
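Assuming the layout described above (label in the first column, one time series per row), splitting the rows into the two classes might look like the following sketch, shown here with synthetic rows in the same shape rather than the actual gun_train file:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for gun_train (assumed layout: label first, values after)
df = pd.DataFrame([[0, 1.0, 1.2, 1.1],
                   [0, 0.8, 1.1, 0.9],
                   [1, 0.9, 0.8, 1.0]])

y = df.iloc[:, 0].to_numpy()   # 0 = "gun", 1 = "nogun"
X = df.iloc[:, 1:].to_numpy()  # one time series per row

gun = X[y == 0]    # rows labeled "gun"
nogun = X[y == 1]  # rows labeled "nogun"
```

With the real file, `gun` and `nogun` would have 24 and 26 rows, respectively.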
A tutorial that reproduces the Shapelet Discovery paper can be found here. Note that there is a post-processing step that the user will need to perform, as the Shapelet computation will often result in subtracting one np.inf from another np.inf. This will be explicitly covered in the tutorial.
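For illustration only (this is not the tutorial's exact code): NumPy defines np.inf - np.inf as np.nan (and emits a RuntimeWarning), so one simple post-processing option is to mask out the non-finite entries after the subtraction:

```python
import numpy as np

a = np.array([1.0, np.inf, 2.0])
b = np.array([0.5, np.inf, np.inf])

with np.errstate(invalid="ignore"):
    diff = a - b                   # inf - inf produces nan

mask = np.isfinite(diff)           # keep only well-defined differences
clean = np.where(mask, diff, 0.0)  # e.g., zero out undefined entries
```

Replacing the undefined entries with 0.0 here is just one arbitrary choice; the appropriate value depends on how the differences are used downstream.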
Hi @seanlaw, I have recently bumped into STUMPY and I am a big fan of your work! Do you still need help with this issue? If so, I will start drafting a Jupyter notebook with each of the 10 use cases described in the paper. Also, how would you title the tutorial line item in RTD?
@Attol8 Thank you for your kind words and, absolutely yes, we are looking for and welcome help on this issue!
If you don't mind, the approach that I'd like to take with this is to split up each of the 10 use cases into its own individual notebook (see this incomplete Shapelets example use case) and place them into the docs/ directory. This way, the scope for each notebook is small, achievable, and anybody can tackle a different use case. As you work on one use case, it would be best to create a new issue for that one use case so its progress can be tracked (see this incomplete Shapelets Tutorial issue). Once we get a few (3?) use cases completed, then on my end I can handle the best way to display it in RTD (or maybe you already have some ideas and I'd be happy to discuss). This way, the writing of the notebooks will be separated from how/when it is presented to the user in our documentation (which may take a bit more thinking/planning).
For each use case, I'd recommend the following rough checklist
docs/
I am flexible and open to your thoughts/ideas too. How does that sound?
Thanks for the detailed answer @seanlaw! As you suggested, I agree that the best way to proceed is: notebooks first, reviews second, implementation of the docs third. Following your checklist, I will now get to work and post updates as soon as I have something ready.
Awesome and thanks again for contributing to our community, @Attol8! Please choose a use case and create an issue for it so that I know which one you are working on and then I can better assist you. Let's continue the conversations on the corresponding issue(s). I am here to help!
To reconfirm the current situation: the 11 use cases in the Matrix Profile Top Ten paper are planned to be split into separate tutorials. The tutorials on the paper are as follows.
3.7 is Done.
I intend to work from the top.
I couldn't figure out whether each dataset is available or not.
[3.1]
Fig. 3: MALLAT dataset (data is not clear)
Description: "We took two exemplars from the same class from the MALLAT dataset (Chen et al. 2015)"
=> (Chen et al. 2015) is the UCR archive, and MALLAT is available there.
But the time series length in Fig. 3 (10000) differs greatly from the UCR version (1000).
Appropriate dataset? (data is not clear)
Description: "We deliberately chose this dataset, from the 85 in the UCR archive. This is because complex time series (see Batista et al. 2014)"
=> "the 85 in the UCR archive": somewhere in UCR?
=> (see Batista et al. 2014) is the complexity-invariant distance paper, so the dataset should be of that kind.
But there is no code example and no figure.
Fig. 4: ElectriSense dataset (data is not clear)
Description: "Two non-contiguous snippets from the ElectriSense dataset (Gupta et al. 2010)."
=> Gupta et al. 2010
[3.2]
[3.3]
1. Search for the dataset further
2. Define a replacement for the dataset
In my opinion, these 3 examples are really similar. Their key feature is MP calculation with simple preprocessing like [stretch, flip, invert], so they could be combined into one tutorial with an alternative dataset. Please share your ideas. I'll dig into the [3.3] dataset first.
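As a rough illustration of what such combined preprocessing might look like (the function names and exact definitions below are my assumptions, not the paper's):

```python
import numpy as np

def flip(ts):
    # Reverse the series in time
    return ts[::-1]

def invert(ts):
    # Mirror the series about its mean
    return 2 * np.mean(ts) - ts

def stretch(ts, factor):
    # Linearly interpolate to `factor` times the original length
    n = len(ts)
    new_x = np.linspace(0, n - 1, int(n * factor))
    return np.interp(new_x, np.arange(n), ts)

ts = np.array([0.0, 1.0, 2.0, 1.0])
flipped = flip(ts)
inverted = invert(ts)
stretched = stretch(ts, 2)
```

Each transformed series could then be fed to stumpy in the usual way, which is what would let the three use cases share one tutorial skeleton.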
I think the most important thing is to find the data. If it is not available or if a reasonable replacement is not available then we should skip it. For now, let's not even worry about the tutorial and instead focus on whether we can reproduce the work. If not, then it may not even make sense to make a tutorial for any of this.
@ken-maeda How does that sound? Does that make sense?
Thanks, @seanlaw. I'll focus on finding the data and a reasonable replacement.
I couldn't find the dataset.
On the other hand, I created a replacement by combining the original datasets. However, I'm not sure it follows the context of the paper.
As a third option, I tried duplicating the series from the paper image.
Paper sample:
Duplicated series:
plt.plot(x_dummy2[ind10:ind11], y_dummy[ind10:ind11], color="lime", linewidth=8)
plt.plot(x_dummy2[ind20:ind21], y_dummy[ind20:ind21], color="aqua", linewidth=8)
plt.plot(y_dummy, color="grey")
If this is acceptable, it may not be necessary to find the dataset. @seanlaw What do you think?
@ken-maeda The "duplicated series" (created manually by you) looks identical to the "paper sample". Can you share the code for generating the duplicated series? Is this for 3.3?
It seems that this dataset is quite trivial. Is there any preprocessing needed (i.e., flipping, stretching, inverting, etc)?
This dataset is for the first example, Fig. 3 of 3.1. The reason I chose this dataset is simply that it is the first example in the paper. But if this works well, it can be applied to the other examples.
import numpy as np
import cv2
import matplotlib.pyplot as plt
from scipy import interpolate

# Load the series image
img = cv2.imread('image/31_paper_fig3.png')
img_line = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Remove the highlighted areas (green and blue)
value_gray = 100  # 100 was identified from the histogram
img_line[np.where(img_line > value_gray)] = 255
point = np.where(img_line <= value_gray)
x, y = point[1], -point[0]

# Extract the plotting area
# Time series data should have 1 point per time index
x = (x - np.min(x)) / (np.max(x) - np.min(x))
y = (y - np.max(y)) / (np.max(y) - np.min(y)) + 1
x_unique = np.unique(x)
y_unique = []
for i in x_unique:
    y_cand = y[np.where(x == i)[0]]
    y_unique.append(np.mean(y_cand))  # Pixels can occupy conflicting positions
y_unique = np.array(y_unique)

# Stretch to the dataset length
target_length = 10048  # From the paper
x_dummy = np.arange(0, 1, 1 / target_length, dtype=float)
fitted_curve = interpolate.PchipInterpolator(x_unique, y_unique)
y_dummy = fitted_curve(x_dummy)

# Result
plt.figure(figsize=(15, 2))
plt.plot(y_dummy)
The loaded image is attached.
If this dataset is acceptable, I can apply the preprocessing (flipping, etc.) by following the paper.
@ken-maeda Please go ahead and try it out!
I'm ready with the first draft of 3.1. I want you to check the flow of the tutorial. Should I upload the dataset somewhere at this stage, or just open a pull request with the csv/pkl files?
@ken-maeda Let's not worry about uploading the data yet. Can you please:
docs/Tutorial_Matrix_Profile_Top_Ten.ipynb
And then I can provide feedback.
If we need the data then we can post it as a comment directly in the PR:
I'm almost ready with the 3.3 notebook. As for 3.2, the original dataset is missing and I'm not familiar with the sound conversion techniques involved, so I skipped it. By the way, the 3.3 notebook is really short : )
It would be nice to add a tutorial(s) that reproduces the Matrix Profile Top Ten paper. The accompanying data at their Google sites page can be found here.
It might be best to make the individual top ten sections separate items (i.e., a sub-list) that roll up under one tutorial line item in RTD and can expand into ten sub-items.