benibaeumle / FSS-Algorithm

A Python NumPy implementation of the shapelet selection algorithm from the paper Ji et al., "A Fast Shapelet Selection Algorithm for Time Series Classification".
MIT License

Hello, could you tell me how to calculate this value, std_split ? #1

Open Mr-Wu-H opened 2 years ago

Mr-Wu-H commented 2 years ago

std_split : float the standard deviation from the mean to subdivide the time series of a class into subclasses.

benibaeumle commented 2 years ago

Hello, in their paper the authors just write: "Next, we calculate the standard deviation value of these adjacent discrepancies. At last, we can separate the data into subclasses by splitting at the sequence that has difference larger than half of the computed standard deviation." Unfortunately, I have no well-founded argument for how best to choose a value for std_split. The obvious effect is that with higher values you sample fewer time series, because more and more distinct time series get packed into the same subclass. Visualizing the adjacent discrepancies along with different split values might give you an indication.
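One way to explore this is to count how many subclasses different split values produce. The following is only a minimal sketch, not code from this repository: the helper `count_subclasses` and the toy distances are hypothetical, and it assumes that std_split acts as a multiplier on the standard deviation of the adjacent discrepancies.

```python
import numpy as np


def count_subclasses(sorted_distances, std_split):
    """Count the subclasses obtained by splitting at every adjacent gap
    that exceeds std_split * std(gaps).

    Hypothetical helper: assumes std_split multiplies the standard
    deviation of the adjacent discrepancies.
    """
    gaps = np.diff(sorted_distances)       # adjacent discrepancies
    threshold = std_split * np.std(gaps)   # split threshold
    return int(np.sum(gaps > threshold)) + 1


# Made-up, already sorted distances with one obvious jump between 2 and 10.
distances = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])

for std_split in (0.25, 0.5, 1.0, 1.5, 3.0):
    print(std_split, count_subclasses(distances, std_split))
```

Because the threshold grows with std_split, the number of subclasses can only stay the same or shrink as std_split increases, which matches the effect described above.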

Mr-Wu-H commented 2 years ago

Thank you very much for your reply.

------------------ Original message ------------------ From: "benibaeumle/FSS-Algorithm"; Sent: Wednesday, April 13, 2022, 5:32 PM; Subject: Re: [benibaeumle/FSS-Algorithm] Hello, could you tell me how to calculate this value, std_split ? (Issue #1)


Mr-Wu-H commented 2 years ago

Hello, does the sentence "we can separate the data into subclasses by splitting at the sequence that has difference larger than half of the computed standard deviation" mean that we remove the part greater than the standard deviation and use the remaining time series as the sample time series? Am I right?

Mr-Wu-H commented 2 years ago

Hello, this method FastShapeletCandidates is to get shapelet candidates of one class, right?

benibaeumle commented 2 years ago

Hello, does the sentence "we can separate the data into subclasses by splitting at the sequence that has difference larger than half of the computed standard deviation" mean that we remove the part greater than the standard deviation and use the remaining time series as the sample time series? Am I right?

I am not sure I understand you correctly. Please see chapter 3.1 of the paper for how this particular step is computed (I do not have LaTeX support when answering here, so looking at the paper should be more comfortable for you). In words, what is computed is:

  1. Calculate the sum of the time steps of each time series
  2. Calculate the mean over the sums
  3. Select the time series which is closest to the mean over the sums
  4. Calculate the Euclidean distances of each time series to the time series we selected in 3. and sort the resulting list of distances
  5. Calculate the standard deviation of the differences between each pair of neighboring distances
  6. Now, for each pair in the sorted list of distances we check if the difference is larger than 1.5x the standard deviation we calculated in 5.
  7. If the difference is larger than that threshold, we consider the time series between the last split point and the current split point as a subclass.
  8. Repeat 6. and 7. until we iterated over all neighboring distance pairs

The result after computing the 8 steps above is the set of subclasses.
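The steps above can be sketched in NumPy roughly as follows. This is only an illustration under stated assumptions, not the code of this repository: the function name `split_into_subclasses` is hypothetical, the input is assumed to be a 2D array holding the equal-length time series of one class, and std_split is assumed to be the multiplier (1.5 in the description above) applied to the standard deviation.

```python
import numpy as np


def split_into_subclasses(X, std_split=1.5):
    """Split the time series of one class into subclasses.

    X: array of shape (n_series, n_timesteps), all from the same class.
    Returns a list of index arrays, one per subclass.
    Hypothetical sketch; std_split is assumed to be a multiplier.
    """
    X = np.asarray(X, dtype=float)
    if len(X) < 2:
        return [np.arange(len(X))]

    sums = X.sum(axis=1)                         # step 1: sum per series
    mean_sum = sums.mean()                       # step 2: mean over the sums
    ref = np.argmin(np.abs(sums - mean_sum))     # step 3: series closest to the mean

    dists = np.linalg.norm(X - X[ref], axis=1)   # step 4: distances to the reference
    order = np.argsort(dists)                    #         and their sorted order

    gaps = np.diff(dists[order])                 # differences of neighboring distances
    threshold = std_split * gaps.std()           # step 5: scaled standard deviation

    # steps 6-8: every gap above the threshold closes the current subclass
    split_points = np.flatnonzero(gaps > threshold) + 1
    return np.split(order, split_points)


# Toy data: two well-separated groups of three short series each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
for indices in split_into_subclasses(X):
    print(sorted(indices.tolist()))
```

In this sketch, the "last split point" and the "current split point" are simply two consecutive entries of `split_points`: the series whose sorted positions lie between them form one subclass.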

benibaeumle commented 2 years ago

Hello, this method FastShapeletCandidates is to get shapelet candidates of one class, right?

Yes.

Mr-Wu-H commented 2 years ago

Thanks a lot.

Mr-Wu-H commented 2 years ago


Hello, how should I understand the last split point and the current split point in step 7?

benibaeumle commented 2 years ago

See here.

Mr-Wu-H commented 2 years ago

Thanks a lot. In your demo, the data set fordA_sample generates 6300 shapelets. Do you know how to remove shapelets that may overlap, to reduce the time complexity?