chuanconggao / PrefixSpan-py

The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.
https://git.io/prefixspan
MIT License
414 stars, 92 forks

Algorithm outputs a series of repeated items but there are none in the training data #11

Closed: ghost closed this issue 5 years ago

ghost commented 6 years ago

Hello,

I have noticed a behaviour that seems a bit strange to me. I ran the algorithm on a set of sequences with no consecutively repeated items, i.e. an item never appears again immediately after itself, as 1 does in the sequence [3, 2, 1, 1, 5, 7, 2].

When I generated the most frequent sequences, though, I obtained patterns with repeated items. Is that possible?

For example, given the code:

    import numpy as np
    from prefixspan import PrefixSpan

    seqs = [[22, 16], [22, 21], [22, 16, 14, 20], [22, 16], [22, 16, 34, 24, 26, 24, 26, 14, 13], [22, 16], [22, 26], [22, 13, 34], [22, 16], [22, 21, 16]]

    ps = PrefixSpan(seqs)
    ps.minlen = 2
    ps.maxlen = 10

    # Convert the relative support threshold to an absolute count.
    freq_ratio = 0.1
    freq = np.ceil(freq_ratio * len(seqs)).astype(int)

    res = ps.frequent(freq)

The output includes the pattern [26, 26, 14, 13].

This is just a small reproducible example. In my case the dataset has ~1000 sequences, and the problem is still there.

Thanks

chuanconggao commented 6 years ago

Hi, your relative support threshold is 0.1. Thus, your absolute support threshold is 1 for your input of 10 sequences.

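With an absolute threshold of 1, every subsequence of every input sequence counts as frequent, and gaps are allowed between the items of a pattern. So it will generate all possible subsequences, including those that skip over items in between.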
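To make the point above concrete, here is a minimal sketch in plain Python that counts how many input sequences contain a pattern when gaps are allowed. The helper names `is_subsequence` and `support` are illustrative assumptions and are not part of PrefixSpan-py; the data is the example from the original post.

```python
def is_subsequence(pattern, seq):
    """Return True if pattern occurs in seq in order, with gaps allowed."""
    it = iter(seq)
    # `item in it` consumes the iterator up to the match,
    # so later items must be found after earlier ones.
    return all(item in it for item in pattern)


def support(pattern, seqs):
    """Count the sequences that contain pattern as a gapped subsequence."""
    return sum(is_subsequence(pattern, s) for s in seqs)


seqs = [
    [22, 16], [22, 21], [22, 16, 14, 20], [22, 16],
    [22, 16, 34, 24, 26, 24, 26, 14, 13],
    [22, 16], [22, 26], [22, 13, 34], [22, 16], [22, 21, 16],
]

# The fifth sequence contains 26 ... 26, 14, 13 in that order.
print(support([26, 26, 14, 13], seqs))  # 1
```

Since the pattern has support 1 and the threshold is 1, it qualifies as frequent even though the two 26s are never adjacent in the input.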

On Wed, Nov 21, 2018 at 6:30 AM marcwell notifications@github.com wrote: [quoted copy of the original post]

ghost commented 6 years ago

Thank you for your answer! I only generated a small example here so that it would fit in a post, but the same thing happens with the set of ~1000 sequences I'm analysing. For example:

[22, 30, 30] with support 156 (13.3%)

Is this normal?

chuanconggao commented 6 years ago

I really can't tell from the description alone. Can you provide a tiny sample?

ghost commented 6 years ago

I have attached a file with some example sequences. It does not contain sequences with repeated items (i.e. where the same number appears once and then immediately again) but in the output I obtain, for example:

(156, [22, 30, 30])

Thanks for your help

Attached file: seqs.txt

chuanconggao commented 5 years ago

Hi, you seem to misunderstand the concept of a pattern.

For example, for one of your provided sequences, [22, 1, 30, 1, 24, 30], the pattern [22, 30, 30] IS a subsequence of this sequence. A pattern is allowed to have other items in between its items.
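The distinction can be checked by hand. The sketch below is plain Python, with the hypothetical helpers `find_embedding` and `occurs_contiguously` (neither is part of PrefixSpan-py). It shows where [22, 30, 30] matches inside the sequence when gaps are allowed, and that it does not match when the items must be adjacent.

```python
def find_embedding(pattern, seq):
    """Return the indices in seq that match pattern in order (gaps allowed), or None."""
    positions = []
    start = 0
    for item in pattern:
        try:
            idx = seq.index(item, start)  # leftmost match at or after `start`
        except ValueError:
            return None
        positions.append(idx)
        start = idx + 1
    return positions


def occurs_contiguously(pattern, seq):
    """Return True only if pattern appears in seq with no gaps."""
    n = len(pattern)
    return any(seq[i:i + n] == pattern for i in range(len(seq) - n + 1))


seq = [22, 1, 30, 1, 24, 30]
pattern = [22, 30, 30]

print(find_embedding(pattern, seq))       # [0, 2, 5]
print(occurs_contiguously(pattern, seq))  # False
```

If only gap-free occurrences are of interest, one possible approach (an assumption here, not something the thread confirms) is to post-filter the mined patterns with a check like `occurs_contiguously` and recount their support.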