chuanconggao / PrefixSpan-py

The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.
https://git.io/prefixspan
MIT License
414 stars, 92 forks

Handling of --text does not work properly #24

Closed (johann-petrak closed 4 years ago)

johann-petrak commented 5 years ago

The way the word dictionary, and subsequently the inverted word dictionary, is created is broken.

Take this input file:

a b c
b c d
e f g b c
a f g
f b c

Running prefixspan-cli frequent 2 --closed --text test1.txt creates the following output:

e f g : 3
f g : 5
f g g : 2
f f g : 2

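This is obviously wrong: f g occurs in only two of the five input lines, yet it is reported with a support of 5, and b c, which occurs in four lines, is missing entirely.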
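As a quick sanity check, independent of the library, the supports can be recounted by brute force. This is only a minimal sketch: `is_subsequence` and `support` are hypothetical helper names, not part of PrefixSpan-py, and it assumes a line supports a pattern when the pattern's items appear in that order with gaps allowed.

```python
def is_subsequence(pattern, sequence):
    """Return True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)


def support(pattern, db):
    """Count how many sequences in `db` contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in db)


# The five lines of test1.txt, split on whitespace (an assumption about tokenization).
db = [line.split() for line in ["a b c", "b c d", "e f g b c", "a f g", "f b c"]]

print(support(["f", "g"], db))       # 2
print(support(["b", "c"], db))       # 4
print(support(["e", "f", "g"], db))  # 1
```

None of the reported counts match these numbers, which points at the item encoding rather than at the mining step.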

The reason is that the word dictionary created is: {'a': 0, 'b': 1, 'c': 2, 'd': 2, 'e': 0, 'f': 1, 'g': 2}. As you can see, the indices are not properly mapped to words: several words share the same index.

The inverted map is therefore also wrong: {0: 'e', 1: 'f', 2: 'g'}
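The collision can be reproduced with a small sketch. `position_based_index` is a hypothetical reconstruction that happens to yield exactly the dictionary above (each word gets the position at which it first appears within its line); it is an assumption about what the bug looks like, not the actual PrefixSpan-py code. `unique_index` shows one possible way to assign every distinct word its own index.

```python
lines = ["a b c", "b c d", "e f g b c", "a f g", "f b c"]
db = [line.split() for line in lines]


def position_based_index(db):
    """Hypothetical buggy mapping: index = position of the word's first occurrence in its line."""
    mapping = {}
    for seq in db:
        for pos, word in enumerate(seq):
            mapping.setdefault(word, pos)
    return mapping


def unique_index(db):
    """One possible fix: each new word gets the next unused index."""
    mapping = {}
    for seq in db:
        for word in seq:
            mapping.setdefault(word, len(mapping))
    return mapping


broken = position_based_index(db)
# Inverting a mapping with duplicate indices silently keeps only the last word per index.
broken_inverse = {index: word for word, index in broken.items()}

fixed = unique_index(db)
fixed_inverse = {index: word for word, index in fixed.items()}

print(broken)          # {'a': 0, 'b': 1, 'c': 2, 'd': 2, 'e': 0, 'f': 1, 'g': 2}
print(broken_inverse)  # {0: 'e', 1: 'f', 2: 'g'}
print(fixed)           # {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6}
```

With the colliding mapping, every line contains the encoded pair 1 2, which decodes to f g and would explain the reported f g : 5.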

kate-egorova commented 5 years ago

I think I've encountered a bug with the input too. I'm testing it on input like this:

i t e o r i m a s u h e ch i k e N o u m a sh o m o i m a s

and prefixspan-cli frequent 2 --minlen=5 --text test_t.txt gives me this output:

i e k m a : 2
i e k m a sh : 2
i e k m sh : 2
i e k a sh : 2
i e k u sh : 2
i e m a sh : 2
i k m a sh : 2
i m a s u : 2
ch e k m a : 2
ch e k m a sh : 2
ch e k m sh : 2
ch e k a sh : 2
ch e k u sh : 2
ch e m a sh : 2
ch k m a sh : 2
ch i m a sh : 2
e k m a sh : 2
e i m a sh : 2
k i m a s : 2
k i m a s u : 2
k i m a u : 2
k i m s u : 2
k i a s u : 2
k m a s u : 2

It looks like it treated the inputs as one line, but then ch e k m a sh shouldn't exist at all? I wonder if it's confused by 2-item elements like ch and sh?

supernova-eng commented 5 years ago

Given a test input of:

a b c d e
b e d a
a e b c a
b c e
c e b a
a e a e

And running prefixspan-cli frequent 1 --text exp-2-test.txt > out_xnew2.txt will result in something like this:

a : 5
a b : 2
a b c : 2
a b c d : 1
a b c d e : 1
a b c e : 1
a b c a : 1
a b d : 1
a b d e : 1
a b e : 1
a b a : 1
a c : 2
a c d : 1
a c d e : 1
a c e : 1
a c a : 1
a d : 1
a d e : 1
a e : 3
a e b : 1
a e b c : 1
a e b c a : 1
a e b a : 1
a e c : 1
a e c a : 1
a e a : 2
a e a e : 1
a e e : 1
a a : 2
a a e : 1
b : 5
b c : 3
b c d : 1
b c d e : 1
b c e : 2
b c a : 1
b d : 2
b d e : 1
b d a : 1
b e : 3
b e d : 1
b e d a : 1
b e a : 1
b a : 3
c : 4
c d : 1
c d e : 1
c e : 3
c e b : 1
c e b a : 1
c e a : 1
c a : 2
c b : 1
c b a : 1
d : 2
d e : 1
d a : 1
e : 6
e d : 1
e d a : 1
e a : 4
e a e : 1
e b : 2
e b c : 1
e b c a : 1
e b a : 2
e c : 1
e c a : 1
e e : 1

This does not seem to be the correct breakdown. @chuanconggao - could you please review https://github.com/chuanconggao/PrefixSpan-py/pull/25 and merge it, since it addresses the core problem here?

chuanconggao commented 4 years ago

Merged #25 to fix this. Please reopen if there are further issues.