johann-petrak closed this issue 4 years ago
I think I've run into a bug with the input too. I'm testing it on input like this:
i t e o r i m a s u h e ch i k e N o u m a sh o m o i m a s
and prefixspan-cli frequent 2 --minlen=5 --text test_t.txt
gives me this output:
i e k m a : 2
i e k m a sh : 2
i e k m sh : 2
i e k a sh : 2
i e k u sh : 2
i e m a sh : 2
i k m a sh : 2
i m a s u : 2
ch e k m a : 2
ch e k m a sh : 2
ch e k m sh : 2
ch e k a sh : 2
ch e k u sh : 2
ch e m a sh : 2
ch k m a sh : 2
ch i m a sh : 2
e k m a sh : 2
e i m a sh : 2
k i m a s : 2
k i m a s u : 2
k i m a u : 2
k i m s u : 2
k i a s u : 2
k m a s u : 2
It looks like it treated the input as one line, but even then ch e k m a sh shouldn't exist at all. I wonder if it's confused by two-character items like ch and sh?
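For what it's worth, here is a small sketch (not prefixspan-cli's own code) for checking a reported pattern by hand. It assumes items are separated by whitespace, so ch and sh stay single items, and it treats the input shown above as one flat token list. Any line of the real file is a contiguous slice of that list, so a pattern that is not a subsequence of the whole list cannot be a subsequence of any single line either. The helper name is_subsequence is made up for illustration.

```python
def is_subsequence(pattern, sequence):
    """Return True if the items of pattern occur in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # "item in remaining" advances the iterator until it finds item,
    # so each pattern item must appear after the previous match.
    return all(item in remaining for item in pattern)


# The test input from this comment, pasted as a single line (assumption:
# the real file may contain line breaks that are not visible here).
flat_input = "i t e o r i m a s u h e ch i k e N o u m a sh o m o i m a s"
tokens = flat_input.split()  # whitespace split keeps "ch" and "sh" intact

for reported in ["ch e k m a sh", "i m a s u"]:
    print(reported, "->", is_subsequence(reported.split(), tokens))
```

If a reported pattern comes out as False here, it cannot come from the input no matter how the lines were split, which would point at the item mapping rather than at line handling.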
Given a test input of:
a b c d e
b e d a
a e b c a
b c e
c e b a
a e a e
And running prefixspan-cli frequent 1 --text exp-2-test.txt > out_xnew2.txt
will result in something like this:
a : 5
a b : 2
a b c : 2
a b c d : 1
a b c d e : 1
a b c e : 1
a b c a : 1
a b d : 1
a b d e : 1
a b e : 1
a b a : 1
a c : 2
a c d : 1
a c d e : 1
a c e : 1
a c a : 1
a d : 1
a d e : 1
a e : 3
a e b : 1
a e b c : 1
a e b c a : 1
a e b a : 1
a e c : 1
a e c a : 1
a e a : 2
a e a e : 1
a e e : 1
a a : 2
a a e : 1
b : 5
b c : 3
b c d : 1
b c d e : 1
b c e : 2
b c a : 1
b d : 2
b d e : 1
b d a : 1
b e : 3
b e d : 1
b e d a : 1
b e a : 1
b a : 3
c : 4
c d : 1
c d e : 1
c e : 3
c e b : 1
c e b a : 1
c e a : 1
c a : 2
c b : 1
c b a : 1
d : 2
d e : 1
d a : 1
e : 6
e d : 1
e d a : 1
e a : 4
e a e : 1
e b : 2
e b c : 1
e b c a : 1
e b a : 2
e c : 1
e c a : 1
e e : 1
This does not seem to be the correct breakdown. @chuanconggao - could you please review https://github.com/chuanconggao/PrefixSpan-py/pull/25 and merge it, since it addresses the core problem here?
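To cross-check counts like these independently of the library, a brute-force sketch can help. It assumes that the number after the colon is the count of input lines that contain the pattern as a subsequence (items in order, gaps allowed), and it only suits tiny inputs because it enumerates every subsequence of every line. The function names are hypothetical and this is not how PrefixSpan itself works.

```python
from collections import Counter
from itertools import combinations


def distinct_subsequences(seq):
    """All distinct non-empty subsequences of seq, as tuples."""
    return {
        tuple(seq[i] for i in idx)  # indices are increasing, so order is kept
        for n in range(1, len(seq) + 1)
        for idx in combinations(range(len(seq)), n)
    }


def brute_force_support(db):
    """Count, for every pattern, how many sequences in db contain it."""
    counts = Counter()
    for seq in db:
        # A set per sequence ensures each sequence counts at most once per pattern.
        counts.update(distinct_subsequences(seq))
    return counts


lines = ["a b c d e", "b e d a", "a e b c a", "b c e", "c e b a", "a e a e"]
db = [line.split() for line in lines]
counts = brute_force_support(db)

# Print a few patterns in the same "items : count" format as the CLI output.
for pattern in [("a",), ("a", "b"), ("e", "a")]:
    print(" ".join(pattern), ":", counts[pattern])
```

Comparing the printed lines against the corresponding lines of the CLI output shows which patterns, if any, disagree.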
Merged #25 to fix this. Please reopen if there are further issues.
The way the word dictionary, and subsequently the inverted word dictionary, is created is broken.
Take this input file:
Running
prefixspan-cli frequent 2 --closed --text test1.txt
creates the following output:
This is obviously wrong.
The reason is that the word dictionary created is:
{'a': 0, 'b': 1, 'c': 2, 'd': 2, 'e': 0, 'f': 1, 'g': 2}
As you can see, the indices are not properly mapped to words and there are duplicate indices. The inverted map is therefore also wrong: {0: 'e', 1: 'f', 2: 'g'}
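To illustrate why duplicate indices break the inverse, here is a minimal sketch. It is not the code from the repository or from the fix; build_word_map and invert_checked are hypothetical names, and the two example sequences are an assumed grouping of the words a to g, not the actual contents of test1.txt.

```python
def build_word_map(sequences):
    """Assign each distinct word the next unused index, in order of first appearance."""
    word_to_id = {}
    for seq in sequences:
        for word in seq:
            # len(word_to_id) is evaluated before insertion, so a new word
            # gets a fresh index and a known word keeps its existing one.
            word_to_id.setdefault(word, len(word_to_id))
    return word_to_id


def invert_checked(mapping):
    """Invert a word-to-index map, refusing maps with duplicate indices."""
    inverted = {index: word for word, index in mapping.items()}
    if len(inverted) != len(mapping):
        raise ValueError("duplicate indices: the map cannot be inverted")
    return inverted


# The broken dictionary quoted above: inverting it naively keeps only
# the last word seen for each index.
broken = {'a': 0, 'b': 1, 'c': 2, 'd': 2, 'e': 0, 'f': 1, 'g': 2}
naive_inverse = {index: word for word, index in broken.items()}
print(naive_inverse)

# Hypothetical input sequences, used only to show a one-to-one mapping.
fixed = build_word_map([["a", "b", "c", "d"], ["e", "f", "g"]])
print(fixed)
print(invert_checked(fixed))
```

With a one-to-one mapping, every index in a mined pattern translates back to exactly one word, which is what the output step needs.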