MichaelPaulukonis / text-munger

Automatically exported from code.google.com/p/text-munger

XrayWordRule (tokenizer) in MarkovGenerator does not group non-word punctuation #2

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Neither of these makes me happy.
The XrayWordRule is retaining space chars, and contractions are split into 3 
separate tokens.
BLEAUGH

DefaultRule:

>? splitted
Count = 13
    [0]: "This,"
    [1]: "is"
    [2]: "a"
    [3]: "bloody"
    [4]: "@#$@#$@#$"
    [5]: "examople"
    [6]: "of"
    [7]: "\"oh,"
    [8]: "I"
    [9]: "don't"
    [10]: "know?\""
    [11]: "something......"
    [12]: "ese....indeed?"
>

XrayWordRule:

Count = 52
    [0]: "This"
    [1]: ","
    [2]: " "
    [3]: "is"
    [4]: " "
    [5]: "a"
    [6]: " "
    [7]: "bloody"
    [8]: " "
    [9]: "@"
    [10]: "#"
    [11]: "$"
    [12]: "@"
    [13]: "#"
    [14]: "$"
    [15]: "@"
    [16]: "#"
    [17]: "$"
    [18]: " "
    [19]: "examople"
    [20]: " "
    [21]: "of"
    [22]: " "
    [23]: "\""
    [24]: "oh"
    [25]: ","
    [26]: " "
    [27]: "I"
    [28]: " "
    [29]: "don"
    [30]: "'"
    [31]: "t"
    [32]: " "
    [33]: "know"
    [34]: "?"
    [35]: "\""
    [36]: " "
    [37]: "something"
    [38]: "."
    [39]: "."
    [40]: "."
    [41]: "."
    [42]: "."
    [43]: "."
    [44]: " "
    [45]: "ese"
    [46]: "."
    [47]: "."
    [48]: "."
    [49]: "."
    [50]: "indeed"
    [51]: "?"

What is the expected output? What do you see instead?
Consecutive punctuation following a word should be grouped into a single token.

In the following:
    [45]: "ese"
    [46]: "."
    [47]: "."
    [48]: "."
    [49]: "."
    [50]: "indeed"

There should be only three tokens:

    [45]: "ese"
    [46]: "...."
    [47]: "indeed"
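
For illustration only, here is a rough sketch of the grouping described above, written in Python rather than in the project's own code. The pattern, the `tokenize` name, and the sample sentence (rebuilt by joining the DefaultRule tokens with single spaces) are all assumptions, not the actual XrayWordRule implementation. It treats any run of non-word, non-space characters as one token, keeps apostrophes inside words, and drops whitespace.

```python
import re

# Hypothetical pattern, NOT the project's actual XrayWordRule:
#   \w+(?:'\w+)*   a word, keeping internal apostrophes ("don't")
#   [^\w\s]+       a run of consecutive non-word, non-space characters
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]+")


def tokenize(text):
    """Return words and grouped punctuation runs; whitespace is dropped."""
    return TOKEN_PATTERN.findall(text)


# Sample sentence reconstructed from the DefaultRule output (an assumption).
sample = (
    'This, is a bloody @#$@#$@#$ examople of '
    '"oh, I don\'t know?" something...... ese....indeed?'
)

tokens = tokenize(sample)
for index, token in enumerate(tokens):
    print(f"[{index}]: {token!r}")

print(tokenize("ese....indeed"))  # ['ese', '....', 'indeed']
```

One side effect of grouping every punctuation run: the `?` and closing `"` after `know` come out as the single token `?"`. Whether that is wanted depends on how the MarkovGenerator should treat quotes.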

Original issue reported on code.google.com by xraysmalevich on 7 May 2012 at 7:07