Open GoogleCodeExporter opened 9 years ago
Neither of these makes me happy. The XrayWordRule retains space chars, and contractions are split into 3 separate tokens. BLEAUGH

DefaultRule:

    >? splitted
    Count = 13
    [0]: "This,"
    [1]: "is"
    [2]: "a"
    [3]: "bloody"
    [4]: "@#$@#$@#$"
    [5]: "examople"
    [6]: "of"
    [7]: "\"oh,"
    [8]: "I"
    [9]: "don't"
    [10]: "know?\""
    [11]: "something......"
    [12]: "ese....indeed?"

XrayWordRule:

    Count = 52
    [0]: "This"
    [1]: ","
    [2]: " "
    [3]: "is"
    [4]: " "
    [5]: "a"
    [6]: " "
    [7]: "bloody"
    [8]: " "
    [9]: "@"
    [10]: "#"
    [11]: "$"
    [12]: "@"
    [13]: "#"
    [14]: "$"
    [15]: "@"
    [16]: "#"
    [17]: "$"
    [18]: " "
    [19]: "examople"
    [20]: " "
    [21]: "of"
    [22]: " "
    [23]: "\""
    [24]: "oh"
    [25]: ","
    [26]: " "
    [27]: "I"
    [28]: " "
    [29]: "don"
    [30]: "'"
    [31]: "t"
    [32]: " "
    [33]: "know"
    [34]: "?"
    [35]: "\""
    [36]: " "
    [37]: "something"
    [38]: "."
    [39]: "."
    [40]: "."
    [41]: "."
    [42]: "."
    [43]: "."
    [44]: " "
    [45]: "ese"
    [46]: "."
    [47]: "."
    [48]: "."
    [49]: "."
    [50]: "indeed"
    [51]: "?"

What is the expected output? What do you see instead?

Punctuation following a word should be grouped into a single token. In the following:

    [45]: "ese"
    [46]: "."
    [47]: "."
    [48]: "."
    [49]: "."
    [50]: "indeed"

there should be only three tokens:

    [45]: "ese"
    [46]: "...."
    [47]: "indeed"
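To make the expected behavior concrete, here is a rough sketch of the splitting I have in mind. It is written in Python with a plain regex purely for illustration; it is not the project's code, and the `tokenize` function and `TOKEN_PATTERN` name are hypothetical. It assumes three things: whitespace is dropped, an apostrophe between word characters stays inside the word (so contractions are one token), and any run of adjacent punctuation becomes one token. The last point goes slightly beyond what is stated above, since it also merges mixed runs such as `?"`.

```python
import re

# Hypothetical pattern, not taken from the project:
#   \w+(?:'\w+)*  a word, optionally with apostrophe-joined parts ("don't")
#   [^\w\s]+      a run of one or more punctuation characters ("....")
# Whitespace matches neither branch, so it is never emitted as a token.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]+")


def tokenize(text):
    """Split text into word tokens and grouped punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    sample = ('This, is a bloody @#$@#$@#$ examople of '
              '"oh, I don\'t know?" something...... ese....indeed?')
    for index, token in enumerate(tokenize(sample)):
        print(f"[{index}]: {token!r}")
```

With this sketch, `tokenize("ese....indeed")` gives `['ese', '....', 'indeed']`, the three tokens asked for above, and `don't` comes back as a single token with no space tokens in the result.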
Original issue reported on code.google.com by xraysmalevich on 7 May 2012 at 7:07