Open MikeHopcroft opened 8 years ago
Repro:
BitFunnel: 9e9e96ecb32841c53edc4542813ed1531fd4c4a9
Workbench: 580b74b421254f82348a811d7a886683c54c5a75
StatisticsBuilder c:\git\Wikipedia\Manifest100.txt c:\temp\wiki\out100 -statistics -text
Shouldn't have bigrams, shouldn't have capital letters:
Bigram where none expected (also capital letter):
72a2c4b53c781027,1,1,0.000144196,zephyrinus
bd01f0b68e57b2a7,1,1,0.000144196,sveshtari
3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry
50c9029d9d3c5378,1,1,0.000144196,darabont
a2f5153a7612c5d0,1,1,0.000144196,up─üsik─ü
3ca7b8a975b95d4d,1,1,0.000144196,crisplock
Capital letter:
49fc77672b6b54c4,1,0,0.000144196,Alexander Graham Bell
7d8b10a0a2b9f455,1,0,0.000144196,Evolutionarily stable strategy
Random garbage:
b651bc4fddcd84af,1,1,0.000144196,86p
6cc733ca24bc18e,1,1,0.000144196,ಹರಿವೆ
5847567b67dc03cb,1,1,0.000144196,xis
a31bc33fc17f3fc6,1,1,0.000144196,लाख
b3d2c5a33dd1efc6,1,1,0.000144196,k├╢nigsberger
This is probably due to the problem where the Lucene analyzer wasn't run in Workbench. https://github.com/BitFunnel/Workbench/pull/6
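To check a regenerated statistics file for the first two symptoms, a small script along these lines might help. It is only a sketch: it assumes each record has five comma-separated fields with the term text last, which is inferred from the sample lines above and not from any documented format. The function name `find_suspect_terms` is made up for illustration.

```python
def find_suspect_terms(lines):
    """Yield (reason, line) for records whose term text looks unanalyzed."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Assumption: five comma-separated fields, term text in the last one.
        fields = line.split(",", 4)
        if len(fields) < 5:
            continue
        text = fields[4]
        reasons = []
        if " " in text:
            # More than one word, e.g. a bigram where only unigrams are expected.
            reasons.append("multi-word")
        if any(ch.isupper() for ch in text):
            reasons.append("capital letter")
        if reasons:
            yield "+".join(reasons), line


# Records copied from the output quoted in this issue.
sample = [
    "72a2c4b53c781027,1,1,0.000144196,zephyrinus",
    "3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry",
    "49fc77672b6b54c4,1,0,0.000144196,Alexander Graham Bell",
    "3ca7b8a975b95d4d,1,1,0.000144196,crisplock",
]
flagged = list(find_suspect_terms(sample))
for reason, line in flagged:
    print(reason, line)
```

This does not try to detect the "random garbage" terms, since there is no obvious rule for those.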