BitFunnel / Workbench

Java and Lucene based tools for BitFunnel corpus preparation
http://bitfunnel.org
MIT License

Wikipedia extraction seems to be giving bigrams #3

Open MikeHopcroft opened 8 years ago

MikeHopcroft commented 8 years ago

Repro:

BitFunnel: 9e9e96ecb32841c53edc4542813ed1531fd4c4a9
Workbench: 580b74b421254f82348a811d7a886683c54c5a75

StatisticsBuilder c:\git\Wikipedia\Manifest100.txt c:\temp\wiki\out100 -statistics -text

Shouldn't have bigrams, shouldn't have capital letters:

Bigram where none expected (also capital letter):

72a2c4b53c781027,1,1,0.000144196,zephyrinus
bd01f0b68e57b2a7,1,1,0.000144196,sveshtari
3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry
50c9029d9d3c5378,1,1,0.000144196,darabont
a2f5153a7612c5d0,1,1,0.000144196,up─üsik─ü
3ca7b8a975b95d4d,1,1,0.000144196,crisplock

Capital letter:

49fc77672b6b54c4,1,0,0.000144196,Alexander Graham Bell
7d8b10a0a2b9f455,1,0,0.000144196,Evolutionarily stable strategy

Random garbage:

b651bc4fddcd84af,1,1,0.000144196,86p
6cc733ca24bc18e,1,1,0.000144196,ಹರಿವೆ
5847567b67dc03cb,1,1,0.000144196,xis
a31bc33fc17f3fc6,1,1,0.000144196,लाख
b3d2c5a33dd1efc6,1,1,0.000144196,k├╢nigsberger
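For anyone wanting to scan a larger output file for the first two problems, here is a rough sketch of a checker. It is not part of Workbench or BitFunnel; the class name `TermLineCheck` and its methods are hypothetical. It assumes each line has four comma-separated fields followed by the term text, which is only inferred from the sample lines above, and it treats any whitespace inside the term as a sign of an n-gram.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Workbench: flags term lines that
// look like n-grams or contain upper-case letters.
class TermLineCheck {

    // Assumption: the term text is everything after the fourth comma.
    static String termOf(String line) {
        String[] fields = line.split(",", 5);
        if (fields.length < 5) {
            throw new IllegalArgumentException("expected 5 fields: " + line);
        }
        return fields[4];
    }

    // Heuristic: a term with inner whitespace is more than one word.
    static boolean looksLikeNgram(String term) {
        return term.trim().chars().anyMatch(Character::isWhitespace);
    }

    // True if any code point in the term is an upper-case letter.
    static boolean hasUpperCase(String term) {
        return term.codePoints().anyMatch(Character::isUpperCase);
    }

    // Collects a description of every problem found on one line.
    static List<String> problems(String line) {
        String term = termOf(line);
        List<String> found = new ArrayList<>();
        if (looksLikeNgram(term)) {
            found.add("contains whitespace (possible n-gram)");
        }
        if (hasUpperCase(term)) {
            found.add("contains upper-case letter");
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample lines copied from the output above.
        String[] sample = {
            "72a2c4b53c781027,1,1,0.000144196,zephyrinus",
            "3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry",
            "49fc77672b6b54c4,1,0,0.000144196,Alexander Graham Bell"
        };
        for (String line : sample) {
            System.out.println(problems(line) + " <- " + line);
        }
    }
}
```

This does nothing about the third category, since what counts as garbage there depends on the intended tokenization and encoding.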

MikeHopcroft commented 8 years ago

This is probably because the Lucene analyzer wasn't being run in Workbench. See https://github.com/BitFunnel/Workbench/pull/6