BitFunnel / Workbench

Java and Lucene based tools for BitFunnel corpus preparation
http://bitfunnel.org
MIT License

Many n-grams in corpus #18

Open danluu opened 7 years ago

danluu commented 7 years ago

After processing Wikipedia with the fixes as of commit 274293f3af97c507416f6387020507ee99ca3238, the tail of the DocFreqTable contains a lot of n-grams:

724ddeaf8cb3c269,1,0,1.93455e-07,Vasilije Veljko Milovanović
e802585d5e004af1,1,0,1.93455e-07,2014 All-Arena Team
7c401744d5d61355,1,1,1.93455e-07,f.a.cortez
dafa24ba41b2a01d,1,0,1.93455e-07,Coeliades ramanatek
1a8055b58daaf330,1,0,1.93455e-07,Jeff Tobolski
adeb1f3f88d9bf92,1,1,1.93455e-07,shirt.turnfurlong
9dc6283de675270b,1,0,1.93455e-07,1978 Notre Dame Fighting Irish football team
5cc16879c0ad5653,1,1,1.93455e-07,shrambhushan
aea5e0ae16c34325,1,1,1.93455e-07,ca1703286
ce7ac1e3fa0fa95b,1,0,1.93455e-07,Hyperthaema sordida
bbd646c18643abf0,1,1,1.93455e-07,yelkhovoozersky
895697f8c748363f,1,0,1.93455e-07,Alashkert Stadium
d5ddbbd6281b2f91,1,0,1.93455e-07,Crédito Predial Português
7a18bab66de2a784,1,0,1.93455e-07,List of wars involving Iraqi Kurdistan
71000fa2b784fbb1,1,1,1.93455e-07,alox12p
6c47ffa2419cebfc,1,0,1.93455e-07,Republican Social Party of French Reconciliation
...
91ea6c89333d46fe,1,0,1.93455e-07,Janet Jackson filmography
596acddb187d2224,1,1,1.93455e-07,bingobingo
7f4e295958f0d3ad,1,0,1.93455e-07,Nawal al-Hawsawi
2b3c46a61d6a01,1,1,1.93455e-07,arachidconic
a99158398732ad89,1,0,1.93455e-07,Sâne Morte
a07381d76998301,1,1,1.93455e-07,blind.net
5252fcd8785074c4,1,1,1.93455e-07,aettn
2467896c19e4ae96,1,0,1.93455e-07,Montsweag Bay
9e49735fc54b7c76,1,0,1.93455e-07,"Friedrich Günther, Prince of Schwarzburg-Rudolstadt"
4c46bc6fc4ce2549,1,0,1.93455e-07,Herman Riddle Page
79ba22ae9e1cfc9c,1,1,1.93455e-07,689368
d50cfa30c7b6357a,1,0,1.93455e-07,McDonald's African American Heritage Series
f237df08253dc88,1,0,1.93455e-07,Ireland at the 2000 Summer Olympics
dfa3155bb84d1397,1,0,1.93455e-07,Preferential creditor
eef202c76f008699,1,0,1.93455e-07,Virgilio Tosi
577e733f140b86b2,1,1,1.93455e-07,agents13_en.html
b96d6bb3dd1da702,1,0,1.93455e-07,Ravi Gossain
d70665fd53174abe,1,1,1.93455e-07,mcdean's
92f39c417e294d1f,1,1,1.93455e-07,sonnenritter
558a9b05e72d7319,1,0,1.93455e-07,Althaea
5406174c6bc23256,1,1,1.93455e-07,ouleida
8542aa4e4d48249f,1,0,1.93455e-07,"John Savile, 4th Earl of Mexborough"
5274284f94fffeb6,1,0,1.93455e-07,Sue Timney
62d049ebf69b2705,1,1,1.93455e-07,commercially.it
35fec685ff1011ea,1,1,1.93455e-07,comfia
565d59d3b90b8fee,1,0,1.93455e-07,List of Banksia species
51120ae38af4f54b,1,1,1.93455e-07,spökprästgården

danluu commented 7 years ago

Of the 10677410 terms that appear once, 4512265 (or 42%) are n-grams. This is out of 14643204 total terms.
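
For anyone who wants to reproduce this kind of count, here is a rough sketch. It assumes the DocFreqTable is comma-separated with the term text in the last field (as in the excerpt above) and uses "contains whitespace" as a stand-in for "is an n-gram". That heuristic is an assumption, not how the numbers above were produced, and it misses single-word entries such as `Althaea`. The function name `count_multiword_terms` is made up for illustration.

```python
import csv
import io


def count_multiword_terms(lines):
    """Return (total_rows, rows whose term text contains whitespace).

    Assumption: the term text is the last of five comma-separated fields.
    The csv module handles quoted terms that contain commas.
    """
    total = 0
    multiword = 0
    for row in csv.reader(lines):
        if len(row) < 5:
            continue  # skip blank or elided lines such as "..."
        total += 1
        term_text = row[-1].strip()
        if any(ch.isspace() for ch in term_text):
            multiword += 1
    return total, multiword


# A few rows copied from the excerpt above.
sample = io.StringIO(
    '724ddeaf8cb3c269,1,0,1.93455e-07,Vasilije Veljko Milovanović\n'
    '7c401744d5d61355,1,1,1.93455e-07,f.a.cortez\n'
    '9e49735fc54b7c76,1,0,1.93455e-07,'
    '"Friedrich Günther, Prince of Schwarzburg-Rudolstadt"\n'
    '...\n'
    '558a9b05e72d7319,1,0,1.93455e-07,Althaea\n'
)
total, multiword = count_multiword_terms(sample)
print(total, multiword)

# The percentage quoted above, from the reported counts.
singletons = 10677410
singleton_ngrams = 4512265
ngram_share = singleton_ngrams / singletons
print(f"{ngram_share:.1%}")
```

On the sample rows this counts 2 multi-word terms out of 4, and the reported counts work out to roughly 42%.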

danluu commented 7 years ago

Note that the n-grams don't seem to be downcased.
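
A quick way to spot-check the casing, sketched under the assumption that "downcased" means the term equals its own lowercase form. Python's `str.lower()` is used here as an approximation; it may differ from whatever lowercasing the Lucene analyzer applies in edge cases. `is_downcased` is a hypothetical helper, not part of Workbench.

```python
def is_downcased(term):
    """True if lowercasing the term leaves it unchanged."""
    return term == term.lower()


# Terms copied from the DocFreqTable excerpt above.
terms = ["Montsweag Bay", "bingobingo", "spökprästgården", "Sâne Morte", "689368"]
not_downcased = [t for t in terms if not is_downcased(t)]
print(not_downcased)
```

On these sample terms, only the two multi-word entries are flagged, which matches the observation above.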