BitFunnel / Workbench

Java and Lucene based tools for BitFunnel corpus preparation
http://bitfunnel.org
MIT License

Many terms have underscores in them #9

Open danluu opened 7 years ago

danluu commented 7 years ago

For example:

6d6b8015505c7099,1,1,4.61273e-07,2c_thrissur
3e8f9e5769458e9f,1,1,4.61273e-07,government_medical_college

We also have terms with double underscores that appear to be some kind of metadata?

868661c0426526a7,1,1,0.000557102,__noeditsection__
a135c90cbb896da0,1,1,2.97521e-05,__notoc__
14a64ebade034c85,1,1,3.11359e-06,__nogallery__

As well as weird terms that have even more underscores:

b614cd7474e25139,1,1,5.76591e-07,f___
22ed3514efd6df2a,1,1,4.61273e-07,o___y
3ea21b09f892bac0,1,1,4.61273e-07,mother______
c4767d3137687cf6,1,1,9.6262e-07,i_______________________________________
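
To get a rough count of how many terms fall into each of these groups, something like the following could be run over the term table. This is only a sketch: it assumes the term is the last comma-separated field and contains no commas itself, and the category names (`markup`, `joined`, `other`) are made up here for illustration.

```python
import re

# Hypothetical categories; the patterns are guesses based on the rows above.
MARKUP = re.compile(r"^__[a-z]+__$")               # e.g. __notoc__
JOINED = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")   # e.g. government_medical_college


def classify(term):
    """Put a term into one of the rough buckets described in this issue."""
    if "_" not in term:
        return "plain"
    if MARKUP.match(term):
        return "markup"
    if JOINED.match(term):
        return "joined"
    return "other"  # runs of underscores, e.g. f___ or mother______


def underscore_terms(lines):
    """Yield (term, category) for every row whose term contains an underscore."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the term is the last field and has no embedded commas.
        term = line.rsplit(",", 1)[-1]
        category = classify(term)
        if category != "plain":
            yield term, category


rows = [
    "6d6b8015505c7099,1,1,4.61273e-07,2c_thrissur",
    "868661c0426526a7,1,1,0.000557102,__noeditsection__",
    "3ea21b09f892bac0,1,1,4.61273e-07,mother______",
]
for term, category in underscore_terms(rows):
    print(category, term)
```

Counting the categories over a whole chunk would show whether the odd cases are rare enough to ignore.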

danluu commented 7 years ago

If you want examples from chunked1, it has:

3b15cf09a2fde054,1,1,6.59631e-05,______next
664c5c0a691d85f4,1,1,6.59631e-05,x__x
b1863a3a0b343641,1,1,6.59631e-05,20__

MikeHopcroft commented 7 years ago

"______next" is actually on the page: https://en.wikipedia.org/?curid=11139
"x__x" is on the page https://en.wikipedia.org/?curid=24782
"20__" is on the page https://en.wikipedia.org/?curid=21481
"f___" is on the page https://en.wikipedia.org/?curid=83530

The above examples seem to be correctly extracted.

"2c_thrissur" seems to come from a "%2c" (a percent-encoded comma) in a URL. See, for example, https://en.wikipedia.org/wiki/File:GovemedcollegethrissurDistricthosp.JPG. The question here is whether URLs should be indexed. It seems that the word breaker split the word just before the "%2c", dropping the "%" and leaving "2c" attached to "_thrissur".
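
To illustrate how a term like this could arise, here is a minimal sketch. The `naive_break` function is a stand-in that splits on non-word characters (and treats `_` as a word character); it is not the actual word breaker used by Workbench or Lucene. The URL fragment is hypothetical, chosen only because it would produce both terms from the first comment.

```python
import re
from urllib.parse import unquote


def naive_break(text):
    """Stand-in word breaker: lowercase, then split on non-word characters.

    In Python's regex syntax, \\w matches letters, digits, and '_', so
    underscores stay inside a token while '%' ends one.
    """
    return re.findall(r"\w+", text.lower())


def decode_then_break(text):
    """One possible alternative: percent-decode and treat '_' as a space first."""
    decoded = unquote(text).replace("_", " ")
    return naive_break(decoded)


# Hypothetical URL fragment, not taken from the corpus.
raw = "Government_Medical_College%2C_Thrissur"

print(naive_break(raw))        # the '%' splits the token, leaving '2c_thrissur'
print(decode_then_break(raw))  # individual words, no '2c' residue
```

Whether to decode, split on underscores, or skip URLs entirely is the open question; this only shows that the observed terms are consistent with splitting at the "%".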

"__noeditsection__" matches documents 49483 and 72692 (I only searched the first two chunks). The term seems to be Wikipedia markup. See, for example, the source for https://en.wikipedia.org/?curid=49483 at https://en.wikipedia.org/w/index.php?title=Wikipedia:Ignore_all_rules&action=edit

In this case, it is debatable whether this Wikipedia markup should be extracted.
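
If we decide not to index this markup, one option would be to strip it from the wikitext before word breaking. The sketch below assumes the markup always has the form of uppercase letters wrapped in double underscores (as in `__NOTOC__`); the pattern is a simplification and does not check against MediaWiki's actual list of switches. The function name is hypothetical.

```python
import re

# Assumption: switches look like __UPPERCASE__ in the raw wikitext.
BEHAVIOR_SWITCH = re.compile(r"__[A-Z]+__")


def strip_behavior_switches(wikitext):
    """Replace each __SWITCH__ with a space so neighbouring words stay separate."""
    return BEHAVIOR_SWITCH.sub(" ", wikitext)


sample = "__NOTOC__ __NOEDITSECTION__\nIgnore all rules. ______next x__x"
cleaned = strip_behavior_switches(sample)
print(cleaned.split())
```

Text that really is on the page, such as "______next" and "x__x", does not match the pattern and is left alone.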