danluu opened this issue 7 years ago
We also have terms with double underscores that appear to be some kind of metadata, as well as weird terms with even more underscores. For example, chunked1 contains:
```
3b15cf09a2fde054,1,1,6.59631e-05,______next
664c5c0a691d85f4,1,1,6.59631e-05,x__x
b1863a3a0b343641,1,1,6.59631e-05,20__
```
"______next" is actually on the page: https://en.wikipedia.org/?curid=11139 "x__x" is in page https://en.wikipedia.org/?curid=24782 "20__" is in page https://en.wikipedia.org/?curid=21481 "f___" is in page https://en.wikipedia.org/?curid=83530
The above examples seem to be correctly extracted.
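For reference, here is roughly how one can check claims like the above: fetch the page by curid and look for the term in its text. This is a minimal sketch; the case-insensitive substring match is my assumption about what counts as "on the page":

```python
# Sketch: check whether an extracted term literally appears on the
# Wikipedia page with a given curid. The case-insensitive substring
# match is an assumption about what "on the page" means here.
import urllib.request

def term_on_page(term, curid):
    url = "https://en.wikipedia.org/?curid=%d" % curid
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    return term.lower() in html.lower()

print(term_on_page("______next", 11139))  # should print True if the page still contains the term
```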
"2c_thrissur" seems to be a "%2c" (comma) in a url. See, for example, https://en.wikipedia.org/wiki/File:GovemedcollegethrissurDistricthosp.JPG. The question here is whether urls should be indexed. It seems that the word breaker split the word before the "%2c".
"noeditsection" matches documents 49483, 72692 (I only searched the first two chunks). The term seems to be Wikipedia markup. See, for example the source code for https://en.wikipedia.org/?curid=49483 at https://en.wikipedia.org/w/index.php?title=Wikipedia:Ignore_all_rules&action=edit
In this case, it is debatable whether this wikipedia markup should be extracted.
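If we decide such markup shouldn't be indexed, one option is to strip MediaWiki behavior switches from the wikitext before word breaking. A minimal sketch; the regex and the place it would hook into the pipeline are assumptions, not the project's existing code:

```python
# Sketch: drop MediaWiki behavior switches such as __NOEDITSECTION__
# before tokenizing. The pattern and where this runs in the extraction
# pipeline are assumptions about how one might wire it in.
import re

BEHAVIOR_SWITCH = re.compile(r"__[A-Z]+__")

def strip_behavior_switches(wikitext):
    return BEHAVIOR_SWITCH.sub(" ", wikitext)

print(strip_behavior_switches("__NOEDITSECTION__ Ignore all rules."))
# -> ' Ignore all rules.'
```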