BitFunnel / Workbench

Java and Lucene based tools for BitFunnel corpus preparation
http://bitfunnel.org
MIT License

The vast majority of documents are tiny #8

Open · danluu opened this issue 7 years ago

danluu commented 7 years ago

If we look at the wikipedia dump currently hosted on Azure, the modal number of postings per document is 5, and things drop off rapidly from there:

Postings,Count
0,5
1,9013
2,161034
3,490873
4,752513
5,795627
6,458944
7,297922
8,187495
9,159601
10,122515
11,98068
12,93155
13,82168
14,80742
15,74154
16,69059
17,64268
18,67888
19,63546
20,63112
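
A rough way to approximate a histogram like this is to count unique terms per `<doc>` block in the wikiextractor output. This is only a sketch: it uses naive word tokenization rather than the actual chunk pipeline, and the input path is a placeholder, so its numbers won't match the table above exactly.

```python
# Approximate postings-per-document histogram over wikiextractor plain-text output.
# "Postings" is taken to be the number of unique lowercased word tokens per <doc>
# block (title line included); tokenization here is naive and won't exactly match
# the chunk pipeline. The input path is a placeholder.
import re
from collections import Counter

DOC_RE = re.compile(r"<doc [^>]*>\n(.*?)</doc>", re.DOTALL)

histogram = Counter()
with open("extracted/AA/wiki_00", encoding="utf-8") as f:
    for body in DOC_RE.findall(f.read()):
        terms = set(re.findall(r"\w+", body.lower()))
        histogram[len(terms)] += 1

print("Postings,Count")
for postings in sorted(histogram):
    print(f"{postings},{histogram[postings]}")
```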
MikeHopcroft commented 7 years ago

The vast majority of the really small documents (2 or 3 postings) are list documents. See, for example, https://en.wikipedia.org/?curid=1333, which is a page about the day "August 8." This page contains three words: the title, "August 8", and the body words "August" and "8". This problem should go away if we rerun wikiextractor with the --lists option. We should investigate the other options at https://github.com/attardi/wikiextractor/blob/master/README.md.
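
For concreteness, a rerun with lists preserved would look something like the sketch below (shown as a Python wrapper; the dump and output paths are placeholders, and the flag spellings should be double-checked against the wikiextractor README linked above).

```python
# Hypothetical wikiextractor rerun that keeps list items, so "August 8"-style
# pages retain their content. All paths below are placeholders.
import subprocess

subprocess.run(
    [
        "python", "WikiExtractor.py",
        "--lists",                         # preserve list items
        "-o", "extracted/",                # output directory (placeholder)
        "enwiki-pages-articles.xml.bz2",   # Wikipedia dump (placeholder)
    ],
    check=True,
)
```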

MikeHopcroft commented 7 years ago

https://en.wikipedia.org/?curid=35348 is an example of a document with one posting. This is also a list document. The only posting is "130s" in the title.

danluu commented 7 years ago

This document turns out to be 0 sized, which seems a bit surprising. It has content, and that content has been there for years, so it's not that we picked up some old, empty version. The document has the following text:

A & A is a computer virus which infects COM files. It changes an infected program’s time and date stamp to the date and time of infection. When activated, the virus clears and reprints blocks of the screen. The infection code contains the string {A&A}

danluu commented 7 years ago

BTW, here are the chunk files with 0 length after filtering:

-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:37 Chunk-1361.chunk
-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:35 Chunk-288.chunk
-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:44 Chunk-4016.chunk
-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:47 Chunk-5677.chunk
-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:49 Chunk-6139.chunk
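
A quick way to find these (each is 27 bytes, presumably just the chunk framing with no document text) is to scan for chunk files under a small size threshold; the directory name below is a placeholder.

```python
# List suspiciously small chunk files; 32 bytes is an arbitrary cutoff chosen to
# catch the 27-byte "empty" chunks above. The directory name is a placeholder.
from pathlib import Path

THRESHOLD = 32  # bytes

for path in sorted(Path("chunks").glob("*.chunk")):
    size = path.stat().st_size
    if size <= THRESHOLD:
        print(f"{path.name}: {size} bytes")
```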
MikeHopcroft commented 7 years ago

I rebuilt the first chunk of wikipedia using the --lists parameter to wikiextractor. This reduced the number of short documents significantly. The data below shows the number of short documents without --lists (on the left) and with --lists (on the right):

[image: short-document counts without --lists (left) vs. with --lists (right)]

MikeHopcroft commented 7 years ago

Now I'm investigating remaining short documents.

11291 (length 2) is a stub for Floccinaucinihilipilification.
11839 (length 2) is a soft redirect page for Wikipedia:GNUStufF.
12296 (length 4) is a stub for List of German proverbs.
12409 (length 4) is a stub for Wikipedia:GNE Project Files.
24922 (length 4) is a stub for List of Polish proverbs.

18247 (length 5) is the Index of philosophy articles (A–C). Most of the content for this page is not actually present in the wikipedia dump source; only the title is.

danluu commented 7 years ago

Does that fix change 36699652? It shouldn't be zero length anymore if the list is included, but it looks like it shouldn't have been zero length in the first place.

MikeHopcroft commented 7 years ago

Just rebuilt the first chunk of wikipedia, adding the -s (preserve sections) and --filter_disambig_pages flags. Here are the results:

no flags: 1536 documents with 25 or fewer postings
--lists: 119 documents with 25 or fewer postings
--lists -s --filter_disambig_pages: 60 documents with 25 or fewer postings

[image: distribution of short-document counts for the three wikiextractor configurations above]

MikeHopcroft commented 7 years ago

Here are documents with 10 or fewer postings in the first chunk now:

5216: 3
11291: 2
11477: 8
12296: 4
18247: 5
18546: 6
21899: 10
24922: 4

Most of these are lists. 5216 is concerning because it has lots of text (see the Khmer language page).

MikeHopcroft commented 7 years ago

I investigated the Khmer language page. There is text in the wikipedia dump file, but wikiextractor loses nearly all of it:

<doc id="5216" url="https://en.wikipedia.org/wiki?curid=5216" title="Khmer language">
Khmer language

</doc>
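
One way to flag pages that come out like this is to scan the wikiextractor output for `<doc>` blocks whose body is just the repeated title. This is only a sketch against the plain-text format shown above, and the input path is a placeholder.

```python
# Flag <doc> blocks whose extracted body is essentially just the title line,
# which is how the Khmer language page above comes out. Input path is a placeholder.
import re

DOC_RE = re.compile(
    r'<doc id="(\d+)"[^>]*title="([^"]*)"[^>]*>\n(.*?)</doc>', re.DOTALL
)

with open("extracted/AA/wiki_00", encoding="utf-8") as f:
    for doc_id, title, body in DOC_RE.findall(f.read()):
        remainder = body.strip()
        # The first body line normally repeats the title; see what's left after it.
        if remainder.startswith(title):
            remainder = remainder[len(title):].strip()
        if not remainder:
            print(f'{doc_id}: "{title}" extracted with no body text')
```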
danluu commented 7 years ago

After the last set of fixes, the mode has changed from 5 to 24.

We no longer have any 0-length documents and the number of 1-length documents went down from 9013 to 88.

It seems likely that we still have problem documents, but it sounds like we're not going to go after them right now.