Open danluu opened 7 years ago
The vast majority of the really small documents (2 or 3 postings) are list documents. See, for example, https://en.wikipedia.org/?curid=1333, which is a page about the day "August 8." This page contains three words: the title "August 8" and the body words "August" and "8". This problem should go away if we rerun wikiextractor with the --lists option. We should investigate the other options at https://github.com/attardi/wikiextractor/blob/master/README.md.
https://en.wikipedia.org/?curid=35348 is an example of a document with one posting. This is also a list document. The only posting is "130s" in the title.
This document turns out to be 0 sized, which seems a bit surprising. It has content in it, and the content has been there for years, so it's not that we got some old empty version. The document has the following text:
A & A is a computer virus which infects COM files. It changes an infected program’s time and date stamp to the date and time of infection. When activated, the virus clears and reprints blocks of the screen. The infection code contains the string {A&A}
BTW, here are the chunk files with 0 lengths after filtering:
-rw-rw-r-- 1 danluu danluu 27 Dec 7 23:37 Chunk-1361.chunk
-rw-rw-r-- 1 danluu danluu 27 Dec 7 23:35 Chunk-288.chunk
-rw-rw-r-- 1 danluu danluu 27 Dec 7 23:44 Chunk-4016.chunk
-rw-rw-r-- 1 danluu danluu 27 Dec 7 23:47 Chunk-5677.chunk
-rw-rw-r-- 1 danluu danluu 27 Dec 7 23:49 Chunk-6139.chunk
I rebuilt the first chunk of wikipedia using the --list parameter to wikiextractor. This reduced the number of short documents significantly. Data below shows number of short documents without --list (on the left) and with --list (on the right):
Now I'm investigating remaining short documents.
11291 (length 2) is a stub for Floccinaucinihilipilification.
11839 (length 2) is a soft redirect page for Wikipedia:GNUStufF.
12296 (length 4) is a stub for List of German proverbs.
12409 (length 4) is a stub for Wikipedia:GNE Project Files.
24922 (length 4) is a stub for List of Polish proverbs.
18247 (length 5) is an Index of philosophy articles (A–C). Most of the content for this page is not actually in the wikipedia dump source code; only the title is.
Does that fix change 36699652? It shouldn't be zero length anymore if the list is included, but it looks like it shouldn't have been zero length in the first place.
Just rebuilt the first chunk of wikipedia, adding the -s (preserve sections) and --filter_disambig_pages flags. Here are the results:
no flags: 1536 documents with 25 or fewer postings
--list: 119 documents with 25 or fewer postings
--list -s --filter_disambig_pages: 60 documents with 25 or fewer postings
Here are documents with 10 or fewer postings in the first chunk now:
5216: 3
11291: 2
11477: 8
12296: 4
18247: 5
18546: 6
21899: 10
24922: 4
Most of these are lists. 5216 is concerning because the source page has lots of text (see Khmer Language).
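For anyone who wants to reproduce a listing like this directly from the wikiextractor output, here is a rough sketch. It assumes the `<doc id=... url=... title=...>` / `</doc>` layout that wikiextractor writes, and it approximates a document's posting count as the number of distinct lowercased `\w+` tokens. That tokenization is an assumption of this sketch, not what the real ingestion pipeline does, so the numbers will not match exactly. The names `iter_docs`, `posting_count`, and `short_documents` are made up for illustration.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Opening tag of a wikiextractor document; only the id is captured here.
DOC_OPEN = re.compile(r'<doc id="(\d+)"[^>]*>')
# Assumed tokenizer: runs of word characters. The real pipeline may differ.
TOKEN = re.compile(r"\w+")


def iter_docs(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (doc_id, text) for each <doc ...>...</doc> block."""
    doc_id = None
    body = []
    for line in lines:
        opened = DOC_OPEN.match(line)
        if opened:
            doc_id = int(opened.group(1))
            body = []
        elif line.strip() == "</doc>" and doc_id is not None:
            yield doc_id, "\n".join(body)
            doc_id = None
        elif doc_id is not None:
            body.append(line.rstrip("\n"))


def posting_count(text: str) -> int:
    """Approximate postings as the number of distinct lowercased tokens."""
    return len({token.lower() for token in TOKEN.findall(text)})


def short_documents(lines: Iterable[str], max_postings: int = 10) -> Dict[int, int]:
    """Map doc id -> approximate posting count for documents at or below the cutoff."""
    result = {}
    for doc_id, text in iter_docs(lines):
        count = posting_count(text)
        if count <= max_postings:
            result[doc_id] = count
    return result
```

Usage would be something like `short_documents(open(path, encoding="utf-8"), max_postings=10)` for each extracted file, where `path` is wherever the extractor output lives.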
I investigated the Khmer Language page. There is text in the wikipedia dump file but wikiextractor loses nearly all of the text:
<doc id="5216" url="https://en.wikipedia.org/wiki?curid=5216" title="Khmer language">
Khmer language
</doc>
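To find other pages that were gutted the same way, one option is to scan the extractor output for documents whose body is nothing but the title. The sketch below assumes the attribute order shown in the snippet above (`id`, `url`, `title`) and compares the title attribute with the body text verbatim, so it would miss titles that contain escaped characters. `title_only_docs` is a hypothetical helper name.

```python
import re
from typing import Iterable, List, Tuple

# Matches the opening tag in the layout shown above: id, url, then title.
DOC_OPEN = re.compile(r'<doc id="(\d+)" url="[^"]*" title="([^"]*)">')


def title_only_docs(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (doc_id, title) for documents whose body is just the title."""
    hits = []
    current = None  # (doc_id, title) while inside a <doc> block
    body = []
    for line in lines:
        stripped = line.strip()
        opened = DOC_OPEN.match(stripped)
        if opened:
            current = (int(opened.group(1)), opened.group(2))
            body = []
        elif stripped == "</doc>" and current is not None:
            doc_id, title = current
            # Ignore blank lines and join what is left of the body.
            text = " ".join(part for part in body if part)
            if text == title:
                hits.append((doc_id, title))
            current = None
        elif current is not None:
            body.append(stripped)
    return hits
```

Any id this reports is worth checking against the raw dump, since the body may be legitimately empty or may have been lost during extraction.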
After the last set of fixes, the mode has changed from 5 to 24.
We no longer have any 0-length documents and the number of 1-length documents went down from 9013 to 88.
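For reference, the mode and the 0-length / 1-length counts can be recomputed from a list of per-document posting counts with something like the sketch below. The function names are made up for illustration, and the input is assumed to be one integer per document, however those counts were produced.

```python
from collections import Counter
from typing import Dict, Iterable


def length_histogram(posting_counts: Iterable[int]) -> Dict[int, int]:
    """Map document length (in postings) -> number of documents with that length."""
    return dict(Counter(posting_counts))


def modal_length(histogram: Dict[int, int]) -> int:
    """Most common document length; ties go to the smaller length.

    Raises ValueError if the histogram is empty.
    """
    return min(histogram, key=lambda length: (-histogram[length], length))
```

With a histogram in hand, `histogram.get(0, 0)` and `histogram.get(1, 0)` give the number of 0-length and 1-length documents, which makes it easy to compare runs before and after a fix.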
It seems likely that we still have problem documents, but it sounds like we're not going to go after them right now.
If we look at the wikipedia dump currently hosted on Azure, the modal number of postings per document is 5, and things drop off rapidly from there: