aarddict / tools

Tools for Aard Dictionary
GNU General Public License v3.0

aardc slow #15

Closed pez252 closed 11 years ago

pez252 commented 12 years ago

Compile time on aardc puts conversion of large dictionaries outside the reach of most users.

Expected process time was 66 days for enwiki on a machine with 4 processors and 2 GB of RAM. After moving to a machine with 32 processors and 100 GB of RAM (some of which was used as a ramdisk for the source and destination files for aardc), it still took 18 hours.

If the process were faster (less than 2 days on a middle-of-the-line current PC), users would be more likely to compile current dictionaries for the community.

itkach commented 12 years ago

Keep in mind that article conversion speed is not constant at all and tends to increase towards the end, so projecting expected time based on the initial speed after a few minutes or even hours just doesn't give you a useful estimate.

English Wikipedia from a March 2011 dump took 2 days 18 hours to compile on a 2.66 GHz i7 (quad core with hyperthreading) with 6 GB of RAM and a regular 7200 RPM hard drive. If I'm not mistaken, article conversion took about 2/3 of the total time; the rest went to assembling the index and converted articles into final volumes. It is slow, but still not too far off from what you want, although I suppose this depends on how you define a "middle of the line" PC.

Article conversion benefits greatly from an increased number of CPUs/cores, while final volume assembly depends mostly on disk speed. Volume assembly may be easier to optimize, with a smarter algorithm and/or a new binary layout. Article conversion is mostly done by mwlib, and with MediaWiki markup and templates being very complex and dirty (I think "crazy" is the proper scientific term), it's hard to pinpoint a particular bottleneck. The most effective path to speed it up is probably distributing the work across multiple machines - but that runs contrary to the goal of making it more "user compilable", since most people don't have access to a large enough number of machines or the time/energy/skill to configure them appropriately.

pez252 commented 12 years ago

I noticed the articles/second was slowly increasing and had assumed it to be an inaccurate calculation of the average. On the 4-processor machine I ran for 2 hours before calculating the 66-day number based on the work done since starting; I then switched to another machine. This issue is squarely in the realm of enhancement and should become less of a problem over time (enwiki growth is slowing and processor power will grow), but I think optimization merits consideration.

Speaking of community contributions, is there a more appropriate place to share the latest dictionary than here (and that Android ticket)?

English Wikipedia Jan 2012 (torrent)

Thanks, and keep up the great work!

itkach commented 12 years ago

The calculation of the average is accurate, it just changes over time: articles vary greatly in size, complexity, and in the number and complexity of the templates they include. Also, a significant portion of all titles in a typical MediaWiki dump are redirects, which are processed quickly.

A good way to share is to send me a link in a github message or on twitter - I can add it to http://aarddict.org/dictionaries/

itkach commented 12 years ago

Usually it's a good idea to compile a small number of articles first (say, 10000 for enwiki) and check the article formatting and metadata before compiling the whole thing.

Siteinfo as returned by the Wikipedia API is pretty much always missing a bunch of languages for recently added Wikipedias (like bjn, rue, gag, mrj, nso), so language links for them are not properly parsed and appear at the end of some articles. These need to be added manually to the interwikimap list in the siteinfo json document before compilation.

Also, from time to time Wikipedia adds new template elements like navigation tables that don't necessarily look good or make sense in Aard Dictionary. They are repeated on many pages and unnecessarily increase the size of the resulting dictionary. It is best to exclude them (by adding to EXCLUDE_CLASSES or EXCLUDED_IDS in mwaardhtmlwriter.py - see issue #11).

Also, make sure the license text is added to the metadata. aardtools/licenses/ includes the two licenses used for Wikipedia (CC with attribution and GFDL) and should automatically pick one of them based on siteinfo; the license text is missing for some reason in your compilation. Finally, it's nice to compile with some language links included (using the --lang-links aardc option).
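
For example, a missing language can be appended to the interwikimap list with a few lines of Python. This is only a sketch: the field names below mirror what the MediaWiki API normally returns, but copy the shape of an existing entry from your own siteinfo file rather than trusting it, and note that the exact location of the interwikimap list in the file may differ.

import json

with open('en.json') as f:                    # the siteinfo file used in this thread
    siteinfo = json.load(f)

# Hill Mari (mrj) is one of the prefixes mentioned above; the url pattern and
# extra fields are assumptions - mirror an existing entry from your own file.
siteinfo['interwikimap'].append({
    'prefix': 'mrj',
    'url': 'http://mrj.wikipedia.org/wiki/$1',
    'language': 'Hill Mari',
    'local': '',
})

with open('en.json', 'w') as f:
    json.dump(siteinfo, f, indent=2)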

pez252 commented 12 years ago

Is there any logic to which languages are returned by the API and which are not?

Oriya (or), accepted on 29-Sep-2011 with 1988 articles, was returned.
Hill Mari (mrj), accepted on 17-Oct-2010 with 5050 articles, is not returned.

Accepted/rejected: http://meta.wikimedia.org/wiki/Requests_for_new_languages#Recently_closed
List of Wikipedias: http://meta.wikimedia.org/wiki/List_of_Wikipedias

I do see the improperly parsed links at the end of some articles.

I didn't see any navigational or similar elements that you have not already filtered. I did see where to do this in mwaardhtmlwriter.py and will keep an eye out for anything new. I think I've corrected the license files on my install of aardtools, so the next run should include the correct licenses. I'm going to fix any languages I can find that are an issue in my en.json and recompile the dictionary with --lang-links tonight.

I guess my only followup questions boil down to: How do you find which languages are missing from en.json? and... How can I identify additional elements that need to be filtered by mwaardhtmlwriter.py?

itkach commented 12 years ago

I have no idea if there's any logic in how new languages are accepted, but it certainly takes a while for these changes to propagate. The best way to identify missing languages would probably be to compare the list of prefixes from the interwiki list in the siteinfo json document with http://meta.wikimedia.org/wiki/List_of_Wikipedias, but compiling a large enough sample and browsing around also reveals offenders quickly. Titles like names of countries and large cities, numbers, or other generic terms tend to have many language links (things that get written about in new wikipedias as a first order of business). For example, looking at the article for Moscow tells us we're missing kbd, gag, koi, mrj, frr, rue, and looking at Earth yields ltg, xmf, frr, pfl, koi, rue, nso. I published a repo with the siteinfo I used for previous compilations here: https://bitbucket.org/itkach/siteinfo

The way to identify candidate elements for removal is, again, to compile a large enough sample and check some random articles. When checking articles in a sample compilation, start aarddict with the -e command line option to enable the WebKit dev tools. When started with -e, the article view context menu (right click) gets an "Inspect" item, which lets you view the article's HTML in WebKit's WebInspector. If undesired parts of the resulting articles have ids or CSS classes, they can be filtered out by adding them to the exclude lists already mentioned. If not... there's no good mechanism at the moment (again, see #11), but here's one example of a possible approach, ugly brute force.
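
For the exclude-list route, additions to mwaardhtmlwriter.py look roughly like this - the id and class values here are made up, so use whatever WebInspector actually shows for the unwanted element (the real lists already contain other entries):

EXCLUDED_IDS = frozenset((
    'coordinates',          # hypothetical element id seen in a sample article
))

EXCLUDE_CLASSES = frozenset((
    'navbox',               # hypothetical css class of an unwanted navigation table
    'metadata',             # hypothetical css class of a maintenance banner
))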

Quickly poking around in your Jan 2012 compilation shows, for example, that letters of the Latin alphabet all have a big useless table titled "The ISO basic Latin alphabet". I'd prefer it removed, although this one doesn't have an id or a narrow enough class to go by, and it's not on that many pages. Some articles have empty sections like "External Links" or "See also"; articles look better without them. But English Wikipedia compiles fairly clean already, so perhaps these minor cleanups are not worth the effort in this case (unlike, say, with English Wiktionary, which is a lot dirtier and lighter on good content). Still, it's worthwhile to visually check new compilations, because sometimes new templates are introduced that affect all or a significant number of articles, or existing templates are modified and break parsing.

pez252 commented 12 years ago

Checking things I thought would be common articles is what I ended up doing. I think I checked Air, Earth, Water and a couple others. The list I came up with was bjn, ltg, xmf, rue, kbd, gag, frr, koi, nso, pfl

For --lang-links, I wasn't sure if a large number of languages would cause any issues. I went with the top 50 Wikipedia languages.

I then compiled the following with the languages added to en.json and with --lang-links specified: link

I had a bit of fun with grep and cut to get the following list of languages that are listed on the List_of_Wikipedias page and not in what is returned by aard-siteinfo:

ace bjn ckb frr gag kbd koi krc ltg mhr mrj mwl nso pcd pfl pnb rue xmf

I'm going to wait for the next wikipedia dump and add the other missed languages.

I see the "ISO basic Latin alphabet" has a class of unicode which sounds fairly generic. In my looking around I didn't see any other boxes (though I did see a blank pre tag at the bottom of a couple articles...). I bunzipped the wiki xml dump to see what I could find, and found many links to audio/video files using class="unicode".

Is there a way to filter by the template used? grep '{{Latin alphabet|.|}}' enwiki-20120104-pages-articles.xml | wc -l came back with 55 results. I suppose I could do grep -v and write the whole thing out without the Latin alphabet template references.

itkach commented 12 years ago

enwiki from March 2011 was compiled with language links from several major Wikipedia languages, plus Greek and Latin. The only thing about adding more --lang-links languages is that it increases the size of the resulting dictionary. From your compilations it looks like adding the top 50 languages added 0.3 GiB (9.4 vs 9.7 GiB) - a fairly minor difference relative to the overall size of enwiki, but not insignificant as an absolute value (bigger than a whole bunch of complete wikis).

Pre-processing the Wikipedia dump and removing certain templates out of band is probably the easiest way to go about it. Pre-processing raw Wikipedia markup in wiki.py for individual articles before conversion should also work, with the regexps to be removed/replaced coming from an external "config" file. Post-processing the resulting XML, either as a parsed tree or as raw text, is another approach, but it is not as convenient: conversion must be done first to get any idea of what portion of the XML may need to be removed, and there may not be an easy way to identify that portion anyway. Technically, working with the parsed wiki markup tree produced by mwlib is another place where cleanup and filtering could be plugged in, but my impression is that this is tricky and brittle, with no chance of applying declarative regexp or XPath rule sets for filtering and a high likelihood of breakage with even minor mwlib updates.

itkach commented 12 years ago

One problem I see with both compilations is that the "server" property in the "general" section of siteinfo is broken - it has a value of "//en.wikipedia.org" instead of "http://en.wikipedia.org". Apart from producing a broken source link in the dictionary info, this also breaks the "view online version" action. Also, it looks like the mrj language is still missing from siteinfo.
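
A few lines of Python run against en.json before compiling would fix the server value - this is just a sketch based on the layout described above; a one-line sed over the file works just as well:

import json

with open('en.json') as f:
    siteinfo = json.load(f)

server = siteinfo['general']['server']        # "general" section, as described above
if server.startswith('//'):
    siteinfo['general']['server'] = 'http:' + server

with open('en.json', 'w') as f:
    json.dump(siteinfo, f, indent=2)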

pez252 commented 12 years ago

Hello again after almost a month!

English Wikipedia February 2012

I used the same updated en.json file as the previous version. I've filtered everything we discussed. Give it a download, and if all looks good, feel free to link to it from the dictionaries page (or rehost it with clearbits.... I'll help seed that one if you do).

Here are the commands in the compile process just for anyone who finds this discussion:

wget http://dumps.wikimedia.org/enwiki/20120211/enwiki-20120211-pages-articles.xml.bz2
bunzip2 enwiki-20120211-pages-articles.xml.bz2
grep -v '{{Latin alphabet|.|}}' enwiki-20120211-pages-articles.xml > enwiki-20120211-pages-articles-filtered.xml
rm enwiki-20120211-pages-articles.xml
bzip2 enwiki-20120211-pages-articles-filtered.xml
mw-buildcdb --input enwiki-20120211-pages-articles-filtered.xml.bz2 --output enwiki-20120211-pages-articles-filtered.cdb
aardc --lang-links de,fr,nl,it,pl,es,ru,ja,pt,sv,zh,ca,uk,no,fi,vi,cs,hu,ko,id,tr,ro,fa,ar,da,eo,sr,lt,sl,sk,ms,he,bg,kk,eu,vo,war,hr,hi,et,az,gl,simple,nn,th,new,el,la,roa-rup,oc wiki enwiki-20120211-pages-articles-filtered.cdb --siteinfo ../en.json

itkach commented 12 years ago

Thank you. The source URL now looks good and "View Online" works. I noticed, however, that notes/references didn't come out right - compare the article "Earth" in Feb 2012 and Mar 2011 and you'll see the difference. I see that I compiled Mar 2011 with mwlib 0.12.14 instead of the 0.12.13 that is specified in setup.py; I don't remember the details, but reference handling may have been the reason. mwlib itself doesn't handle references properly, so aardtools tries to rectify this (relevant code is here, if you care to look), but that code is apparently rather brittle. mwlib versions after 0.12.13 appear to have an issue with namespace handling for some wikipedias (Asian languages specifically), which is why 0.12.13 is in setup.py... I need to investigate this further. In the meantime, if you feel like it, try compiling a small sample with mwlib 0.12.14 - the articles "Earth" and "Moon" are good test cases, and I think they are within the first 10000 articles, so you can verify this quickly (compile with --article-count 10000)

itkach commented 12 years ago

Oh, and it looks like that grep -v '{{Latin alphabet|.|}}' command didn't work - the table is still there... The {} probably need escaping...

itkach commented 12 years ago

I recompiled a small sample of enwiki with mwlib 0.12.14 and this fixed references and notes rendering. I also compiled a small sample of zhwiki and a large number of articles indeed fail... mwlib development has been quite active (already at 0.13.4), so ideally aardtools should be upgraded, but unfortunately mwlib seems to be moving in a direction very different from what aardtools needs; the API aardtools relies on is either deprecated or changed significantly. Catching up with mwlib will be a non-trivial exercise.

Anyway, can you please try it again with mwlib 0.12.14?

maester commented 12 years ago

For compressing big files on multicore/multiprocessor machines it may be faster to use the parallel version of bzip2: lbzip2

Here are some numbers http://vbtechsupport.com/1614/

pez252 commented 12 years ago

Which letter did you see a table on? I checked A and a few others and didn't see the table. I did, however, notice that some letters are missing... I couldn't find B, E, J and a few more in my February dictionary, though they were in my previous versions. I'll take a look to make sure the filtered XML included the article for E. If so, maybe my grep garbled one of the tags.

I did see the references issue and will try mwlib 0.12.14. Unfortunately someone brought down the machine I was using to compile on... I'll take a look on Tuesday when I'm back to work.

I'll also give lbzip2 a try. The compression of the filtered xml doesn't take long compared to the compile, but any little bit helps.

itkach commented 12 years ago

I see the alphabet table in articles like Æ or Á or Ă (looks like pretty much all variations of A except A itself have it). I see articles for B, E and J... maybe you didn't have all the volumes open?

I wonder if re-compressing the xml dump is necessary; I think mw-buildcdb should be able to read an uncompressed dump just as well.

pez252 commented 12 years ago

Just for fun, I installed lbzip2, and here are some results:

ubuntu:~/ramdisk$ time bzip2 -d enwiki-20120211-pages-articles.xml.bz2

real    34m10.443s
user    33m32.042s
sys     0m37.330s

ubuntu:~/ramdisk$ rm enwiki-20120211-pages-articles.xml
ubuntu:~/ramdisk$ time lbzip2 -d enwiki-20120211-pages-articles.xml.bz2

real    1m43.202s
user    53m10.511s
sys     0m55.823s

Just a bit faster...


As for the grep, I found that A through Z had the template referenced as {{Latin alphabet|.|}}, but other letters could just be {{Latin alphabet}}, {{Latin alphabet||dot}}, or similar. They are all still on their own lines (except on the Latin alphabet testcases letters page), so we can modify the regex to fit all the cases where it is on an article.

grep -v '${{Latin alphabet[}|]' enwiki-20120211-pages-articles.xml > enwiki-20120211-pages-articles-filtered.xml

In total this table was on 222 articles.

Upgraded to mwlib 0.12.14 with the following:

pip uninstall mwlib
pip install mwlib==0.12.14

I changed 0.12.13 to 0.12.14 in the file /usr/local/lib/python2.7/dist-packages/aardtools-0.8.3.egg-info/requires.txt. No idea if this is the right way to change the requirement, but it worked...

I did not compress the xml before running mw-buildcdb.

A compile of the first 10000 looked good. Full enwiki compiling now.

pez252 commented 12 years ago

Compile completed a few hours ago... English Wikipedia February 2012

Take a look and let me know if you spot any issues.

There is an error on Â that says: Error using {{unichar}}: Input "00C2" is not a hexadecimal value (expected: like "09AF"). This error exists in your March 2011 compile too. 00C2 looks like valid hex to me, so I'm not sure what it is complaining about (or where the error originates).

itkach commented 12 years ago

Looks pretty good, I think it's ready to replace the March 2011 version. I'll update http://aarddict.org/dictionaries to link to this torrent (if you don't mind) and to Wuala for HTTP download. Clearbits offers too little storage, so I don't want to put it up there.

pez252 commented 12 years ago

Sounds like a plan. Glad I could be of help, and thanks for troubleshooting along the way.

I'll keep seeding the torrent. The machine I am seeding from has reasonably fast upload speeds, but it crashes every few weeks and needs me here to push the button.

pez252 commented 12 years ago

English Wikipedia March 2012

A couple of days late... I did a copy/paste of my regex without paying attention, and only noticed when reviewing the output the next day that I had $ at the start of the regex rather than ^, so it wouldn't match anything... Below is the correct series of commands.

wget http://dumps.wikimedia.org/enwiki/20120307/enwiki-20120307-pages-articles.xml.bz2
lbzip2 -d enwiki-20120307-pages-articles.xml.bz2
grep -v '^{{Latin alphabet[}|]' enwiki-20120307-pages-articles.xml > enwiki-20120307-pages-articles-filtered.xml
mw-buildcdb --input enwiki-20120307-pages-articles-filtered.xml --output enwiki-20120307-pages-articles-filtered.cdb
aardc --lang-links de,fr,nl,it,pl,es,ru,ja,pt,sv,zh,ca,uk,no,fi,vi,cs,hu,ko,id,tr,ro,fa,ar,da,eo,sr,lt,sl,sk,ms,he,bg,kk,eu,vo,war,hr,hi,et,az,gl,simple,nn,th,new,el,la,roa-rup,oc wiki enwiki-20120307-pages-articles-filtered.cdb --siteinfo ../en.json

pez252 commented 12 years ago

English Wikipedia April 2012

I added the language lez to the en.json file and processed as before.

itkach commented 12 years ago

There's a regression compared to the February version: the error unknown operator: u'strong' shows up in the text of many articles instead of a number in front of things like sq mi or km2 or °F (see any article on geography, e.g. Atlantic Ocean).

itkach commented 12 years ago

Or, rather, the error was there all along, but in the April dump the offending template is used a lot more, so it is a lot more visible.

itkach commented 12 years ago

I started seeding again, but you will probably be disappointed (see comments above)

kybernetikos commented 12 years ago

Thanks for seeding anyway! It made an immediate difference.

josefranca commented 12 years ago

itkach, how can we fix this so that if I compile a new version it doesn't have the error you mention? I've got a spare machine at work that I want to put to the task of compiling a recent Wikipedia version.

Thanks.

doozan commented 12 years ago

Here's how I installed aardtools and compiled an updated enwiktionary:

sudo apt-get install build-essential python-dev python-virtualenv libicu44 libicu-dev

virtualenv env-aard
source env-aard/bin/activate

pip install mwlib==0.12.14
pip install PyICU==1.2
pip install -e git+git://github.com/itkach/tools.git@fix_wiktionary_unicode_template#egg=aardtools

# download the xml dump
wget http://dumps.wikimedia.org/enwiktionary/20120910/enwiktionary-20120910-pages-articles.xml.bz2

# create the cdb
mw-buildcdb --input enwiktionary-20120910-pages-articles.xml.bz2 --output enwiktionary-20120910-pages-articles.cdb

# Generate site info
aard-siteinfo http://en.wiktionary.org > enwiktionary.json
# Fix the server url (it starts //en.wiktionary when it should be http://en.wiktionary)
sed -i 's|"//en.wiktionary|"http://en.wiktionary|' enwiktionary.json

# Compile test dictionary
aardc wiki enwiktionary-20120910-pages-articles.cdb --siteinfo enwiktionary.json --article-count 100

# Compile full dictionary
aardc wiki enwiktionary-20120910-pages-articles.cdb --siteinfo enwiktionary.json

(Post edited to reflect proper instructions for building enwiktionary)

itkach commented 12 years ago

I compiled enwiktionary with this version of aardtools: https://github.com/itkach/tools/tree/fix_wiktionary_unicode_template

doozan commented 12 years ago

Thanks, I compiled enwiktionary using the repo you suggested. The resulting dictionary is available here: http://download.doozan.com/aard/enwiktionary-20120910.aar

I scanned a number of articles and it looks clean to me. It's 1.1G vs 900M from the earlier dictionary; it seems there have been a number of translations added to many of the words. If it looks good to you, you're more than welcome to put it up on the main site.

doozan commented 12 years ago

I should also mention that the above dictionary took a full day to compile on my machine, so there's no way I'll be able to compile a full Wikipedia dump. It would be great if we could resolve the unknown operator: u'strong' error above so that someone can compile a newer dictionary. Is that error coming from mwlib or aardc?

Is there any way to extract aard articles from a dictionary using command line tools? It would be nice for regression testing if we could compile a list of articles that are known to be temperamental, extract them from a new dictionary, and scan them for references to tables, invalid HTML, or error messages.

itkach commented 12 years ago

unknown operator: u'strong' is coming from mwlib. aardtools works with fairly old versions of mwlib; perhaps this is fixed in newer versions, but many other things changed too, so getting aardtools to use the latest mwlib will likely be quite a bit of work.

The way to read aard articles from a dictionary is of course to use aarddict: create a Volume instance and use it as a regular Python dictionary. I don't think, however, that this can replace eyeballing - formatting usually seems to break in some new way, so these tests will be brittle and always playing catch-up.
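
Roughly like this - the import path and what a lookup returns are from memory, so treat it as a sketch to adapt rather than working code:

from aarddict.dictionary import Volume     # import path is an assumption

volume = Volume('enwiki-sample.aar')       # any compiled .aar file

for title in ('Earth', 'Moon', 'Atlantic Ocean'):
    article = volume[title]                # dict-style lookup, as described above
    text = str(article)                    # assumption: the article renders to scannable text
    for marker in ('unknown operator', 'Error using'):
        if marker in text:
            print('%s contains %s' % (title, marker))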

doozan commented 12 years ago

I did some poking around in the newer mwlib packages:

Version 0.12.15 seems to work just fine. I can't tell if it fixed the unknown operator: u'strong' error because I only tested with the simple English wikipedia and that doesn't seem to use the offending template.

Version 0.12.16 removed the xhtmlwriter component that aardtools uses for parsing/cleaning the DOM, which makes it and all future versions unusable without either forking mwlib and re-adding xhtmlwriter or abandoning the mwlib implementation of xhtmlwriter in favor of another xml parser.

The latest version, 0.14.0, has removed the buildcdb tool entirely, which pretty much kills any future use of mwlib for one-shot handling of the entire database.

I have practically zero experience with Python, but some quick searches suggest that it should be feasible to replace xhtmlwriter with lxml or beautifulsoup. Wikimedia's own mwdumper.jar looks like it could be a candidate for replacing the buildcdb tool.

Itkach, do you have any thoughts on whether it's better to keep adapting to mwlib or replace it with components that have a more specific goal?

doozan commented 12 years ago

I forked mwlib and rebased it without the changes that removed xhtmlwriter and cdb support. I patched aardtools to change the mwlib=0.12.13 dependency to just mwlib. The resulting code seems to work and I compiled a sample of the first 6000 or so enwikipedia entries using this:

apt-get install cython

virtualenv env-aard-mwlib-test
. env-aard-mwlib-test/bin/activate

# mwlib won't build without rst2html.py and I don't know where to find it
# luckily, it's just for the docs, so replacing it with /bin/true works
ln -s /bin/true env-aard-mwlib-test/bin/rst2html.py

pip install -e git+git://github.com/doozan/aard-mwlib.git#egg=mwlib
pip install -e git+git://github.com/doozan/aard-tools.git#egg=aardtools

aard-siteinfo en.wikipedia.org > enwiki.json
# Fix the server url (it starts // when it should be http://)
sed -i 's|"//en.wiki|"http://en.wiki|' enwiki.json

# Get the first chunk of the enwiki dump
wget http://dumps.wikimedia.org/enwiki/20120902/enwiki-20120902-pages-articles1.xml-p000000010p000010000.bz2
bzip2 -d enwiki-20120902-pages-articles1.xml-p000000010p000010000.bz2
grep -v '^{{Latin alphabet[}|]' enwiki-20120902-pages-articles1.xml-p000000010p000010000 > enwiki-sample-filtered.xml

mw-buildcdb --input enwiki-sample-filtered.xml --output enwiki-sample.cdb
aardc wiki enwiki-sample.cdb --siteinfo enwiki.json

The resulting file is available at http://download.doozan.com/aard/enwiki-sample.aar

I don't see any of the unknown operator: u'strong' errors, but it looks like there's some new cruft to clean up. The Atlantic Ocean entry, for example, has an empty box at the top and, at the end of the article, a lot of empty list items under 'Bordering countries and territories'.

I hope this can be a starting point for anyone with the resources to compile the full Wikipedia.

doozan commented 12 years ago

I pushed out a fix for the empty banner and updated the above sample dictionary. If anyone has the resources to compile a full Wikipedia dump, now is a good time to jump in.

doozan commented 12 years ago

I did some benchmarks running aardc with different versions of python and various optimizations:

aardc wiki simplewiki-20120912.cdb -r --article-count 10000 --siteinfo simple.json

time
0:15:05 python 2.6.6 (debian squeeze)
0:14:10 python 2.6.6 + psyco (debian squeeze)
0:11:43 python 2.7.3rc2 (debian wheezy)
0:10:13 python 2.7.3rc2 without manual garbage collection in wiki.py (debian wheezy)

As you can see, the biggest performance bump comes from using python 2.7 rather than 2.6. Disabling the manual garbage collection in aardtools/wiki.py brings another significant decrease in compilation time and doesn't seem to result in any nasty memory leaks.

Perhaps itkach can comment on the wisdom of removing the manual garbage collection.
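
For reference, this is the general pattern in question - not the actual wiki.py code, just an illustration of what "manual garbage collection" refers to in the benchmark above:

import gc

def convert_all(articles, convert):
    for i, article in enumerate(articles):
        convert(article)
        if i % 1000 == 0:
            # explicit collection of the kind removed in the last benchmark row;
            # without it, CPython's automatic generational GC takes over
            gc.collect()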

doozan commented 12 years ago

The latest mwlib (0.14.1) fixes the unknown operator: u'strong' error.

As I mentioned earlier, mwlib 0.14 has removed the xhtml support and the cdb support. The good news is that the xhtml support is available from pediapress as mwlib.xhtml. I took the old cdb code and released it as mwlib.cdb

I've updated the aardtools code with some fixes to work with the latest mwlib and fixes for the dependency issues. You can now just install the aardtools package and have it pull in all the correct libraries.

virtualenv env-aard
source env-aard/bin/activate

pip install -e git+git://github.com/doozan/aard-tools.git#egg=aardtools

And that's it. Assuming you've installed latex/blahtex/texvc, you should be ready to compile dictionaries.

itkach, feel free to pull my fixes into the master branch.

itkach commented 12 years ago

@doozan thanks, I'll take a look and will try to compile enwiki and enwiktionary.

As for calling gc in wiki.py - it very well may be unnecessary/not useful.

itkach commented 12 years ago

About replacing mwlib and xhtmlwriter... The way I see it, mwlib's main value is that it parses wiki markup and templates very accurately and produces a DOM-like representation from which things like XML and/or HTML can be generated. That is not something that can be easily replicated. https://www.mediawiki.org/wiki/Alternative_parsers lists some parsers that were unavailable before, but with wiki markup and templates being what they are, I wouldn't expect much... An alternative to the current implementation could be to download rendered HTML for each article (from an mwlib render server, from wikipedia.org, or from a local MediaWiki installation) and manipulate and clean up the rendered HTML.

doozan commented 12 years ago

You're completely correct about mwlib's value as a wikitext -> DOM converter. When I posted the earlier ideas about replacing mwlib, I did not realize that the xml dumps contained the articles in raw wikitext format. Since mwlib is actively maintained and commercially supported, and since the mwlib.xhtml and mwlib.cdb libraries can maintain compatibility with aardtools, I see no compelling reason to move away from mwlib.

doozan commented 12 years ago

The October English Wikipedia dump is available on the dump site now, just waiting to be compiled.

From what I can tell, the language links returned in the siteinfo finally caught up to the actual status of Wikipedia and shouldn't need to be amended at the moment. Obviously, they may get out of sync again in the future. Here's how I checked for missing language links. (I also compiled 1000 entries and browsed around, but I think the automated steps should be enough)

# Generate site info
aard-siteinfo en.wikipedia.org > enwiki.json
# Fix the server url (url starts // when it should be http://)
sed -i 's|server": "//|server": "http://|' enwiki.json

# Generate files to check for missing language links
cat enwiki.json | grep prefix | cut -d "\"" -f 4 | sort > lang.enwiki
curl http://wikistats.wmflabs.org/api.php?action=dump\&table=wikipedias\&format=csv | tail -n +2 | cut -d "," -f 3 | sort > lang.wikistats
curl http://en.wikipedia.org/w/api.php?action=sitematrix | grep "language code=" | sed -e 's/^.\+language code=\"\([^\&]\+\).\+/\1/' | sort > lang.sitematrix

# The following commands will show any language included in the wikistats or the wiki site matrix that is NOT included in the siteinfo.
# As of 10/2/2012 there are no missing language links so the following commands should output nothing
# If there are any missing languages, you must edit enwiki.json and add the languages to the interwikimap array
comm -1 -3 lang.enwiki lang.wikistats
comm -1 -3 lang.enwiki lang.sitematrix

itkach commented 12 years ago

@doozan I just finished compiling the September enwiki dump with your aardtools version - looks good, uploading to Wuala now. It took ~4 days on an i7 with 4 hyperthreaded 2.66 GHz cores, so maybe we'll skip October :)

doozan commented 12 years ago

It also seems that the ISO alphabet table is now called "Latin alphabet navbox" and needs to be filtered appropriately. I've found that it's faster (and reversible) to edit the cdb index to remove the offending template rather than filtering it out of the xml with grep:

# Download the wikipedia dump
wget http://dumps.wikimedia.org/enwiki/20121001/enwiki-20121001-pages-articles.xml.bz2

# Convert the xml dump to cdb database
mw-buildcdb --input enwiki-20121001-pages-articles.xml.bz2 --output enwiki-20121001.cdb

# Rename the 'Template:Latin alphabet navbox' index in the database to disable its use in articles
sed -i 's/Template:\(Latin alphabet navbox[0-9]\)/DISABLED:\1/' enwiki-20121001.cdb/wikiidx.cdb

To achieve this, we're basically renaming the entry "Template:Latin alphabet navbox" to "DISABLED:Latin alphabet navbox". All index entries are immediately followed by a numerical index value, so the sed pattern contains a trailing [0-9] to ensure that we avoid matching entries like "Template:Latin alphabet navbox other template".

To avoid corrupting the cdb file, it is essential that the number of characters in the replacement text "DISABLED" (8 characters) matches the number of characters in the original text "Template" (8 characters)

To re-enable the template, just run sed -i 's/DISABLED:/Template:/' enwiki-20121001.cdb/wikiidx.cdb and the file should be restored to the original.

itkach commented 12 years ago

That's an interesting approach. Or perhaps the Python class that provides access to the cdb could exclude or otherwise manipulate templates.
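
Roughly something like this - the class and attribute names here are made up, and the exact lookup interface would have to match whatever aardtools actually calls (the get_page function mentioned further down in this thread); the point is just to intercept template lookups:

EXCLUDED_TEMPLATES = frozenset((
    'Template:Latin alphabet navbox',        # the table filtered with sed above
))

class FilteringWikiDB(object):
    """Hypothetical wrapper around the cdb-backed wiki db object."""

    def __init__(self, wikidb, excluded=EXCLUDED_TEMPLATES):
        self._wikidb = wikidb
        self._excluded = excluded

    def get_page(self, name, *args, **kwargs):
        if name in self._excluded:
            return None                      # behave as if the template doesn't exist
        return self._wikidb.get_page(name, *args, **kwargs)

    def __getattr__(self, attr):             # delegate everything else unchanged
        return getattr(self._wikidb, attr)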

doozan commented 12 years ago

That's probably a much better approach. Here's a very simple patch to achieve just that.

I'm glad to hear that a September dictionary is on the way, thank you for compiling that!

doozan commented 12 years ago

The page filtering works really well. It's also much faster than filtering the resulting XML. By removing a couple of templates before they get converted to XML (only to be removed again in mwaardhtmlwriter), we can squeeze a little more performance out of old aardc.

Here are the most commonly used templates when compiling the first 1000 entries of enwiki:

  count Template name
   1210 Template:Category handler
    836 Template:Namespace detect
    794 Template:Navbar
    793 Template:Category handler/numbered
    793 Template:Category handler/blacklist
    792 Template:If pagename
    786 Template:Navbox
    779 Template:Reflist
    742 Template:Citation/make link
    739 Template:Citation/core
    716 Template:Transclude
    639 Template:Sister
    611 Template:Side box
    595 Template:Hide in print
    554 Template:Rellink

I got the above template count by adding print name to the get_page function, running aardc ... > templates.txt, and then sort templates.txt | uniq -c | sort -nr > templates_sorted.txt

Since aard doesn't support categories anyhow, removing that template saves us from compiling it 1.2 times for every article. The same goes for Navbar, Navbox, and Sister: they're already being filtered out by mwaardhtmlwriter's EXCLUDE_CLASSES, so it's faster to simply exclude the templates.

EXCLUDED_PAGES = frozenset(('Template:Category handler','Template:Navbar','Template:Category handler/blacklist',
                            'Template:Category handler/numbered','Template:Navbox','Template:Sister','Template:Only in print',
                            'Template:Side box','Template:Fix','Template:Fix/category','Template:Commons',
                            'Template:Ambox','Template:DMCA','Template:Refimprove','Template:Latin alphabet navbox'))

Using the above exclusions, I can now compile the first 10000 entries of simplewiki in 0:08:43, vs the previously optimized 0:10:13 reported above.

I only played with a handful of the templates; I'm sure there are plenty of other exclusions that could not only improve performance but also clean up the resulting pages.

itkach commented 12 years ago

Yes, this is probably the best way to filter unwanted content. Reading the list of excluded pages from a file that can be specified as a command line option would effectively solve #11.

itkach commented 12 years ago

For the record, here are my stats from compiling September enwiki dump:

100.00% t: 3 days, 2:40:09 avg: 36.0/s a: 4095367 r: 5568845 s: 0 e: 0 to: 24 f: 1 
...
Compilation took 4 days, 4:11:36

So it took 3 days, 2:40:09 to compile articles and more than a day to perform final volume assembly.

doozan commented 12 years ago

Reading the list of filters from a file would certainly be cleaner and easier for end users. Perhaps a JSON file with values for the page filter in wiki.py and the id/class filters in mwaardhtmlwriter.py. Ideally, you could even add support for the text filtering from your Wiktionary tree and merge that code back into the mainline so everything works off the same codebase.
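
Something along these lines could work - the file name and key names here are placeholders, but the categories match what we've been filtering (template pages, element classes/ids, and regex text replacements), and aardtools would still have to be taught to read it:

import json
import re

def load_filters(path='enwiki-filters.json'):          # hypothetical file name
    with open(path) as f:
        cfg = json.load(f)
    return {
        'excluded_pages': frozenset(cfg.get('excluded_pages', [])),
        'excluded_classes': frozenset(cfg.get('excluded_classes', [])),
        'excluded_ids': frozenset(cfg.get('excluded_ids', [])),
        # (pattern, replacement) pairs for regex-based text filtering
        'replacements': [(re.compile(p), r) for p, r in cfg.get('replacements', [])],
    }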

I created a branch with the page filters that made the most sense to me and ran a test of 10000 enwiki articles for comparison. Here's the dictionary without page filtering, which took 1:12:02 to compile. And here's the dictionary with page filtering, which took 0:57:31 to compile. I've done a bit of browsing through both of them and I don't see any problems with the filtered version.

I think the current set of page filters is as speed optimized as it's going to get. Unless we want to talk about removing the References, Sources, External Links, and/or Further Reading sections at the end of each article, the remaining templates aren't used frequently enough to see a significant speed boost.

However, there are still many templates that could be removed to clean up article formatting; those would likely be site-specific and best read from a config file. Unfortunately, I'm not really a Python coder, so I think that's an exercise best left to @itkach

doozan commented 12 years ago

It's nice to see the new dictionaries up, thank you for compiling those.

I found some time to play with moving the cleanup filters to a config file. The code isn't the cleanest, but it works. I included support for excluding templates, XML classes, and XML ids, and for text replacements via regex. Example filters for enwikipedia and enwiktionary are in /doc

@itkach you sure were right about the enwiktionary articles being messy! Do you remember any specific articles that used the problematic Unicode templates?