aarddict / tools

Tools for Aard Dictionary
GNU General Public License v3.0

aardc slow #15

Closed by pez252 11 years ago

pez252 commented 12 years ago

Compile time on aardc puts conversion of large dictionaries outside the reach of most users.

Expected processing time was 66 days for enwiki on a machine with 4 processors and 2 GB of RAM. After moving to a machine with 32 processors and 100 GB of RAM (some of which was used as a ramdisk for aardc's source and destination files), it still took 18 hours.

If the process were faster (less than 2 days on a current mid-range PC), users would be more likely to compile current dictionaries for the community.

itkach commented 12 years ago

Anything that results in HTML with elements of class unicode, such as a or beer

Looks like this template is handled properly in mwlib 0.14 though. Or maybe the template definition itself changed in recent dumps and no longer causes the problem?

itkach commented 12 years ago

I have a feeling that the comments on this issue, while very useful, have evolved into a rather general discussion about dictionary compilation. I propose to continue this in the Aard Dictionary Google Group and limit further comments on this issue to things directly related to compilation slowness and relevant aspects of the aardc implementation.

itkach commented 11 years ago

Compiling the February enwiki dump with page filtering, before the reworked volume writing in 5ed775aabff2e6f8f135dbc6468ce9e510963ff6, on a quad-core 2.66 GHz CPU with HT (i7), 6 GB of RAM, and a 7200 RPM HDD:

100.00% t: 2 days, 11:58:13 avg: 46.4/s a: 4220222 r: 5792201 s: 0 e: 7 f: 2 
...
Compilation took 3 days, 5:48:30

The same after 5ed775aabff2e6f8f135dbc6468ce9e510963ff6:

100.00% t: 2 days, 12:12:13 avg: 46.2/s a: 4220222 r: 5792201 s: 0 e: 7 f: 2 
...
Compilation took 2 days, 12:11:30
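
To put the two "Compilation took" figures side by side, here is a small sketch that parses both durations and reports the difference. The `parse_duration` helper is made up for this illustration and is not part of aardc; the two input strings are copied from the logs above.

```python
import re
from datetime import timedelta

def parse_duration(text):
    """Parse strings such as '3 days, 5:48:30' or '12:11:30' into a timedelta."""
    m = re.fullmatch(r"(?:(\d+) days?, )?(\d+):(\d{2}):(\d{2})", text.strip())
    if m is None:
        raise ValueError("unrecognized duration: %r" % text)
    # The day count is optional, so a missing group is treated as 0.
    days, hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

before = parse_duration("3 days, 5:48:30")   # total before the commit
after = parse_duration("2 days, 12:11:30")   # total after the commit

saved = before - after
reduction = 100.0 * saved.total_seconds() / before.total_seconds()

print("saved:", saved)                      # saved: 17:37:00
print("reduction: %.1f%%" % reduction)      # reduction: 22.6%
```

The article-processing time (`t:`) is nearly identical in both runs, so the roughly 17.5 hours saved appear to come from the steps that follow article processing.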

Also, with 5ed775aabff2e6f8f135dbc6468ce9e510963ff6 the article temp storage file is no longer opened with mmap, so compiling with the default volume size (2 GB) will probably now work on 32-bit.
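
For background on why this matters: mmap maps a file into the process's virtual address space, which is scarce in a 32-bit process, so mapping a file close to 2 GB can fail. Plain seek and read calls need only a small buffer per record. The sketch below illustrates that general approach; `TempArticleStore` and its length-prefixed record layout are assumptions made for the example, not the actual aardc code.

```python
import struct
import tempfile

class TempArticleStore:
    """Hypothetical append-only store: each record is a 4-byte length plus payload."""

    def __init__(self):
        # An ordinary binary temp file, accessed with seek/read rather than mmap.
        self._file = tempfile.TemporaryFile()
        self._offsets = []

    def append(self, data):
        """Write one record at the end of the file and return its index."""
        self._file.seek(0, 2)  # 2 = seek relative to end of file
        self._offsets.append(self._file.tell())
        self._file.write(struct.pack(">I", len(data)))
        self._file.write(data)
        return len(self._offsets) - 1

    def read(self, index):
        """Read back a single record; only that record is held in memory."""
        self._file.seek(self._offsets[index])
        (length,) = struct.unpack(">I", self._file.read(4))
        return self._file.read(length)

    def close(self):
        self._file.close()

store = TempArticleStore()
first = store.append(b"<p>aardvark</p>")
second = store.append(b"<p>beer</p>")
assert store.read(first) == b"<p>aardvark</p>"
```

The trade-off is an extra system call per access, in exchange for memory use that does not depend on the size of the temp file.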

I think these improvements put compiling even the largest mediawiki dumps well within reach of anyone with a reasonably modern desktop machine.

doozan commented 11 years ago

While I agree that the speed has increased enough to close this issue, I think there's something wrong with the overall time calculation. Shouldn't the total compilation time be greater than the article compilation time?

100.00% t: 2 days, 12:12:13 avg: 46.2/s a: 4220222 r: 5792201 s: 0 e: 7 f: 2
...
Compilation took 2 days, 12:11:30
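
The size of the discrepancy can be checked directly from the two quoted lines. The sketch below uses illustrative helper names that are not part of aardc. The thread does not define the `a:` and `r:` counters, so the rate check only notes that (a + r) / t happens to reproduce the printed average.

```python
import re
from datetime import timedelta

progress = "100.00% t: 2 days, 12:12:13 avg: 46.2/s a: 4220222 r: 5792201 s: 0 e: 7 f: 2"
total = "Compilation took 2 days, 12:11:30"

DURATION = r"(?:(\d+) days?, )?(\d+):(\d{2}):(\d{2})"

def find_duration(text):
    """Return the first '[N days, ]H:MM:SS' duration found in text."""
    m = re.search(DURATION, text)
    days, hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

def find_counter(name, text):
    """Return the integer following 'name: ' in a progress line."""
    return int(re.search(r"\b%s: (\d+)" % name, text).group(1))

article_time = find_duration(progress)
total_time = find_duration(total)
gap = article_time - total_time
print("reported total is shorter by", gap)   # reported total is shorter by 0:00:43

# Assumption: the average is computed over the a and r counters combined.
items = find_counter("a", progress) + find_counter("r", progress)
rate = items / article_time.total_seconds()
print("(a + r) / t = %.1f per second" % rate)   # (a + r) / t = 46.2 per second
```

So the reported total is 43 seconds shorter than the article-processing time, which suggests the two values are measured from different starting points.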

itkach commented 11 years ago

Finalizing volumes is so fast now time actually goes back when it happens ;)