Closed pez252 closed 11 years ago
I have a feeling that the comments on this issue, while very useful, have evolved into a rather general discussion about dictionary compilation. I propose continuing that discussion in the Aard Dictionary Google Group and limiting further comments here to things directly related to compilation slowness and the relevant aspects of the aardc implementation.
Compiling the February enwiki dump with page filtering, before the reworked volume writing in 5ed775aabff2e6f8f135dbc6468ce9e510963ff6, on a quad-core 2.66GHz CPU with HT (i7), 6 GB of RAM and a 7200 RPM HDD:
100.00% t: 2 days, 11:58:13 avg: 46.4/s a: 4220222 r: 5792201 s: 0 e: 7 f: 2
...
Compilation took 3 days, 5:48:30
The same after 5ed775aabff2e6f8f135dbc6468ce9e510963ff6:
100.00% t: 2 days, 12:12:13 avg: 46.2/s a: 4220222 r: 5792201 s: 0 e: 7 f: 2
...
Compilation took 2 days, 12:11:30
Also, with 5ed775aabff2e6f8f135dbc6468ce9e510963ff6 the article temp storage file is no longer opened with mmap, so compiling with the default volume size (2 GB) will probably now work on 32-bit systems.
I think these improvements put compiling even the largest mediawiki dumps well within reach of anyone with a reasonably modern desktop machine.
While I agree that the speed has increased enough to close this issue, I think there's something wrong with the overall time calculation. Shouldn't the total compilation time be greater than the article compilation time?
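As a sanity check on the two progress lines above, here is a minimal Python sketch that recomputes the reported `avg` figure. It assumes that `avg` is the sum of the `a` and `r` counters divided by the elapsed time `t`; that reading is inferred from the logged numbers only, not from the aardc source, and the helper names are made up for illustration.

```python
import re
from datetime import timedelta

# Progress lines copied from the two runs above.
PROGRESS_LINES = {
    "before": "100.00% t: 2 days, 11:58:13 avg: 46.4/s a: 4220222 r: 5792201 s: 0 e: 7 f: 2",
    "after": "100.00% t: 2 days, 12:12:13 avg: 46.2/s a: 4220222 r: 5792201 s: 0 e: 7 f: 2",
}

# Hypothetical pattern for this log format; not taken from aardc itself.
LINE_RE = re.compile(
    r"t: (?:(\d+) days?, )?(\d+):(\d{2}):(\d{2}) "
    r"avg: ([\d.]+)/s a: (\d+) r: (\d+)"
)


def recompute_rate(line):
    """Return (reported avg, recomputed avg), assuming avg = (a + r) / t."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("unrecognised progress line: %r" % line)
    days, hours, minutes, seconds, avg, a, r = match.groups()
    elapsed = timedelta(
        days=int(days or 0),
        hours=int(hours),
        minutes=int(minutes),
        seconds=int(seconds),
    )
    return float(avg), (int(a) + int(r)) / elapsed.total_seconds()


rates = {label: recompute_rate(line) for label, line in PROGRESS_LINES.items()}
for label, (reported, recomputed) in rates.items():
    print(f"{label}: reported {reported}/s, recomputed {recomputed:.3f}/s")
```

Under that assumption both recomputed rates round to the reported values, which suggests the per-second figure is about the same in both runs and that the reworked volume writing changed the finalization step rather than article processing.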
100.00% t: 2 days, 12:12:13 avg: 46.2/s a: 4220222 r: 5792201 s: 0 e: 7 f: 2
...
Compilation took 2 days, 12:11:30
Finalizing volumes is so fast now that time actually goes back when it happens ;)
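The discrepancy can be quantified from the logged figures alone. The sketch below, with hypothetical helper names, parses the elapsed time from each progress line and the matching "Compilation took" value, then subtracts one from the other. It makes no claim about how aardc measures either time.

```python
import re
from datetime import timedelta

DURATION_RE = re.compile(r"(?:(\d+) days?, )?(\d+):(\d{2}):(\d{2})")


def parse_duration(text):
    """Parse strings such as '2 days, 12:12:13' or '5:48:30' into a timedelta."""
    match = DURATION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError("unrecognised duration: %r" % text)
    days, hours, minutes, seconds = match.groups()
    return timedelta(
        days=int(days or 0),
        hours=int(hours),
        minutes=int(minutes),
        seconds=int(seconds),
    )


# (elapsed time at 100% in the progress line, "Compilation took" total)
RUNS = {
    "before": ("2 days, 11:58:13", "3 days, 5:48:30"),
    "after": ("2 days, 12:12:13", "2 days, 12:11:30"),
}

# Total minus progress time, in seconds; a negative value means the
# reported total is shorter than the time already shown at 100%.
gaps = {
    label: (parse_duration(total) - parse_duration(progress)).total_seconds()
    for label, (progress, total) in RUNS.items()
}

for label, gap in gaps.items():
    print(f"{label}: total - progress = {gap:+.0f} s")
```

For the earlier run the gap is 64217 s (about 17 hours 50 minutes spent after the last progress line), while for the later run it is -43 s, so the reported total ends 43 seconds before the progress counter does.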
Compile time on aardc puts conversion of large dictionaries outside the reach of most users.
Expected processing time was 66 days for enwiki on a machine with 4 processors and 2 GB of RAM. After moving to a machine with 32 processors and 100 GB of RAM (some of which was used as a ramdisk for the source and destination files for aardc) it still took 18 hours.
If the process were faster (less than 2 days on a mid-range current PC), users would be more likely to compile current dictionaries for the community.
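For scale, the figures in this report can be compared with a few lines of Python. This is plain arithmetic on the numbers quoted above (66 days estimated, 18 hours measured, a 2-day target); the variable names are illustrative only.

```python
from datetime import timedelta

estimated_small_machine = timedelta(days=66)   # 4 processors, 2 GB of RAM (estimate)
measured_large_machine = timedelta(hours=18)   # 32 processors, 100 GB of RAM
target = timedelta(days=2)                     # goal stated in the report

# Dividing two timedeltas yields a plain float ratio.
speedup = estimated_small_machine / measured_large_machine
needed_on_small_machine = estimated_small_machine / target

print(f"large machine vs. estimate: {speedup:.0f}x faster")
print(f"speedup needed to reach the target on the small machine: {needed_on_small_machine:.0f}x")
```

In other words, the much larger machine was roughly 88 times faster than the estimate for the smaller one, and the smaller machine would have needed a speedup of about 33 times to meet the 2-day goal.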