attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

'maximum template recursion' error after a few hours #2

Closed: agoyaliitk closed this issue 9 years ago

agoyaliitk commented 9 years ago

Can you explain why this error occurs? I used the updated version of the script uploaded yesterday. Now it's giving this error:

Traceback (most recent call last):
  File "./WikiExtractor.py", line 1797, in <module>
    main()
  File "./WikiExtractor.py", line 1793, in main
    process_data(input_file, args.templates, output_splitter)
  File "./WikiExtractor.py", line 1621, in process_data
    extract(id, title, page, output)
  File "./WikiExtractor.py", line 132, in extract
    text = clean(text)
  File "./WikiExtractor.py", line 1256, in clean
    text = expandTemplates(text)
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 808, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 769, in expandTemplate
    params = templateParams(parts[1:], depth)
  File "./WikiExtractor.py", line 396, in templateParams
    parameters = [expandTemplates(p, frame) for p in parameters]
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 769, in expandTemplate
    params = templateParams(parts[1:], depth)
  File "./WikiExtractor.py", line 396, in templateParams
    parameters = [expandTemplates(p, frame) for p in parameters]
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 808, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 313, in expandTemplates
    res += text[cur:]
MemoryError

attardi commented 9 years ago


Because there are template definitions that invoke themselves recursively. In the case in question, the template invocation

{{Multiple sclerosis}}

expands to a body

{{Navbox | name = Demyelinating diseases of CNS | title = [[Multiple sclerosis]] and other [[demyelinating disease]]s of [[Central nervous system|CNS]] ([[ICD-10 Chapter VI: Diseases of the nervous system#%28G35–G37%29 Demyelinating diseases of the central nervous system|G35–G37]], [[List of ICD-9 codes 320–359: diseases of the nervous system#Other disorders of the central nervous system %28340–349%29|340–341]]) | bodyclass = hlist | {{Multiple sclerosis|state=expanded}}) | titlestyle = background: Silver; ...

Since that body invokes {{Multiple sclerosis}} again, the template expansion procedure would keep expanding forever. Templates are to be treated as macros, in which recursion is not allowed.

I added a check on the depth of recursive expansion, similar to the one used in the official code from MediaWiki, to handle these malformed templates.
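A minimal sketch of such a guard (a hypothetical simplification, not the actual code; the real expandTemplates in WikiExtractor.py threads the depth through the calls shown in the tracebacks below):

maxTemplateRecursionLevels = 16  # the limit reported in the warnings below

def expandTemplates(text, depth=0):
    """Expand {{...}} invocations, refusing to recurse past the limit."""
    if depth > maxTemplateRecursionLevels:
        # A self-invoking template such as {{Multiple sclerosis}} would
        # otherwise expand forever; give up and drop the invocation.
        return ''
    # ... find each {{...}} span in text and replace it with
    # expandTemplates(body, depth + 1), as a macro processor would ...
    return text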

-- Beppe

agoyaliitk commented 9 years ago

What can I do now to get past this? It's giving the memory error.

attardi commented 9 years ago

Please tell me the ID of the article that was printed before the traceback, and the version of the Wikipedia dump you are using, so that I can investigate.

agoyaliitk commented 9 years ago

I guess the ID you are asking for is 66512. Wikipedia dump: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

I have attached the error again with some more detail. Thanks for your help :)

INFO:root:66495 Final Fantasy III
INFO:root:66496 Hippogriff
INFO:root:66499 Informal sector
INFO:root:66505 Secrecy
INFO:root:66511 MX record
INFO:root:66512 Fern
WARNING:root:Reached max template recursion: 16
WARNING:root:Reached max template recursion: 16
Traceback (most recent call last):
  File "./WikiExtractor.py", line 1797, in <module>
    main()
  File "./WikiExtractor.py", line 1793, in main
    process_data(input_file, args.templates, output_splitter)
  File "./WikiExtractor.py", line 1621, in process_data
    extract(id, title, page, output)
  File "./WikiExtractor.py", line 132, in extract
    text = clean(text)
  File "./WikiExtractor.py", line 1256, in clean
    text = expandTemplates(text)
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 808, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 769, in expandTemplate
    params = templateParams(parts[1:], depth)
  File "./WikiExtractor.py", line 396, in templateParams
    parameters = [expandTemplates(p, frame) for p in parameters]
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 769, in expandTemplate
    params = templateParams(parts[1:], depth)
  File "./WikiExtractor.py", line 396, in templateParams
    parameters = [expandTemplates(p, frame) for p in parameters]
  File "./WikiExtractor.py", line 307, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 808, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 313, in expandTemplates
    res += text[cur:]
MemoryError

agoyaliitk commented 9 years ago

Wikipedia dump: enwiki-latest-pages-articles.xml.bz2
Dated: 06-Apr-2015 22:06
Size: 11820881800 bytes

sanja7s commented 9 years ago

I get a similar error (I edited the file a bit, as I need only raw text output with no titles or URLs, but that should not have changed anything in the core program):

File "WikiExtractor_v27s.py", line 789, in expandTemplate params = templateParams(parts[1:], depth) File "WikiExtractor_v27s.py", line 416, in templateParams parameters = [expandTemplates(p, frame) for p in parameters] File "WikiExtractor_v27s.py", line 327, in expandTemplates res += expandTemplate(text[s+2:e-2], depth+l) File "WikiExtractor_v27s.py", line 828, in expandTemplate ret = expandTemplates(template, depth + 1) File "WikiExtractor_v27s.py", line 333, in expandTemplates res += text[cur:] MemoryError

In my case, it had reached 313280 articles before this error. The last article was:

945695 Canada at the 1904 Summer Olympics

The memory consumption I was seeing during execution was rather interesting, so I took a screenshot at some point:

[screenshot: mem_consumption_wikiextract]

and the Wikipedia dump I use is:
2015-03-07: Recombine articles, templates, media/file descriptions, and primary meta-pages.
enwiki-20150304-pages-articles.xml.bz2 (10.9 GB)

attardi commented 9 years ago

I fixed a few issues and was able to process the latest Wikipedia dump. Processing the dump requires about 3GB of memory and runs for several hours. I have added the option --no-templates for extracting text without expanding templates, as in previous releases of WikiExtractor. This reduces the memory needed to about 500MB and significantly speeds up processing, but all templates are replaced with blanks.
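For example, on the dump used in this thread (invocation sketched from the options mentioned here; adjust to your setup):

./WikiExtractor.py --no-templates enwiki-latest-pages-articles.xml.bz2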

agoyaliitk commented 9 years ago

I will try it again. Thanks

agoyaliitk commented 9 years ago

No luck. It's still giving the same error:

INFO:root:66499 Informal sector
INFO:root:66505 Secrecy
INFO:root:66511 MX record
INFO:root:66512 Fern
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
Traceback (most recent call last):
  File "./WikiExtractor.py", line 1838, in <module>
    main()
  File "./WikiExtractor.py", line 1834, in main
    process_data(input_file, args.templates, output_splitter)
  File "./WikiExtractor.py", line 1658, in process_data
    extract(id, title, page, output)
  File "./WikiExtractor.py", line 154, in extract
    text = clean(text)
  File "./WikiExtractor.py", line 1293, in clean
    text = expandTemplates(text)
  File "./WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 838, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 799, in expandTemplate
    params = templateParams(parts[1:], depth+1)
  File "./WikiExtractor.py", line 423, in templateParams
    parameters = [expandTemplates(p, depth) for p in parameters]
  File "./WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 799, in expandTemplate
    params = templateParams(parts[1:], depth+1)
  File "./WikiExtractor.py", line 423, in templateParams
    parameters = [expandTemplates(p, depth) for p in parameters]
  File "./WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "./WikiExtractor.py", line 838, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "./WikiExtractor.py", line 338, in expandTemplates
    res += text[cur:]
MemoryError

attardi commented 9 years ago

Processing that file on my machine required 5GB of memory, so it is possible that the memory on your machine gets exhausted.

You can try reducing the maximum depth of recursion by setting, for example:

maxTemplateRecursionLevels = 8

If that does not help, you will have to disable template expansion with the option --no-templates.

Let me know.

-- Beppe


agoyaliitk commented 9 years ago

I'll try. Thanks.

cifkao commented 9 years ago

I have a similar problem with this article: INFO:root:1908699 Lepospondyli. It takes a lot more time than other articles and then I get this output:

INFO:root:1908699       Lepospondyli
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
Traceback (most recent call last):
  File "wikiextractor/WikiExtractor.py", line 1838, in <module>
    main()
  File "wikiextractor/WikiExtractor.py", line 1834, in main
    process_data(input_file, args.templates, output_splitter)
  File "wikiextractor/WikiExtractor.py", line 1658, in process_data
    extract(id, title, page, output)
  File "wikiextractor/WikiExtractor.py", line 154, in extract
    text = clean(text)
  File "wikiextractor/WikiExtractor.py", line 1293, in clean
    text = expandTemplates(text)
  File "wikiextractor/WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "wikiextractor/WikiExtractor.py", line 799, in expandTemplate
    params = templateParams(parts[1:], depth+1)
  File "wikiextractor/WikiExtractor.py", line 423, in templateParams
    parameters = [expandTemplates(p, depth) for p in parameters]
  File "wikiextractor/WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "wikiextractor/WikiExtractor.py", line 799, in expandTemplate
    params = templateParams(parts[1:], depth+1)
  File "wikiextractor/WikiExtractor.py", line 423, in templateParams
    parameters = [expandTemplates(p, depth) for p in parameters]
  File "wikiextractor/WikiExtractor.py", line 331, in expandTemplates
    res += expandTemplate(text[s+2:e-2], depth+l)
  File "wikiextractor/WikiExtractor.py", line 838, in expandTemplate
    ret = expandTemplates(template, depth + 1)
  File "wikiextractor/WikiExtractor.py", line 338, in expandTemplates
    res += text[cur:]
MemoryError

I had 8 GB of memory reserved for the process.

agoyaliitk commented 9 years ago

Got the same error as cifkao: a memory error after article no. 1908699 Lepospondyli.

attardi commented 9 years ago

I have committed a new version that should fix the memory problems. I completely revised the strategy of parameter evaluation. For example, in article no. 3616279 Arthrodira, there was a very deep dendrogram whose expansion was exponential in depth. Now parameters are expanded before substitution, and this solves the problem. Please try it.
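A self-contained toy illustrating why this helps (hypothetical code, not WikiExtractor itself; substitute and expand are stand-ins that only show the order of evaluation):

def substitute(body, args):
    """Replace {{{name}}} placeholders in body with argument values."""
    for name, value in args.items():
        body = body.replace('{{{%s}}}' % name, value)
    return body

def expand(text, depth=0):
    """Stand-in for template expansion; counts calls for illustration."""
    expand.calls += 1
    return text  # a real expander would resolve {{...}} spans here
expand.calls = 0

# Eager evaluation: each argument is expanded once and the flat result is
# substituted, so an argument referenced three times costs one expansion.
body = '{{{x}}} and {{{x}}} and {{{x}}}'
result = substitute(body, {'x': expand('value', 1)})
print(result)        # value and value and value
print(expand.calls)  # 1, not 3

Substituting the raw argument text first and expanding afterwards would instead re-expand each of the three references, and with nested templates that re-expansion compounds at every level.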

attardi commented 9 years ago

To all of you who complained about memory or speed problems with WikiExtractor, I released a new version that performs better and keeps a cache of parsed templates.
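The cache amounts to memoizing the parse step. A minimal sketch, assuming templates are looked up by title (getParsedTemplate and the stand-in parser are hypothetical):

templateCache = {}

def getParsedTemplate(title, body):
    """Parse a template body once and reuse the result on later calls."""
    parsed = templateCache.get(title)
    if parsed is None:
        parsed = body.split('|')  # stand-in for the real template parser
        templateCache[title] = parsed
    return parsed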

I have tested it on the English Wikipedia and it runs 5 times faster, while using 20% more memory (4GB). There were also numerous bug fixes.

There is also a new command line option --xml that attempts to produce HTML instead of pure text, preserving headings, lists and links.

Thank you for your patience.

-- Beppe Attardi