Closed agoyaliitk closed 9 years ago
On 4/10/2015 09:00, agoyaliitk wrote:
Can you explain why this error occurs?
— Reply to this email directly or view it on GitHub https://github.com/attardi/wikiextractor/issues/2.
Because there are template definitions that invoke themselves recursively. In the case in question, the template invocation
{{Multiple sclerosis}}
expands to a body
{{Navbox | name = Demyelinating diseases of CNS | title = [[Multiple sclerosis]] and other [[demyelinating disease]]s of [[Central nervous system|CNS]]([[ICD-10 Chapter VI: Diseases of the nervous system#%28G35–G37%29 Demyelinating diseases of the central nervous system|G35–G37]], [[List of ICD-9 codes 320–359: diseases of the nervous system#Other disorders of the central nervous system %28340–349%29|340–341]]) |bodyclass = hlist |{{Multiple sclerosis|state=expanded}}) | titlestyle = background: Silver; ...
and the template expansion procedure would keep expanding forever. Templates are to be considered as macros, in which recursion is not allowed.
I added a check on the depth of recursive expansion, similar to the one used in the official code from MediaWiki, to handle these malformed templates.
-- Beppe
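To illustrate the depth check described above, here is a minimal sketch. It is not the code from WikiExtractor.py: the template store, the helper names and the parameter handling (parameters are simply ignored) are assumptions made for the example, and the limit of 16 only echoes the warning quoted later in this thread.

```python
import logging

MAX_TEMPLATE_DEPTH = 16  # assumed limit, echoing the "16" in the warning quoted in this thread

# Hypothetical template store (name -> body); "Loop" invokes itself.
TEMPLATES = {
    "Hello": "hi",
    "Loop": "x{{Loop}}",
}


def find_templates(text):
    """Yield (start, end) spans of the outermost {{...}} invocations."""
    depth = 0
    start = 0
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield start, i + 2
            i += 2
        else:
            i += 1


def expand(text, depth=0):
    """Expand templates recursively, giving up below MAX_TEMPLATE_DEPTH."""
    if depth > MAX_TEMPLATE_DEPTH:
        logging.warning("Reached max template recursion: %d", MAX_TEMPLATE_DEPTH)
        return ""  # drop the offending invocation instead of looping forever
    out = []
    cur = 0
    for s, e in find_templates(text):
        out.append(text[cur:s])
        # Parameters are ignored in this sketch; only the name is looked up.
        name = text[s + 2:e - 2].split("|", 1)[0].strip()
        out.append(expand(TEMPLATES.get(name, ""), depth + 1))
        cur = e
    out.append(text[cur:])
    return "".join(out)
```

With this sketch, expand("{{Loop}}") terminates after 16 nested expansions and logs one warning, whereas without the depth test it would recurse until Python's own recursion limit or memory ran out.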
What can I do now to get past this? It's giving the memory error.
Please tell me the ID number of the article that was printed before the traceback, and the version of the Wikipedia dump you are using, so that I can investigate.
I guess the id you are asking for would be 66512. Wikipedia dump https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
I have again attached the error with some more detail. Thanks for your help:)
INFO:root:66495 Final Fantasy III
INFO:root:66496 Hippogriff
INFO:root:66499 Informal sector
INFO:root:66505 Secrecy
INFO:root:66511 MX record
INFO:root:66512 Fern
WARNING:root:Reached max template recursion: 16
WARNING:root:Reached max template recursion: 16
Traceback (most recent call last):
File "./WikiExtractor.py", line 1797, in
Wikipedia dump: enwiki-latest-pages-articles.xml.bz2, 06-Apr-2015 22:06, 11820881800
I get a similar error (I edited the file a bit as I need only raw text output, no titles or urls, but that should not have changed anything in the core program):
File "WikiExtractor_v27s.py", line 789, in expandTemplate
params = templateParams(parts[1:], depth)
File "WikiExtractor_v27s.py", line 416, in templateParams
parameters = [expandTemplates(p, frame) for p in parameters]
File "WikiExtractor_v27s.py", line 327, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "WikiExtractor_v27s.py", line 828, in expandTemplate
ret = expandTemplates(template, depth + 1)
File "WikiExtractor_v27s.py", line 333, in expandTemplates
res += text[cur:]
MemoryError
And in my case, it has reached 313280 articles before this error. The last article is:
945695 Canada at the 1904 Summer Olympics
I saw rather interesting memory consumption during the execution, so I took a screenshot at some point:
The Wikipedia dump I use is enwiki-20150304-pages-articles.xml.bz2 (10.9 GB), listed as: 2015-03-07, Recombine articles, templates, media/file descriptions, and primary meta-pages.
I fixed a few issues and was able to process the latest Wikipedia dump. Processing the dump requires about 3GB of memory and runs for several hours. I have added the option --no-templates for extracting text without expanding templates, as in the previous releases of WikiExtractor. This reduces the memory needed to about 500MB and significantly speeds up processing, but all templates will be replaced with blanks.
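As a rough picture of what skipping expansion means, the sketch below drops every balanced {{...}} invocation in a single pass, without looking up any template definition. The function name is hypothetical and this is not the code behind --no-templates; in particular, whether the real option leaves an empty string or a space in place of each template is an assumption here (the sketch leaves nothing).

```python
def strip_templates(text):
    """Remove balanced {{...}} invocations without expanding them."""
    out = []
    depth = 0  # current nesting level of {{ ... }}
    i = 0
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i) and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])  # keep only text outside any template
            i += 1
    return "".join(out)
```

Because no template body is ever fetched or substituted, the work is linear in the length of the page, which fits the much lower memory figure reported above.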
I will try it again. Thanks
No luck. Still giving the same error.
INFO:root:66499 Informal sector
INFO:root:66505 Secrecy
INFO:root:66511 MX record
INFO:root:66512 Fern
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
Traceback (most recent call last):
File "./WikiExtractor.py", line 1838, in
Processing that file on my machine required 5GB of memory. So it is possible that on your machine the memory gets exhausted.
You can try reducing the maximum depth of recursion, by setting for example
maxTemplateRecursionLevels = 8
If that does not help, you will have to disable templates with option
--no-templates.
Let me know.
-- Beppe
On 4/11/2015 22:39, agoyaliitk wrote:
No change. Giving the same error again.
INFO:root:66499 Informal sector
INFO:root:66505 Secrecy
INFO:root:66511 MX record
INFO:root:66512 Fern
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
Traceback (most recent call last):
File "./WikiExtractor.py", line 1838, in
main()
File "./WikiExtractor.py", line 1834, in main
process_data(input_file, args.templates, output_splitter)
File "./WikiExtractor.py", line 1658, in process_data
extract(id, title, page, output)
File "./WikiExtractor.py", line 154, in extract
text = clean(text)
File "./WikiExtractor.py", line 1293, in clean
text = expandTemplates(text)
File "./WikiExtractor.py", line 331, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 838, in expandTemplate
ret = expandTemplates(template, depth + 1)
File "./WikiExtractor.py", line 331, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 799, in expandTemplate
params = templateParams(parts[1:], depth+1)
File "./WikiExtractor.py", line 423, in templateParams
parameters = [expandTemplates(p, depth) for p in parameters]
File "./WikiExtractor.py", line 331, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 799, in expandTemplate
params = templateParams(parts[1:], depth+1)
File "./WikiExtractor.py", line 423, in templateParams
parameters = [expandTemplates(p, depth) for p in parameters]
File "./WikiExtractor.py", line 331, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 838, in expandTemplate
ret = expandTemplates(template, depth + 1)
File "./WikiExtractor.py", line 338, in expandTemplates
res += text[cur:]
MemoryError
I'll try. thanks
I have a similar problem with article 1908699 (Lepospondyli). It takes a lot more time than other articles and then I get this output:
INFO:root:1908699 Lepospondyli
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
WARNING:root:Max template recursion exceeded!
WARNING:root:Skipping page with empty title
Traceback (most recent call last):
File "wikiextractor/WikiExtractor.py", line 1838, in <module>
main()
File "wikiextractor/WikiExtractor.py", line 1834, in main
process_data(input_file, args.templates, output_splitter)
File "wikiextractor/WikiExtractor.py", line 1658, in process_data
extract(id, title, page, output)
File "wikiextractor/WikiExtractor.py", line 154, in extract
text = clean(text)
File "wikiextractor/WikiExtractor.py", line 1293, in clean
text = expandTemplates(text)
File "wikiextractor/WikiExtractor.py", line 331, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "wikiextractor/WikiExtractor.py", line 799, in expandTemplate
params = templateParams(parts[1:], depth+1)
File "wikiextractor/WikiExtractor.py", line 423, in templateParams
parameters = [expandTemplates(p, depth) for p in parameters]
File "wikiextractor/WikiExtractor.py", line 331, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "wikiextractor/WikiExtractor.py", line 799, in expandTemplate
params = templateParams(parts[1:], depth+1)
File "wikiextractor/WikiExtractor.py", line 423, in templateParams
parameters = [expandTemplates(p, depth) for p in parameters]
File "wikiextractor/WikiExtractor.py", line 331, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "wikiextractor/WikiExtractor.py", line 838, in expandTemplate
ret = expandTemplates(template, depth + 1)
File "wikiextractor/WikiExtractor.py", line 338, in expandTemplates
res += text[cur:]
MemoryError
I had 8 GB of memory reserved for the process.
Got the same error as cifkao. Memory error after article no. 1908699 Lepospondyli
I have committed a new version that should fix the memory problems. I completely revised the strategy of parameter evaluation. For example, in article no. 3616279 Arthrodira, there was a very deep dendrogram whose expansion was exponential in its depth. Now parameters are expanded before substitution, which solves the problem. Please try it.
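The difference between the two evaluation orders can be seen in a toy model. This is only a sketch of the general idea, not the WikiExtractor code: invocations are modelled as nested tuples rather than parsed wikitext, the single hypothetical template "T" uses its parameter twice in its body, and the counters exist only to show how much work each strategy does.

```python
def nest(n):
    """Build n nested invocations: T(T(...T("x")...))."""
    node = "x"
    for _ in range(n):
        node = ("T", node)
    return node


def expand_lazy(node, counter):
    """Substitute the raw argument first, then expand every copy of it."""
    if isinstance(node, str):
        return node
    counter[0] += 1
    _, arg = node
    # The body mentions its parameter twice, so the unexpanded argument
    # is copied twice and each copy is expanded independently.
    return expand_lazy(arg, counter) + expand_lazy(arg, counter)


def expand_eager(node, counter):
    """Expand the argument once, then substitute its value."""
    if isinstance(node, str):
        return node
    counter[0] += 1
    _, arg = node
    value = expand_eager(arg, counter)  # expanded exactly once
    return value + value


lazy_calls, eager_calls = [0], [0]
tree = nest(10)
same = expand_lazy(tree, lazy_calls) == expand_eager(tree, eager_calls)
print(same, lazy_calls[0], eager_calls[0])
```

For a nesting depth of n, the lazy order performs 2**n - 1 template expansions while the eager order performs n, and both produce the same text.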
To all of you who complained about memory or speed problems with WikiExtractor, I released a new version that performs better and keeps a cache of parsed templates.
I have tested it on the English Wikipedia and it runs 5 times faster, while using 20% more memory (4GB). There were also numerous bug fixes.
There is also a new command line option --xml that attempts to produce HTML instead of pure text, preserving headings, lists and links.
Thank you for your patience.
-- Beppe Attardi
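One way to picture the cache of parsed templates mentioned above is the following sketch, which parses each template body once and reuses the result for every later invocation. The template store, the function names and the use of functools.lru_cache are assumptions made for the example; the actual release may organise its cache differently.

```python
import re
from functools import lru_cache

# Hypothetical template store (name -> raw body with {{{param}}} placeholders).
TEMPLATE_SOURCE = {
    "Greeting": "Hello, {{{1}}}! Welcome, {{{1}}}.",
}

PARAM = re.compile(r"\{\{\{(\w+)\}\}\}")


@lru_cache(maxsize=None)
def parse_template(name):
    """Split a body into alternating literal text and parameter names.

    The result is cached, so each template is parsed only once.
    """
    # re.split with one capturing group yields: literal, param, literal, ...
    return tuple(PARAM.split(TEMPLATE_SOURCE[name]))


def instantiate(name, args):
    """Fill a parsed template with argument values (missing ones become '')."""
    out = []
    for i, piece in enumerate(parse_template(name)):
        out.append(piece if i % 2 == 0 else args.get(piece, ""))
    return "".join(out)
```

Keeping every parsed template in memory trades space for time, which is in line with the figures reported above (faster, but with somewhat higher memory use).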
Can you explain why this error occurs? I used the updated version of the script uploaded yesterday. Now it's giving this error.
Traceback (most recent call last):
File "./WikiExtractor.py", line 1797, in
main()
File "./WikiExtractor.py", line 1793, in main
process_data(input_file, args.templates, output_splitter)
File "./WikiExtractor.py", line 1621, in process_data
extract(id, title, page, output)
File "./WikiExtractor.py", line 132, in extract
text = clean(text)
File "./WikiExtractor.py", line 1256, in clean
text = expandTemplates(text)
File "./WikiExtractor.py", line 307, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 808, in expandTemplate
ret = expandTemplates(template, depth + 1)
File "./WikiExtractor.py", line 307, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 769, in expandTemplate
params = templateParams(parts[1:], depth)
File "./WikiExtractor.py", line 396, in templateParams
parameters = [expandTemplates(p, frame) for p in parameters]
File "./WikiExtractor.py", line 307, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 769, in expandTemplate
params = templateParams(parts[1:], depth)
File "./WikiExtractor.py", line 396, in templateParams
parameters = [expandTemplates(p, frame) for p in parameters]
File "./WikiExtractor.py", line 307, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 808, in expandTemplate
ret = expandTemplates(template, depth + 1)
File "./WikiExtractor.py", line 313, in expandTemplates
res += text[cur:]
MemoryError