attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Templates don't get expanded #151

Open dnishiyama opened 6 years ago

dnishiyama commented 6 years ago

Any idea why none of the templates get expanded? To debug, I first ran WikiExtractor.py once and saved all templates to a file (named "templates"; it is 2358539 lines long). I'm trying to extract all Wiktionary articles, but the resulting text looks like this (blank text in place of templates):

" dictionary , from , from , from , perfect past participle of + . For more, see

This was the command I ran:

python WikiExtractor.py -o extracted --debug --templates templates enwiktionary-sample-pages-articles.xml

This was the output:

INFO: Loading template definitions from: templates
INFO: Loaded 74373 templates in 24.6s
INFO: Starting page extraction from enwiktionary-sample-pages-articles.xml.
INFO: Using 7 extract processes.
INFO: 16 dictionary
INFO: 19 free
INFO: 20 thesaurus
DEBUG: EXPAND also|Dictionary
INFO: 27 encyclopedia
DEBUG: Quit extractor
INFO: 29 portmanteau
DEBUG: Quit extractor
DEBUG: <EXPAND Template:Also
DEBUG: EXPAND wikipedia|dab=Dictionary (disambiguation)|Dictionary
DEBUG: <EXPAND Template:Wikipedia
DEBUG: EXPAND PIE root|en|deyḱ
DEBUG: TEMPLATE Template:PIE root: {{catlangname|{{{1|}}}|terms derived from the PIE root {{{2|}}}-{{#if:{{{id|{{{id1|}}}}}}| ({{{id|{{{id1|}}}}}})}}}}{{#if:{{{3|}}}|{{catlangname|{{{1|}}}|terms derived from the PIE root {{{3}}}-{{#if:{{{id2|}}}| ({{{id2|}}})}}}}}}{{#if:{{{4|}}}|{{catlangname|{{{1|}}}|terms derived from the PIE root {{{4}}}-{{#if:{{{id3|}}}| ({{{id3|}}})}}}}}}
DEBUG: EXPAND catlangname|en|terms derived from the PIE root deyḱ-{{#if:| ()}}
DEBUG: <EXPAND Template:Catlangname
DEBUG: EXPAND #if:|{{catlangname|en|terms derived from the PIE root -{{#if:| ()}}}}
DEBUG: EXPAND also|-free
DEBUG: <EXPAND #if
DEBUG: EXPAND #if:|{{catlangname|en|terms derived from the PIE root -{{#if:| ()}}}}
DEBUG: EXPAND also|Thesaurus|thésaurus
DEBUG: <EXPAND #if
DEBUG: <EXPAND Template:PIE root
DEBUG: EXPAND bor|en|ML.|dictionarium|withtext=1
DEBUG: <EXPAND Template:Bor
DEBUG: EXPAND der|en|la|dictionarius
DEBUG: EXPAND was wotd|2007|March|8
DEBUG: <EXPAND Template:Der
DEBUG: EXPAND wikipedia

I have been working on extracting templates for months and this looks like an amazing tool if I can get it to work. Thanks for all the work you all are doing on it!

mhagiwara commented 5 years ago

@dnishiyama Do you still have this issue? I also encountered a similar problem, and it seems that there is an issue with the current script when it's applied to Wiktionary dumps. Specifically, when it expands templates, it tries to "normalize" template titles by converting the first letter of the template to upper case, although template titles are stored without normalization.

After removing those applications of ucfirst things seem to be working correctly for me.
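To make the mismatch described above concrete, here is a minimal sketch in Python. It is not WikiExtractor's actual code: the `templates` dictionary, its keys, and the `lookup_template` helper are hypothetical, and it assumes (as suggested above) that the stored titles keep the case they have in the dump. It tries the title as written first and falls back to the ucfirst form only if that misses.

```python
def ucfirst(title):
    """Upper-case only the first character, leaving the rest untouched."""
    return title[:1].upper() + title[1:]


def lookup_template(templates, name):
    """Hypothetical helper: try the raw title, then the ucfirst-normalized one."""
    name = name.strip()
    for candidate in (name, ucfirst(name)):
        key = "Template:" + candidate
        if key in templates:
            return templates[key]
    return None


# Hypothetical store, keyed by the title exactly as it appears in the dump.
templates = {
    "Template:also": "See also: {{{1}}}",
    "Template:PIE root": "...",
}

# A lookup that always normalizes misses the lower-case title.
print(templates.get("Template:" + ucfirst("also")))

# The tolerant lookup finds it under the raw title.
print(lookup_template(templates, "also"))
```

If the stored keys really are un-normalized, the first lookup misses and the template body comes back empty, which would match the blank gaps in the output above. Whether this is the only cause is not established in this thread.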

dnishiyama commented 5 years ago

Thanks for the reply. I do still have the issue and have since moved on to a different technique to gather this data from Wiktionary (scrapy + bs4). If I get a chance I'll check out your recommendation. This would be a much better option if it does work.

KylePiira commented 5 years ago

I am also encountering this problem on the July 20th, 2018 English Wikipedia dump. Here was my command:

python WikiExtractor.py --o 'articles/' --templates 'templates.temp' --filter_disambig_pages --json 'enwiki.xml'

Here is an example of an incorrectly extracted sentence from Wikipedia Page ID 12.

WikiExtractor Output: The word "anarchism" is composed from the word "anarchy" and the suffix -ism, themselves derived respectively from the Greek , i.e. "anarchy" (from , "anarchos", meaning "one without rulers"; from the privative prefix ἀν- ("an-", i.e. "without") and , "archos", i.e. "leader", "ruler"; (cf. "archon" or , "arkhē", i.e. "authority", "sovereignty", "realm", "magistracy")) and the suffix or ("-ismos", "-isma", from the verbal infinitive suffix , "-izein").

Real Wikipedia Value: The word "anarchism" is composed from the word "anarchy" and the suffix -ism, themselves derived respectively from the Greek ἀναρχία, i.e. anarchy (from ἄναρχος, anarchos, meaning "one without rulers"; from the privative prefix ἀν- (an-, i.e. "without") and ἀρχός, archos, i.e. "leader", "ruler"; (cf. archon or ἀρχή, arkhē, i.e. "authority", "sovereignty", "realm", "magistracy")) and the suffix -ισμός or -ισμα (-ismos, -isma, from the verbal infinitive suffix -ίζειν, -izein).

I've also found other types of template expansions missing such as distance measurements.
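For anyone who wants to gauge how much of their output is affected, here is a rough heuristic sketch in Python. It is an assumption-laden check, not part of WikiExtractor: the function name `count_template_gaps` is made up, and it only looks for the symptom visible in the examples in this thread, namely a word followed by a space and then a comma or semicolon, where a template's text should have been.

```python
import re

# A word character, a space, then a comma or semicolon: the gap that is left
# behind when a template expands to nothing (e.g. "from the Greek , i.e.").
GAP_PATTERN = re.compile(r"\w [,;]")


def count_template_gaps(text):
    """Count suspicious 'word ,' gaps in extracted text (heuristic only)."""
    return len(GAP_PATTERN.findall(text))


broken = '" dictionary , from , from , from , perfect past participle of + . For more, see'
clean = "derived respectively from the Greek ἀναρχία, i.e. anarchy (from ἄναρχος, anarchos)"

print(count_template_gaps(broken))
print(count_template_gaps(clean))
```

This will miss templates that are dropped in other positions (for example the missing distance measurements) and may flag text that legitimately has a space before a comma, so treat the count as a rough signal only.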

wanicca commented 5 years ago

It seems that the template expansions don't work well now. I found a lot of wrongly parsed text in the output.

chaojiang06 commented 4 years ago

Hi,

I found an old version at http://medialab.di.unipi.it/wiki/Wikipedia_Extractor. It works well.

You need to use python2 to run it.

Thanks!

chaojiang06 commented 4 years ago

> @dnishiyama Do you still have this issue? I also encountered a similar problem, and it seems that there is an issue with the current script when it's applied to Wiktionary dumps. Specifically, when it expands templates, it tries to "normalize" template titles by converting the first letter of the template to upper case, although template titles are stored without normalization.
>
> After removing those applications of ucfirst things seem to be working correctly for me.

Hi, thank you for your suggestion! I tried disabling the ucfirst function, basically leaving the string unchanged, but it still doesn't work.

Would you mind sharing the updated code on GitHub? I would really appreciate it.

Thank you!