earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

how do I parse wikipedia dump file? #294

Open sl2902 opened 1 year ago

sl2902 commented 1 year ago

Thanks for the library!

I have the latest XML dump file, and I would like to use your library to parse the infoboxes out of it. However, I don't see any function for streaming the file. Could you share an example of how I could pass the content of a page to the mwparserfromhell.parse(text) function to extract any infobox?

If it helps, this is what I have so far:

for _, elem in iter_lines():
    print(strip_tag_name(elem.tag))
    if strip_tag_name(elem.tag) == 'text':
        print(elem.text)

iter_lines() is a function that uses ET.iterparse() to parse the XML incrementally; it returns a generator.
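(For context, a minimal sketch of what iter_lines() and strip_tag_name() might look like; the dump path is a placeholder.)

import xml.etree.ElementTree as ET

DUMP_PATH = "/path/to/wikipedia/dump.xml"  # placeholder

def strip_tag_name(tag):
    # Drop the "{namespace}" prefix that ElementTree puts on tag names.
    return tag.rsplit("}", 1)[-1]

def iter_lines():
    # Yield (event, element) pairs as iterparse walks the dump incrementally.
    yield from ET.iterparse(DUMP_PATH, events=("end",))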

lahwaacz commented 1 year ago

This library only parses wikitext. You need another library to parse the XML dump and extract the wikitext from it. See e.g. https://stackoverflow.com/questions/16533153/parse-xml-dump-of-a-mediawiki-wiki

sl2902 commented 1 year ago

The example at that link loads the entire file into memory; that won't be possible with a full dump.

lahwaacz commented 1 year ago

Then you need to find a different parser.
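For example, the standard-library xml.etree.ElementTree.iterparse can stream the dump without loading it all at once. A minimal sketch of that approach (the path and the tag-suffix check are assumptions about the dump's schema), clearing the root element as it goes so memory stays bounded:

import xml.etree.ElementTree as ET
import mwparserfromhell

DUMP_PATH = "/path/to/wikipedia/dump.xml"  # placeholder

context = ET.iterparse(DUMP_PATH, events=("start", "end"))
_, root = next(context)  # grab the root element so it can be cleared later

for event, elem in context:
    # Element tags carry a "{namespace}" prefix, so match on the suffix.
    if event == "end" and elem.tag.endswith("text"):
        wikicode = mwparserfromhell.parse(elem.text or "")
        # ... extract templates, links, etc. from wikicode here ...
        root.clear()  # drop already-processed children to bound memory use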

EvanGranthamBrown commented 1 year ago

Check out mwxml, a library designed for this specific task (parsing Wikipedia XML dumps):

import mwxml
import mwparserfromhell

file_location = "/path/to/wikipedia/dump.xml"

dump = mwxml.Dump.from_file(open(file_location))

for page in dump:
    for revision in page:
        parsed = mwparserfromhell.parse(revision.text)
        # do stuff with parsed

The mwxml Dump class is an iterator that reads pages one at a time, so you avoid loading the whole file into memory at once.
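Building on that, here is a hedged sketch of pulling out infoboxes (the original question). The assumption is that infobox templates have names beginning with "Infobox"; the exact matching you need may vary:

import mwxml
import mwparserfromhell

file_location = "/path/to/wikipedia/dump.xml"  # placeholder path

with open(file_location) as f:
    dump = mwxml.Dump.from_file(f)
    for page in dump:
        for revision in page:
            parsed = mwparserfromhell.parse(revision.text or "")
            # Keep templates whose name starts with "Infobox" (naming assumption).
            infoboxes = [
                t for t in parsed.filter_templates()
                if str(t.name).strip().lower().startswith("infobox")
            ]
            for box in infoboxes:
                # Each parameter exposes .name and .value for further processing.
                print(page.title, str(box.name).strip(), len(box.params))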