sl2902 opened this issue 1 year ago
This library only parses wikitext. You need another library to parse the XML file and extract the wikitext; see e.g. https://stackoverflow.com/questions/16533153/parse-xml-dump-of-a-mediawiki-wiki
The example at that link loads the entire file into memory; that won't be possible with the full dump.
Then you need to find a different parser.
Check out mwxml, a library designed for this specific task (parsing Wikipedia XML dumps):
import mwparserfromhell
import mwxml

file_location = "/path/to/wikipedia/dump.xml"
dump = mwxml.Dump.from_file(open(file_location))

for page in dump:
    for revision in page:
        # revision.text is the raw wikitext; it can be None for some
        # revisions, so fall back to an empty string
        parsed = mwparserfromhell.parse(revision.text or "")
        # do stuff with parsed
The mwxml Dump class is an iterator which reads pages one at a time, so you can avoid loading the whole file at once.
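For the infobox question specifically, here is a minimal sketch of what the "do stuff with parsed" step could look like. It relies only on mwparserfromhell's filter_templates(); the prefix match on "Infobox" is an assumption about how these templates are named on the English Wikipedia, not something the library enforces:

import mwparserfromhell

def extract_infoboxes(wikitext):
    # Keep every template whose name starts with "Infobox"
    # (an assumption about enwiki naming conventions).
    parsed = mwparserfromhell.parse(wikitext)
    return [
        template
        for template in parsed.filter_templates()
        if str(template.name).strip().lower().startswith("infobox")
    ]

Each returned template exposes its fields through template.params, and an individual value can be read with template.get("some_field").value (where "some_field" stands in for whatever parameter you are after).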
Thanks for the library!
I have the latest XML dump file, and I would like to use your library to parse the infoboxes from the dump. However, I don't see any function to stream the file. Could you share an example of how I could pass the content of a page to the mwparserfromhell.parse(text) function so I can extract any infobox?
In case it helps, this is what I have so far:
iter_lines() is a function that uses ET.iterparse() to incrementally parse the XML; it returns a generator.
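Based on that description, here is a sketch of how such a generator could be written with xml.etree.ElementTree. Since the actual iter_lines() code is not shown, the name iter_pages, the element paths, and the export-schema namespace URI are all assumptions (the schema version varies between dumps):

import xml.etree.ElementTree as ET

# Namespace of the MediaWiki export schema; the version number is an
# assumption and should be checked against the dump's root element.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(file_location):
    # Yield (title, wikitext) for each <page>, keeping only the
    # current page in memory.
    context = ET.iterparse(file_location, events=("start", "end"))
    _, root = next(context)  # capture the root element
    for event, elem in context:
        if event == "end" and elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text")
            yield title, text
            root.clear()  # drop processed pages to bound memory use

Each yielded text can then be handed to mwparserfromhell.parse() exactly as in the mwxml example above.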