Open sidgitind opened 5 years ago
The dump I am using is enwiki-20190101-pages-articles-multistream.xml.bz2, and my machine is a Windows 10 laptop.
@sidgitind As far as I understand, the part that you're feeding into the parser is not valid XML, because of the subset of lines: [-165:-109]. This subset of lines was valid for the 20180901 release used in the example, but not for the 20190101 release. Could you try a different subset, or experiment with removing the subset?
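To illustrate why a hard-coded slice can break between releases, here is a minimal sketch, not the notebook's code. `TitleHandler`, `parse_lines`, `page_aligned`, and the `sample` lines are all hypothetical stand-ins, and the `<root>` wrapper is an assumption made only so the fragment has a single root element. It uses `str` lines for brevity, whereas lines read from the compressed dump would normally be bytes. The idea is to trim the slice so it starts at a `<page>` line and ends at a `</page>` line instead of relying on fixed offsets.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical minimal handler: collects the text of every <title>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def parse_lines(lines):
    """Feed lines to an incremental SAX parser and return the titles found."""
    handler = TitleHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for line in lines:
        parser.feed(line)
    parser.close()
    return handler.titles


def page_aligned(lines):
    """Trim to whole pages (first <page> .. last </page>) and add a root."""
    starts = [i for i, l in enumerate(lines) if l.strip().startswith("<page>")]
    ends = [i for i, l in enumerate(lines) if l.strip().startswith("</page>")]
    if not starts or not ends or ends[-1] < starts[0]:
        return []  # no complete page in this slice
    return ["<root>\n"] + lines[starts[0]:ends[-1] + 1] + ["</root>\n"]


# Made-up lines imitating a slice that starts and ends in the middle of a page.
sample = [
    "      <text>tail of a previous page</text>\n",
    "    </revision>\n",
    "  </page>\n",
    "  <page>\n",
    "    <title>Alpha</title>\n",
    "  </page>\n",
    "  <page>\n",
    "    <title>Beta</title>\n",
    "  </page>\n",
    "  <page>\n",
    "    <title>Cut off</title>\n",
]

try:
    parse_lines(sample)
except xml.sax.SAXParseException as exc:
    print("Unaligned slice failed:", exc)

print(parse_lines(page_aligned(sample)))
```

With the real dump, printing the first and last few lines of the slice is a quick way to check whether it still lines up with page boundaries in the release you downloaded.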
Hi @WillKoehrsen, I am trying to run the code in the notebook and ran into an error in the XML parser code. I am getting a SAX error at this code snippet:

    handler = WikiXmlHandler()

    # Parsing object
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    # Iteratively process file
    handler._pages
    for l in lines[-165:-109]:
        parser.feed(l)

The stack trace is as follows
ExpatError                                Traceback (most recent call last)
myprojectpath\lib\xml\sax\expatreader.py in feed(self, data, isFinal)
    216             # except when invoked from close.
--> 217             self._parser.Parse(data, isFinal)
    218         except expat.error as e:
ExpatError: syntax error: line 2, column 0
During handling of the above exception, another exception occurred:
SAXParseException Traceback (most recent call last)