WillKoehrsen / wikipedia-data-science

Working with and analyzing Wikipedia Data

SAXParseException: <unknown>:2:0: syntax error #4

Open sidgitind opened 5 years ago

sidgitind commented 5 years ago

Hi @WillKoehrsen, I am trying to run the code in the notebook and ran into an error in the XML parser code. I am getting a SAX error at this code snippet:

    handler = WikiXmlHandler()

    # Parsing object
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    # Iteratively process file
    handler._pages

    for l in lines[-165:-109]:
        parser.feed(l)

The stack trace is as follows:

    ExpatError                                Traceback (most recent call last)
    myprojectpath\lib\xml\sax\expatreader.py in feed(self, data, isFinal)
        216                 # except when invoked from close.
    --> 217                 self._parser.Parse(data, isFinal)
        218             except expat.error as e:

    ExpatError: syntax error: line 2, column 0

    During handling of the above exception, another exception occurred:

    SAXParseException                         Traceback (most recent call last)
    in ()
         40
         41 for l in lines[-165:-109]:
    ---> 42     parser.feed(l)
         43
         44 print(handler._pages)

    myprojectpath\lib\xml\sax\expatreader.py in feed(self, data, isFinal)
        219                 exc = SAXParseException(expat.ErrorString(e.code), e, self)
        220                 # FIXME: when to invoke error()?
    --> 221                 self._err_handler.fatalError(exc)
        222
        223     def _close_source(self):

    myprojectpath\lib\xml\sax\handler.py in fatalError(self, exception)
         36     def fatalError(self, exception):
         37         "Handle a non-recoverable error."
    ---> 38         raise exception
         39
         40     def warning(self, exception):

    SAXParseException: <unknown>:2:0: syntax error

I appreciate your time in helping me proceed and resolve this issue. Thanks.

sidgitind commented 5 years ago

The dump I am using is enwiki-20190101-pages-articles-multistream.xml.bz2, and my machine is a Windows 10 laptop.

sandertan commented 4 years ago

@sidgitind As far as I understand, the part that you're feeding into the parser is not valid XML because of the slice of lines you selected: [-165:-109]. That slice was valid for the 20180901 release used in the example, but not for the 20190101 release. Could you try a different slice, or experiment with removing the slice altogether?
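
To illustrate the point about the slice, here is a minimal sketch using a few synthetic lines rather than a real dump. `TitleHandler` is a hypothetical stand-in for the notebook's `WikiXmlHandler`, and `first_page_slice` is an illustrative helper, not something from the repository. It assumes the lines are already decoded to `str` and that `<page>` and `</page>` each sit on their own line, as in the sample data here.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical stand-in for WikiXmlHandler: collects <title> text."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self.titles.append("".join(self._buffer))


def parse_lines(lines):
    """Feed lines to a fresh SAX parser and return the titles found."""
    handler = TitleHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for line in lines:
        parser.feed(line)
    return handler.titles


def first_page_slice(lines):
    """Illustrative helper: lines from the first <page> through its </page>."""
    start = next(i for i, line in enumerate(lines) if "<page>" in line)
    end = next(i for i in range(start, len(lines)) if "</page>" in lines[i])
    return lines[start:end + 1]


# Synthetic stand-in for a handful of dump lines (assumption, not real data).
lines = [
    "<mediawiki>\n",
    "<page>\n",
    "<title>Alpha</title>\n",
    "</page>\n",
    "<page>\n",
    "<title>Beta</title>\n",
    "</page>\n",
    "</mediawiki>\n",
]

# A slice with hard-coded offsets can begin in the middle of the structure.
try:
    parse_lines(lines[3:6])  # starts at a stray </page>
except xml.sax.SAXParseException as exc:
    print("Bad slice:", exc)

# Aligning the slice to a <page> boundary gives the parser a valid fragment.
print(parse_lines(first_page_slice(lines[3:])))
```

Fixed offsets such as `[-165:-109]` only line up with a page boundary in the dump they were picked for, so searching for the boundary is likely to carry over between releases more reliably than reusing the numbers.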