ColdMillenium / jwpl

Automatically exported from code.google.com/p/jwpl

TimeMachine throws exception when facing UTF surrogate character #8

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Create a Wikipedia snapshot for 20090101 or 20080101 from the 20100130 Wikipedia dump (http://dumps.wikimedia.org/enwiki/20100130/).
After revision 7270000, the TimeMachine aborts with the following exception:

Exception in thread "xml2sql" java.lang.RuntimeException: java.io.IOException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at de.tudarmstadt.ukp.wikipedia.timemachine.dump.xml.original.XMLDumpTableInputStreamThread.run(XMLDumpTableInputStreamThread.java:128)
Caused by: java.io.IOException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
    at de.tudarmstadt.ukp.wikipedia.timemachine.dump.xml.original.XMLDumpTableInputStreamThread.run(XMLDumpTableInputStreamThread.java:123)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
    ... 1 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    ... 11 more
Write end dead

This is apparently caused by readDump() in org.mediawiki.importer.XmlDumpReader, which hands the dump to the SAX parser.
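
For context, the characters in question are those outside the Basic Multilingual Plane: they take four bytes in UTF-8 and a surrogate pair (two `char` values) in Java. The following is only a minimal sketch for checking whether the SAX parser on the classpath handles such a character. The class name `SurrogateParseCheck`, the helper `parsesCleanly`, and the choice of U+1D11E are illustrative assumptions and are not part of JWPL or mwdumper. Whether a small input like this triggers the exception above is also an assumption: it may depend on the parser version and on where the character falls in the parser's internal read buffer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class SurrogateParseCheck {

    // U+1D11E (musical symbol G clef) lies outside the Basic Multilingual Plane.
    // Hypothetical test character, chosen only for illustration.
    static final String SUPPLEMENTARY = new String(Character.toChars(0x1D11E));

    /**
     * Wraps the text in a tiny UTF-8 XML document, parses it with whatever
     * SAX parser JAXP selects, and reports whether the text came back intact.
     * The text must not contain XML markup characters such as '<' or '&'.
     */
    static boolean parsesCleanly(String text) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><page>" + text + "</page>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        StringBuilder seen = new StringBuilder();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // The parser may deliver the text in several chunks.
                    seen.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            // A MalformedByteSequenceException would surface here as an IOException.
            return false;
        }
        return seen.toString().equals(text);
    }

    public static void main(String[] args) {
        System.out.println("UTF-16 code units: " + SUPPLEMENTARY.length());
        System.out.println("UTF-8 bytes: "
                + SUPPLEMENTARY.getBytes(StandardCharsets.UTF_8).length);

        // Shift the character through a few byte offsets by growing the padding.
        StringBuilder padding = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            System.out.println("padding=" + i + " parsed cleanly: "
                    + parsesCleanly(padding + SUPPLEMENTARY));
            padding.append('a');
        }
    }
}
```

Running this against the same classpath as the TimeMachine shows at least whether the parser in use decodes a 4-byte sequence at all; a clean result here does not rule out the failure on a multi-gigabyte dump.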

Original issue reported on code.google.com by oliver.ferschke on 11 Feb 2011 at 9:08

GoogleCodeExporter commented 9 years ago
The problem was caused by the xercesImpl version used by mwdumper.
I have created a custom build of mwdumper 1.16, which is available from the Artifactory at https://zoidberg.tk.informatik.tu-darmstadt.de/artifactory/

It uses a patched version of xercesImpl (also available on zoidberg under xerces:xercesImpl-2.9.1-lucene), which includes the patch from https://issues.apache.org/jira/browse/XERCESJ-1257
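
Since the fix depends on the patched xercesImpl actually being the parser that JAXP picks up, it can help to print which implementation is loaded and from where. This is a minimal sketch using only standard JAXP and `java.security` calls; the class name `ParserImplCheck` and the helper `describeParser` are made up for illustration and do not exist in JWPL or mwdumper.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;

class ParserImplCheck {

    /** Returns the SAX parser factory class JAXP selected and where it was loaded from. */
    static String describeParser() {
        Class<?> impl = SAXParserFactory.newInstance().getClass();
        CodeSource source = impl.getProtectionDomain().getCodeSource();
        // Classes shipped with the JDK may report no code source at all.
        String location = (source == null || source.getLocation() == null)
                ? "JDK built-in (no code source reported)"
                : source.getLocation().toString();
        return impl.getName() + " from " + location;
    }

    public static void main(String[] args) {
        System.out.println(describeParser());
    }
}
```

If the reported class is in the `org.apache.xerces.jaxp` package, as in the stack trace above, the location should point at the xercesImpl jar in use, which makes it easy to tell whether the patched artifact or an older copy is first on the classpath.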

Original comment by oliver.ferschke on 11 Apr 2011 at 3:03

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 16 Feb 2012 at 1:24