XML parse error

GoogleCodeExporter commented 9 years ago

=> What steps will reproduce the problem?
1. Run the parser with the following code:
public static void main(String[] args) throws Exception {
    File dumpFile = new File(PATH_TO_DUMP_FILE);
    File outputDirectory = new File(TARGET_DIRECTORY);
    boolean overwriteExisting = OVERWRITE_EXISTING_FILES;

    JWKTL.parseWiktionaryDump(dumpFile, outputDirectory, overwriteExisting);
}

2. Use either of the two latest dump files from Wiktionary:
     enwiktionary-20140504-pages-articles.xml
     enwiktionary-latest-pages-articles.xml

=> What is the expected output? What do you see instead?
INFO: Parsed 775000 pages
Exception in thread "main" de.tudarmstadt.ukp.jwktl.api.WiktionaryException: XML parse error
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:140)
    at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:74)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:143)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:117)
    at ParsingData.main(ParsingData.java:12)
Caused by: org.xml.sax.SAXParseException; lineNumber: 34869191; columnNumber: 5; Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:131)
    ... 4 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    ... 14 more

=> What version of the product are you using? On what operating system?
   <dependency>
     <groupId>de.tudarmstadt.ukp.jwktl</groupId>
     <artifactId>jwktl</artifactId>
     <version>1.0.0</version>
   </dependency>

=> Please provide any additional information below.
However, the parsing went through successfully with the dump file 
"enwiktionary-20140415-pages-articles.xml"

Original issue reported on code.google.com by n...@fbk.eu on 21 May 2014 at 1:34

GoogleCodeExporter commented 9 years ago

It seems that someone has typed a non-UTF-8 character into a Wiktionary article,
and the database dump application has not cleaned it up. That is, latest.xml
contains a 4-byte sequence that does not follow the expected UTF-8 pattern
(11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). The best solution is to remove these
invalid characters.
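
To make the pattern above concrete, here is a minimal sketch of a bit-level check
for it. The class and method names are hypothetical and not part of JWKTL or
Xerces. It tests only the lead-byte and continuation-byte bit patterns, so it is
looser than a full UTF-8 validator (it does not reject overlong encodings or
code points above U+10FFFF).

```java
public final class Utf8FourByteCheck {

    private Utf8FourByteCheck() {
    }

    /**
     * Returns true if the four bytes starting at off match the bit pattern
     * 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Hypothetical helper for illustration.
     */
    public static boolean matchesFourBytePattern(byte[] b, int off) {
        if (off < 0 || b.length - off < 4) {
            return false; // not enough bytes for a 4-byte sequence
        }
        if ((b[off] & 0xF8) != 0xF0) {
            return false; // lead byte must be 11110xxx
        }
        for (int i = 1; i <= 3; i++) {
            if ((b[off + i] & 0xC0) != 0x80) {
                return false; // continuation bytes must be 10xxxxxx
            }
        }
        return true;
    }
}
```

An error such as "Invalid byte 2 of 4-byte UTF-8 sequence" is consistent with a
lead byte of the form 11110xxx followed by a byte that this kind of check
rejects.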

A quick&dirty hack might be 
http://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file

A more elaborate idea is 
http://www.mkyong.com/java/sax-error-malformedbytesequenceexception-invalid-byte-1-of-1-byte-utf-8-sequence/

I don't know which of these helps, or whether there is an easy way of stripping
non-UTF-8 characters from the input file. It would be nice if you could report
back and potentially submit a patch, since I suspect that other users will run
into the same issue.
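
One possible way to strip the invalid bytes in Java, sketched here as an
assumption rather than a fix confirmed in this thread, is to copy the dump
through a lenient `CharsetDecoder` before parsing. The class name `Utf8Sanitizer`
and its method are hypothetical; only standard `java.io` and `java.nio.charset`
calls are used. Malformed byte sequences are replaced with U+FFFD; using
`CodingErrorAction.IGNORE` instead would drop them.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class Utf8Sanitizer {

    private Utf8Sanitizer() {
    }

    /**
     * Copies in to out, replacing every malformed UTF-8 byte sequence with
     * U+FFFD. Both streams are closed when the copy finishes.
     */
    public static void sanitize(InputStream in, OutputStream out) throws IOException {
        // Lenient decoder: substitute instead of throwing on bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try (Reader reader = new BufferedReader(new InputStreamReader(in, decoder));
             Writer writer = new BufferedWriter(
                     new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }
}
```

The cleaned copy, written for example through a `FileOutputStream`, could then
be passed to `JWKTL.parseWiktionaryDump` in place of the original dump file.
Whether this resolves the parse error for the affected dumps has not been
verified here.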

Original comment by chmeyer.de on 21 May 2014 at 10:03

Changed state: Accepted
Added labels: Component-Parser

GoogleCodeExporter commented 9 years ago

Probably fixed with jwktl-1.0.1. Please try again using the new version. Note
that AFAIK the UTF-8 bug still exists, but either the current Xerces version
ignores it or the current XML dump has been fixed w.r.t. this issue.

Original comment by chmeyer.de on 30 Sep 2014 at 12:27

Changed state: Fixed

JohnSatriano / jwktl

XML parse error #6