fmatthies closed this issue 6 years ago
com.ximpleware.ParseException: Error in text content: Invalid char in text content Line Number: 45 Offset: 89
at com.ximpleware.VTDGen.handleOtherTextChar(VTDGen.java:5160) ~[vtd-xml-2.11.jar:na]
at com.ximpleware.VTDGen.parse(VTDGen.java:2474) ~[vtd-xml-2.11.jar:na]
at de.julielab.jcore.reader.xmlmapper.mapper.XMLMapper.parse(XMLMapper.java:95) ~[jcore-xml-mapper-2.2.0.jar:na]
at de.julielab.jules.reader.DBMedlineReader.getNext(DBMedlineReader.java:199) [jules-medline-reader-3.0.2.jar:na]
at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(ArtifactProducer.java:494) [uimaj-cpe-2.5.0.jar:2.5.0]
at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(ArtifactProducer.java:711) [uimaj-cpe-2.5.0.jar:2.5.0]
The issue is as follows: We use VTD-XML to import Medline XML into the database. VTD cannot handle Unicode supplementary characters properly, so the resulting XML contains invalid characters (control sequences or similar). Supplementary characters start at code point U+10000 and therefore need more than 16 bits, so they are represented as surrogate pairs in which each surrogate has 16 bits. VTD only uses the lower 16 bits. This is why the above error is thrown when we try to parse such corrupted character streams from the database, and it means we have to resolve the issue BEFORE importing into the database. One possible solution is to replace such characters with a placeholder like ###UNICODE_SUPP_CHAR_XXX###, where XXX is a unique identifier such as a counter. A file could then be written that maps each unique identifier to the original supplementary character. After all work with VTD is done, the placeholders would be replaced by the original characters.
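To make the placeholder idea concrete, here is a minimal sketch in plain Java. The class `SupplementaryCharMasker` and its methods are hypothetical names made up for illustration; they are not part of VTD-XML or of any JULIE Lab tool. The sketch keeps the identifier-to-character mapping in memory instead of writing it to a file, and it assumes the input text does not already contain strings that look like the placeholder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of VTD-XML or the JULIE Lab tools.
final class SupplementaryCharMasker {
    private static final String PREFIX = "###UNICODE_SUPP_CHAR_";
    private static final String SUFFIX = "###";

    // Maps the counter id to the original supplementary character (a String holding the surrogate pair).
    private final Map<Integer, String> idToChar = new LinkedHashMap<>();
    private int counter = 0;

    // Replaces every code point >= U+10000 with a numbered placeholder.
    String mask(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                int id = counter++;
                idToChar.put(id, new String(Character.toChars(cp)));
                out.append(PREFIX).append(id).append(SUFFIX);
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, 1 otherwise
        }
        return out.toString();
    }

    // Puts the original characters back in place of the placeholders.
    String unmask(String text) {
        String result = text;
        for (Map.Entry<Integer, String> e : idToChar.entrySet()) {
            result = result.replace(PREFIX + e.getKey() + SUFFIX, e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        SupplementaryCharMasker masker = new SupplementaryCharMasker();
        // U+1D6FC (mathematical italic small alpha) lies above U+FFFF.
        String original = "dose of " + new String(Character.toChars(0x1D6FC)) + "-tocopherol";
        String masked = masker.mask(original);
        System.out.println(masked);
        System.out.println(masker.unmask(masked).equals(original));
    }
}
```

In this sketch, `mask` would run before the XML is handed to VTD and `unmask` after all VTD processing is finished. Persisting the mapping (for example as one id/character pair per line) would be needed if the two steps run in different processes.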
We currently have an internal VTD-XML version that I put together following the instructions of the VTD-XML author. With it, the JUnit test written to demonstrate the faulty Unicode behavior passes. I built a new version of julie-medline-manager using this version of VTD-XML and started importing the Medline XML from scratch. All pipelines should be updated to this version:
For julie-medline-manager:
<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>julie-medline-manager</artifactId>
    <version>1.1.0-SNAPSHOT</version>
</dependency>
If julie-xml-tools is used directly:
<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>julie-xml-tools</artifactId>
    <version>0.3.2-SNAPSHOT</version>
</dependency>
Afterwards, we have to check the documents in question manually to see whether everything is fine.
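Part of that check could be automated. The following is a rough sketch, not the check VTD-XML itself performs: it scans a document string for the first character that is not allowed by the XML 1.0 `Char` production, which also catches unpaired surrogates. The class name `XmlCharCheck` and its methods are hypothetical.

```java
// Hypothetical helper for illustration only; it is not part of VTD-XML.
final class XmlCharCheck {

    // XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the char offset of the first invalid character, or -1 if the text is clean.
    static int firstInvalidOffset(String text) {
        for (int i = 0; i < text.length(); ) {
            // For an unpaired surrogate, codePointAt returns the surrogate value itself,
            // which falls in the excluded range D800-DFFF.
            int cp = text.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String suspicious = "text with a control char " + (char) 0x1 + " inside";
        System.out.println(firstInvalidOffset(suspicious));
    }
}
```

Running such a scan over the documents that previously triggered the ParseException would narrow down which ones still need a manual look.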
@khituras What is the status here?
The newest version of VTD includes the fix. This issue is resolved.