Closed: KeywanHP closed this issue 5 years ago
@KeywanHP, please attach the failing input XML to this ticket. From the message, I can only infer that the problem is caused by some abstract text containing a sequence like `]]>`, which the XML parser apparently interprets as the closure of a never-opened CDATA section.
If that is the problem, a quick solution is to edit the input XML manually and replace `]]>` with its escaped version, `]]&gt;`. If there are more than just a few cases, we need to do this automatically, wrapping the original InputStream with a filter, like it's done here.
Jakarta Commons has useful escaping functions.
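A minimal sketch of the automatic replacement described above, in Java. The class and method names (`CdataEndEscaper`, `escapeCdataEnd`, `wrap`) are hypothetical and not part of the Ondex code base. It assumes UTF-8 input with no genuine CDATA sections (a legitimate `]]>` terminator would be escaped too), and it buffers the whole document in memory, so a real streaming filter would be preferable for large downloads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class CdataEndEscaper {

    /**
     * Replaces every literal "]]>" with "]]&gt;".
     * Assumption: the input contains no real CDATA sections.
     */
    public static String escapeCdataEnd(String xml) {
        return xml.replace("]]>", "]]&gt;");
    }

    /**
     * Reads the whole stream (assumed UTF-8), escapes it and returns
     * a new in-memory stream that the XML parser can consume.
     */
    public static InputStream wrap(InputStream in) throws IOException {
        String xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        byte[] escaped = escapeCdataEnd(xml).getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(escaped);
    }
}
```

The parser would then be given `CdataEndEscaper.wrap(originalStream)` instead of the original stream. `InputStream.readAllBytes()` requires Java 9 or later.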
We are using the on-the-fly XML retrieval feature of the Medline parser, so it's not possible to edit the XML manually and a proper fix is needed.
`
<Arg name="graphId">default</Arg>
`
Cause of the problem found; a summary follows.
...
<AbstractText Label="
METHODS
" NlmCategory="UNASSIGNED">We used Arabidopsis plants...</AbstractText>
For instance, this currently happens for PMID:30535180
Our CDATA wrapper is based on sed and line-by-line processing, so cases like the above, where the opening tag spans several lines, aren't recognised as the beginning of a tag. Later, the closing tag of the same element is still recognised and replaced with `]]></AbstractText` (i.e., the CDATA closure). This produces malformed XML, with a CDATA end marker that has no matching start, and hence the error at issue.
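To illustrate the difference, here is a rough sketch of a CDATA wrapper that matches over the whole document rather than line by line, so an opening tag split across lines is still recognised. It is not the project's actual fix. The class name `AbstractCdataWrapper` and its `wrap` method are hypothetical, and the regex assumes `AbstractText` elements are not nested and that attribute values contain no `>` character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AbstractCdataWrapper {

    // [^>]* also matches newlines, so a multi-line opening tag is covered.
    // (?<!/) skips self-closing <AbstractText/> elements.
    // DOTALL lets .*? span multi-line abstract text.
    private static final Pattern ABSTRACT_TEXT = Pattern.compile(
        "(<AbstractText\\b[^>]*(?<!/)>)(.*?)(</AbstractText>)",
        Pattern.DOTALL);

    /** Wraps the content of every AbstractText element in a CDATA section. */
    public static String wrap(String xml) {
        Matcher m = ABSTRACT_TEXT.matcher(xml);
        return m.replaceAll(r -> {
            // A "]]>" inside the text would end the CDATA section early,
            // so split it across two adjacent CDATA sections.
            String body = r.group(2).replace("]]>", "]]]]><![CDATA[>");
            String wrapped = r.group(1) + "<![CDATA[" + body + "]]>" + r.group(3);
            // Protect '$' and '\' in the abstract from being read as group references.
            return Matcher.quoteReplacement(wrapped);
        });
    }
}
```

Because the opening and closing tags are matched as a pair, a closing tag never receives a `]]>` unless the corresponding `<![CDATA[` was also emitted. `Matcher.replaceAll` with a function argument requires Java 9 or later.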
Solutions:
Just for the record, I'm attaching here the initial list of PMIDs from which I've isolated this case.
We have not yet managed to narrow it down to the PubMed Id that throws this exception:
https://github.com/Rothamsted/ondex-knet-builder/blob/4d8e129386e2f9340beb9f9bfaad1028ecf0f224/modules/textmining/src/main/java/net/sourceforge/ondex/parser/medline2/xml/XMLParser.java#L443
2019-06-13 09:56:57,133 [main] DEBUG net.sourceforge.ondex.parser.medline2.Parser - Error: PubMed/MEDLINE import did not finish! String ']]>' not allowed in textual content, except as the end marker of CDATA section at [row,col {unknown-source}]: [36442,570] [CLASS:net.sourceforge.ondex.parser.medline2.Parser - METHOD:start LINE:169]
com.ctc.wstx.exc.WstxParsingException: String ']]>' not allowed in textual content, except as the end marker of CDATA section at [row,col {unknown-source}]: [36442,570]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:612)
    at com.ctc.wstx.sr.StreamScanner.throwWfcException(StreamScanner.java:461)
    at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4544)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2842)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1048)
    at com.ctc.wstx.sr.BasicStreamReader.getElementText(BasicStreamReader.java:653)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parseAbstract(XMLParser.java:443)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parseMedlineCitation(XMLParser.java:324)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parse(XMLParser.java:278)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.lambda$parsePummedID$0(XMLParser.java:225)
    at uk.ac.ebi.utils.runcontrol.MultipleAttemptsExecutor.executeChecked(MultipleAttemptsExecutor.java:90)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parsePummedID(XMLParser.java:216)
    at net.sourceforge.ondex.parser.medline2.Parser.start(Parser.java:151)
    at net.sourceforge.ondex.workflow.engine.Engine.runParser(Engine.java:422)
    at net.sourceforge.ondex.workflow.engine.PluginProcessor$5.run(PluginProcessor.java:135)
    at net.sourceforge.ondex.workflow.engine.PluginProcessor$5.run(PluginProcessor.java:133)
    at net.sourceforge.ondex.workflow.engine.PluginProcessor.execute(PluginProcessor.java:83)
    at net.sourceforge.ondex.workflow.engine.BasicJobImpl.run(BasicJobImpl.java:110)
    at net.sourceforge.ondex.WorkflowMain.main(WorkflowMain.java:216)
    at net.sourceforge.ondex.OndexMiniMain.main(OndexMiniMain.java:7)
2019-06-13 09:56:57,137 [main] DEBUG net.sourceforge.ondex.workflow.engine.Engine - Medline/PubMed took 903 seconds [CLASS:net.sourceforge.ondex.workflow.engine.Engine - METHOD:runParser LINE:424]
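One possible way to narrow a failure like this down to a PubMed ID is to remember the last `PMID` element seen while streaming and report it when the parser throws. The sketch below uses the standard `javax.xml.stream` API that ships with the JDK rather than the project's own parser classes. The class name `FailingPmidFinder` and its method are hypothetical, and it assumes the article's own `PMID` element precedes its abstract in the stream.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FailingPmidFinder {

    /**
     * Streams through the XML and returns the text of the last PMID element
     * read before a parsing error occurred, or null if the document parses
     * cleanly (or fails before any PMID has been seen).
     */
    public static String findFailingPmid(Reader xml) {
        String lastPmid = null;
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Avoid fetching external DTDs while debugging.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "PMID".equals(reader.getLocalName())) {
                    lastPmid = reader.getElementText();
                }
            }
            return null;
        } catch (XMLStreamException ex) {
            // ex.getLocation() also carries the row and column of the failure.
            return lastPmid;
        }
    }
}
```

Since other `PMID` elements can appear later in a record (for example in reference lists), the returned ID is only a hint about where to look, not a guaranteed match.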