Rothamsted / knetbuilder

KnetBuilder data integration platform for building knowledge graphs. Previously known as ondex.
https://knetminer.com
MIT License

Medline parser error #19

Closed by KeywanHP 5 years ago

KeywanHP commented 5 years ago

We have not yet managed to narrow down which PubMed ID triggers this exception:

https://github.com/Rothamsted/ondex-knet-builder/blob/4d8e129386e2f9340beb9f9bfaad1028ecf0f224/modules/textmining/src/main/java/net/sourceforge/ondex/parser/medline2/xml/XMLParser.java#L443

2019-06-13 09:56:57,133 [main] DEBUG net.sourceforge.ondex.parser.medline2.Parser - Error: PubMed/MEDLINE import did not finish! String ']]>' not allowed in textual content, except as the end marker of CDATA section at [row,col {unknown-source}]: [36442,570] [CLASS:net.sourceforge.ondex.parser.medline2.Parser - METHOD:start LINE:169]
com.ctc.wstx.exc.WstxParsingException: String ']]>' not allowed in textual content, except as the end marker of CDATA section at [row,col {unknown-source}]: [36442,570]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:612)
    at com.ctc.wstx.sr.StreamScanner.throwWfcException(StreamScanner.java:461)
    at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4544)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2842)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1048)
    at com.ctc.wstx.sr.BasicStreamReader.getElementText(BasicStreamReader.java:653)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parseAbstract(XMLParser.java:443)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parseMedlineCitation(XMLParser.java:324)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parse(XMLParser.java:278)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.lambda$parsePummedID$0(XMLParser.java:225)
    at uk.ac.ebi.utils.runcontrol.MultipleAttemptsExecutor.executeChecked(MultipleAttemptsExecutor.java:90)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parsePummedID(XMLParser.java:216)
    at net.sourceforge.ondex.parser.medline2.Parser.start(Parser.java:151)
    at net.sourceforge.ondex.workflow.engine.Engine.runParser(Engine.java:422)
    at net.sourceforge.ondex.workflow.engine.PluginProcessor$5.run(PluginProcessor.java:135)
    at net.sourceforge.ondex.workflow.engine.PluginProcessor$5.run(PluginProcessor.java:133)
    at net.sourceforge.ondex.workflow.engine.PluginProcessor.execute(PluginProcessor.java:83)
    at net.sourceforge.ondex.workflow.engine.BasicJobImpl.run(BasicJobImpl.java:110)
    at net.sourceforge.ondex.WorkflowMain.main(WorkflowMain.java:216)
    at net.sourceforge.ondex.OndexMiniMain.main(OndexMiniMain.java:7)
2019-06-13 09:56:57,137 [main] DEBUG net.sourceforge.ondex.workflow.engine.Engine - Medline/PubMed took 903 seconds [CLASS:net.sourceforge.ondex.workflow.engine.Engine - METHOD:runParser LINE:424]

marco-brandizi commented 5 years ago

@KeywanHP, please attach the failing input XML to this ticket. From the message, all I can tell is that the problem is probably caused by some abstract text containing a sequence like ]]>, which the XML parser apparently interprets as the closure of a never-opened CDATA section.

If that is the problem, a quick solution is to edit the input XML manually and replace ]]> with its escaped version: ]]&gt;. If we have more than just a few cases, we need to do this automatically, wrapping the original InputStream with a filter, as is done here.

Jakarta Commons has useful escaping functions.
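To make the "wrap the stream with a filter" idea concrete, here is a minimal sketch of what such a filter could look like. The class name `CdataEndEscapingReader` and its `escape` helper are hypothetical, not part of the Ondex code base, and the sketch assumes the input contains no genuine CDATA sections, because it escapes every `]]>` it sees.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/**
 * Hypothetical sketch: rewrites every "]]>" in a character stream to "]]&gt;".
 * Assumption: the input has no real CDATA sections, since their end markers
 * would be escaped as well.
 */
final class CdataEndEscapingReader extends FilterReader {
    // Characters already decided on, waiting to be handed to the caller.
    private final StringBuilder pending = new StringBuilder();

    CdataEndEscapingReader(Reader in) {
        super(new PushbackReader(in, 2)); // two characters of look-ahead
    }

    @Override
    public int read() throws IOException {
        if (pending.length() > 0) {
            char c = pending.charAt(0);
            pending.deleteCharAt(0);
            return c;
        }
        PushbackReader pr = (PushbackReader) in;
        int c = pr.read();
        if (c != ']') return c;
        int c2 = pr.read();
        if (c2 != ']') {
            if (c2 != -1) pr.unread(c2);
            return c;
        }
        int c3 = pr.read();
        if (c3 == '>') {
            pending.append("]&gt;"); // rest of the escaped sequence
            return ']';
        }
        // Not "]]>": push back, so the second ']' can start a new match.
        if (c3 != -1) pr.unread(c3);
        pr.unread(c2);
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    /** Convenience helper for trying the filter on an in-memory string. */
    static String escape(String text) {
        try (Reader r = new CdataEndEscapingReader(new StringReader(text))) {
            StringBuilder out = new StringBuilder();
            for (int c = r.read(); c != -1; c = r.read()) out.append((char) c);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a parser, the wrapped reader would be handed to the StAX factory in place of the raw stream, for example through `XMLInputFactory.createXMLStreamReader(Reader)`, after decoding the bytes with an `InputStreamReader` in the right charset. The sketch favours clarity over speed and leaves `skip` and `mark` untouched.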

KeywanHP commented 5 years ago

We are using the on-the-fly XML retrieval feature of the Medline parser, so it's not possible to edit the XML by hand and a proper fix is needed.

`

true
<Arg name="graphId">default</Arg>

`

https://github.com/Rothamsted/ondex-knet-builder/blob/4d8e129386e2f9340beb9f9bfaad1028ecf0f224/modules/textmining/src/main/java/net/sourceforge/ondex/parser/medline2/Parser.java#L138

marco-brandizi commented 5 years ago

I've found the cause of the problem; a summary follows.

Our CDATA wrapper is based on sed and line-by-line processing, so cases like the above aren't recognised as the beginning of a tag. Later, the end of the same tag is still recognised as a tag end and replaced with ]]></AbstractText (i.e., the CDATA closure). This produces malformed XML and the error at issue.
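To illustrate the failure mode, here is a rough sketch of a wrapper that treats the opening and closing tags as a pair over the whole text rather than line by line, so that a closing `]]>` is only emitted when the matching `<![CDATA[` was emitted too. It is not the project's sed script, nor necessarily the fix that was adopted. The class name `AbstractTextCdataWrapper` is hypothetical, and the sketch assumes that attribute values contain no `>` and that the text being processed fits in memory.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: wraps the body of each AbstractText element in CDATA. */
final class AbstractTextCdataWrapper {
    // Group 1: opening tag, whose attributes may span several lines
    //          ([^>] also matches line breaks); self-closing tags are skipped.
    // Group 2: element body, matched lazily up to the first closing tag.
    // Group 3: closing tag.
    private static final Pattern ABSTRACT_TEXT = Pattern.compile(
        "(<AbstractText(?:\\s[^>]*)?(?<!/)>)(.*?)(</AbstractText>)",
        Pattern.DOTALL);

    static String wrap(String xml) {
        Matcher m = ABSTRACT_TEXT.matcher(xml);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start());
            // A literal "]]>" inside the body would end the CDATA section early,
            // so split it across two adjacent CDATA sections.
            String body = m.group(2).replace("]]>", "]]]]><![CDATA[>");
            out.append(m.group(1))
               .append("<![CDATA[").append(body).append("]]>")
               .append(m.group(3));
            last = m.end();
        }
        return out.append(xml, last, xml.length()).toString();
    }
}
```

Because the opening and closing markers are written in the same step, an opening tag that the pattern fails to recognise leaves the whole element untouched instead of producing a dangling `]]>`.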

Solutions:

Just for the record, I'm attaching here the initial list of PMIDs from which I've isolated this case.