Medline parser error - Githubissues

KeywanHP commented 5 years ago

We have not yet managed to narrow it down to the PubMed Id that throws this exception:

https://github.com/Rothamsted/ondex-knet-builder/blob/4d8e129386e2f9340beb9f9bfaad1028ecf0f224/modules/textmining/src/main/java/net/sourceforge/ondex/parser/medline2/xml/XMLParser.java#L443

2019-06-13 09:56:57,133 [main] DEBUG net.sourceforge.ondex.parser.medline2.Parser - Error: PubMed/MEDLINE import did not finish! String ']]>' not allowed in textual content, except as the end marker of CDATA section at [row,col {unknown-source}]: [36442,570] [CLASS:net.sourceforge.ondex.parser.medline2.Parser - METHOD:start LINE:169]
com.ctc.wstx.exc.WstxParsingException: String ']]>' not allowed in textual content, except as the end marker of CDATA section at [row,col {unknown-source}]: [36442,570]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:612)
    at com.ctc.wstx.sr.StreamScanner.throwWfcException(StreamScanner.java:461)
    at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4544)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2842)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1048)
    at com.ctc.wstx.sr.BasicStreamReader.getElementText(BasicStreamReader.java:653)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parseAbstract(XMLParser.java:443)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parseMedlineCitation(XMLParser.java:324)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parse(XMLParser.java:278)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.lambda$parsePummedID$0(XMLParser.java:225)
    at uk.ac.ebi.utils.runcontrol.MultipleAttemptsExecutor.executeChecked(MultipleAttemptsExecutor.java:90)
    at net.sourceforge.ondex.parser.medline2.xml.XMLParser.parsePummedID(XMLParser.java:216)
    at net.sourceforge.ondex.parser.medline2.Parser.start(Parser.java:151)
    at net.sourceforge.ondex.workflow.engine.Engine.runParser(Engine.java:422)
    at net.sourceforge.ondex.workflow.engine.PluginProcessor$5.run(PluginProcessor.java:135)
    at net.sourceforge.ondex.workflow.engine.PluginProcessor$5.run(PluginProcessor.java:133)
    at net.sourceforge.ondex.workflow.engine.PluginProcessor.execute(PluginProcessor.java:83)
    at net.sourceforge.ondex.workflow.engine.BasicJobImpl.run(BasicJobImpl.java:110)
    at net.sourceforge.ondex.WorkflowMain.main(WorkflowMain.java:216)
    at net.sourceforge.ondex.OndexMiniMain.main(OndexMiniMain.java:7)
2019-06-13 09:56:57,137 [main] DEBUG net.sourceforge.ondex.workflow.engine.Engine - Medline/PubMed took 903 seconds [CLASS:net.sourceforge.ondex.workflow.engine.Engine - METHOD:runParser LINE:424]

marco-brandizi commented 5 years ago

@KeywanHP, please attach the input XML that fails to this ticket. From the message, all I can tell is that the problem is probably caused by some abstract text containing a sequence like ]]>, which the XML parser apparently interprets as the closure of a never-opened CDATA section.

If that is the problem, a quick solution is to edit the input XML manually and replace ]]> with its escaped version: ]]&gt;. If there are more than just a few cases, we need to do this automatically, by wrapping the original InputStream with a filter, like it's done here.

Jakarta Commons has useful escaping functions.
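
To make the escaping idea concrete, here is a minimal sketch of what such a filter would do to the text. It works on an in-memory String rather than wrapping an InputStream, and the class and method names are hypothetical, not taken from the Ondex code base. It assumes the only place a legitimate ]]> can occur is at the end of a CDATA section (comments and processing instructions that contain ]]> would also be rewritten).

```java
final class StrayCdataEndEscaper {
    private static final String CDATA_START = "<![CDATA[";
    private static final String CDATA_END = "]]>";

    /** Escapes every "]]>" that does not close a CDATA section. */
    static String escape(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        boolean inCdata = false;
        int pos = 0;
        while (pos < xml.length()) {
            if (!inCdata && xml.startsWith(CDATA_START, pos)) {
                // Entering a real CDATA section: copy the marker unchanged.
                inCdata = true;
                out.append(CDATA_START);
                pos += CDATA_START.length();
            } else if (xml.startsWith(CDATA_END, pos)) {
                // Inside CDATA this is the genuine end marker;
                // outside it is stray text, so escape the '>'.
                out.append(inCdata ? CDATA_END : "]]&gt;");
                inCdata = false;
                pos += CDATA_END.length();
            } else {
                out.append(xml.charAt(pos++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<AbstractText>a[b[0]]>c</AbstractText>"));
        System.out.println(escape("<AbstractText><![CDATA[x > y]]></AbstractText>"));
    }
}
```

A streaming version would need to keep the same two-state logic while buffering the last few characters across read boundaries.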

KeywanHP commented 5 years ago

We are using the on-the-fly XML retrieval feature of the Medline parser, so it's not possible to edit the XML manually and a proper fix is needed.

`

true

<Arg name="graphId">default</Arg>

`

https://github.com/Rothamsted/ondex-knet-builder/blob/4d8e129386e2f9340beb9f9bfaad1028ecf0f224/modules/textmining/src/main/java/net/sourceforge/ondex/parser/medline2/Parser.java#L138

marco-brandizi commented 5 years ago

Cause of the problem found, a summary follows.

E-Fetch returns abstracts with multi-line attribute values:
```
...
<AbstractText Label="
METHODS
" NlmCategory="UNASSIGNED">We used Arabidopsis plants...</AbstractText>
```
For instance, this currently happens for PMID:30535180.

Our CDATA wrapper is based on sed and line-by-line processing, so cases like the above aren't recognised as the beginning of a tag. Later, the end of the same element is still recognised as a closing tag and replaced with ]]></AbstractText (i.e., the CDATA closure). This produces malformed XML, with a CDATA end marker that has no matching opening, and hence the error at issue.
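
The failure mode can be reproduced with a small Java stand-in for the wrapper. This is only a sketch: the two regular expressions are assumed simplifications of the real sed rules (which are not shown in this ticket), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class LineBasedCdataWrapDemo {
    // Assumed stand-ins for the sed rules: the opening rule only fires
    // when the whole start tag sits on a single line.
    private static final Pattern OPEN = Pattern.compile("(<AbstractText[^>]*>)");
    private static final Pattern CLOSE = Pattern.compile("</AbstractText>");

    /** Applies both substitutions to one line, as sed would. */
    static String wrapLine(String line) {
        String s = OPEN.matcher(line).replaceAll("$1<![CDATA[");
        return CLOSE.matcher(s).replaceAll("]]></AbstractText>");
    }

    /** Processes the document one line at a time. */
    static String wrapLineByLine(String xml) {
        List<String> out = new ArrayList<>();
        for (String line : xml.split("\n", -1)) {
            out.add(wrapLine(line));
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        String oneLine =
            "<AbstractText Label=\"METHODS\">We used Arabidopsis plants...</AbstractText>";
        String multiLine =
            "<AbstractText Label=\"\nMETHODS\n\" NlmCategory=\"UNASSIGNED\">"
            + "We used Arabidopsis plants...</AbstractText>";
        // Single-line start tag: CDATA is opened and closed.
        System.out.println(wrapLineByLine(oneLine));
        // Multi-line start tag: only the closing ]]> is inserted.
        System.out.println(wrapLineByLine(multiLine));
    }
}
```

In the multi-line case no line contains a complete start tag, so the opening marker is never written, while the closing rule still matches on the last line.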

Solutions:

- For the moment, we give up parsing the only PMID where this happens.
- We'll write to the NCBI people to ask if they could fix those attributes, since it doesn't make sense for them to contain line breaks.
- Maybe something can be done via xmllint.
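
As a possible automatic workaround in the same spirit, a pre-pass could replace line breaks inside start tags with spaces before the line-based wrapper runs. An XML parser normalises line breaks in attribute values to spaces anyway, so the parsed attribute value should not change. The sketch below is an assumption-laden illustration, not existing Ondex code: the class name is hypothetical, it works on the whole document as a String, and it assumes attribute values never contain a literal > character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StartTagLineBreakNormaliser {
    // A start tag: '<', a name-start character, then anything up to the
    // first '>' (the negated class also matches line breaks).
    private static final Pattern START_TAG = Pattern.compile("<[A-Za-z_][^<>]*>");
    private static final Pattern LINE_BREAK = Pattern.compile("\\r\\n|\\r|\\n");

    /** Replaces line breaks inside start tags with a single space each. */
    static String normalise(String xml) {
        Matcher m = START_TAG.matcher(xml);
        StringBuilder out = new StringBuilder(xml.length());
        int last = 0;
        while (m.find()) {
            // Copy the text before the tag untouched.
            out.append(xml, last, m.start());
            // Rewrite only the tag itself.
            out.append(LINE_BREAK.matcher(m.group()).replaceAll(" "));
            last = m.end();
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String input =
            "<AbstractText Label=\"\nMETHODS\n\" NlmCategory=\"UNASSIGNED\">"
            + "We used\nArabidopsis plants...</AbstractText>";
        System.out.println(normalise(input));
    }
}
```

Line breaks in the element text are left alone, since only the matched start tags are rewritten.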

Just for the record, I'm attaching here the initial list of PMIDs from which I've isolated this case.

Rothamsted / knetbuilder

Medline parser error #19