dkpro / dkpro-jwpl

DKPro JWPL (DKPro Java Wikipedia Library) is a free, Java-based application programming interface that facilitates access to all information in Wikipedia.
https://dkpro.github.io/dkpro-jwpl
Apache License 2.0

The entity number exceed #154

Closed keepRunning2017 closed 6 years ago

keepRunning2017 commented 6 years ago

When I used de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies to parse the Chinese Wikipedia dumps, it reported that the entity number exceeded 50000000. Does anyone know how to solve this problem?

daxenberger commented 6 years ago

Which "entity number"? Can you post the entire error message or give more details?

ZiyaoLu commented 6 years ago

I think I have hit the same problem as keepRunning2017. Neither "de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar" nor "de.tudarmstadt.ukp.wikipedia.datamachine-1.1.0-jar-with-dependencies.jar" works on the latest English Wikipedia data, and I have no idea how to solve this. Here is my entire error message:

########################################################
"Date/Time","Total Memory","Free Memory","Message"
"2018.04.22 00:07:27","2058354688","1983119176","parse input dumps..."
"2018.04.22 00:07:27","2058354688","1983119176","Discussions are unavailable"
"2018.04.22 00:09:45","1853358080","1830626552","org.xml.sax.SAXParseException; lineNumber: 61983422; columnNumber: 164; JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".

de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:209)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:47)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:70)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:64)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:64)"
########################################################

zesch commented 6 years ago

This is a known issue - see here: https://github.com/dkpro/dkpro-jwpl/issues/144
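
For readers who land here with the same JAXP00010004 message: the 50,000,000 limit in the error corresponds to the JDK's `jdk.xml.totalEntitySizeLimit` processing limit. The following is a minimal sketch, not the fix discussed in issue #144 (which is not quoted in this thread), assuming the dump file is trusted and that lifting the limit is acceptable. The class and method names (`EntityLimitSketch`, `liftEntityLimit`, `parses`) are hypothetical and are not part of JWPL; only the property name and the standard `javax.xml.parsers` calls come from the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration; not part of JWPL.
final class EntityLimitSketch {

    // JDK processing limit behind the JAXP00010004 message.
    // A value of "0" (or any non-positive value) means "no limit".
    static final String TOTAL_ENTITY_SIZE_LIMIT = "jdk.xml.totalEntitySizeLimit";

    // Must run before the SAX parser is created, because the JDK
    // reads the property when it sets up the parser.
    static void liftEntityLimit() {
        System.setProperty(TOTAL_ENTITY_SIZE_LIMIT, "0");
    }

    // Parses an XML string with a default (no-op) handler and
    // reports whether parsing finished without an exception.
    static boolean parses(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        liftEntityLimit();
        // Tiny stand-in document; a real Wikipedia dump is far larger.
        System.out.println(parses("<page><title>Test &amp; check</title></page>"));
    }
}
```

When the parsing code cannot be changed, as with a prebuilt jar, the same property can be passed to the JVM on the command line as `-Djdk.xml.totalEntitySizeLimit=0`. Disabling the limit removes a safeguard against entity-expansion attacks, so this is only reasonable for input you trust.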