EllysahatNMI opened this issue 8 years ago
I also noticed this problem at some point, but thought of it as a problem with the dump itself.
I see a few related references: https://github.com/elastic/stream2es/issues/65, https://github.com/Wikidata/Wikidata-Toolkit/issues/243, https://github.com/Wikidata/Wikidata-Toolkit/pull/244
Can you test the following? Edit the run script, add the argument described in https://github.com/elastic/stream2es/issues/65#issuecomment-250917529, and see if it works.
I already tried putting jdk.xml.totalEntitySizeLimit and totalEntitySizeLimit as indicated in elastic/stream2es#65 (comment), but I still get the same error. It stops when the import has already reached around 600,000+ pages.
I edited the clean-install-run script like this:
mvn -Djdk.xml.totalEntitySizeLimit=2147480000 $BATCH -f ../pom.xml clean && . ../install-run "$@"
I'm out of ideas. Please help.
Can you edit the run script? clean-install-run calls install-run, and that calls run in the end.
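If it helps to locate the exact mvn invocation those scripts end up sharing, something like this should find it (run from the dump/ directory, as in the commands above; this is only a convenience check, not part of the fix):

grep -n "scala:run" ../run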
Hi. Sorry, I'm a little confused about where I should put it in the run script.
This is how I did it: mvn $MAVEN_DEBUG $BATCH scala:run "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS" "-Djdk.xml.totalEntitySizeLimit=2147480000"
Is it correct?
this is the line:
https://github.com/dbpedia/extraction-framework/blob/master/run#L45
I'm not sure whether the jdk.xml. prefix is needed, so maybe add both:
-Djdk.xml.totalEntitySizeLimit=2147480000 -DtotalEntitySizeLimit=2147480000
I tried doing as you said, but the same error appears.
I edited https://github.com/dbpedia/extraction-framework/blob/master/run#L45 like this: mvn $MAVEN_DEBUG -DtotalEntitySizeLimit=2147480000 -Djdk.xml.totalEntitySizeLimit=2147480000 $BATCH scala:run "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS"
Sorry again, let's make a final check. Try putting this here: https://github.com/dbpedia/extraction-framework/blob/master/dump/pom.xml#L64-L71. This is a JVM argument, so we should pass it either from the pom.xml or via MAVEN_OPTS; I'm not sure whether setting it directly on the mvn command line works (see the sketch below).
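For the MAVEN_OPTS route, a minimal sketch would be something like the following. (The properties-file argument is only an example taken from elsewhere in this thread, and whether the property actually reaches the JVM that parses the dump depends on how scala:run is configured, so treat this as an experiment rather than a confirmed fix.)

export MAVEN_OPTS="$MAVEN_OPTS -Djdk.xml.totalEntitySizeLimit=2147480000"
../clean-install-run extraction.default.properties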
Still the same error when I put this:
<jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
I put it in this line: https://github.com/dbpedia/extraction-framework/blob/master/dump/pom.xml#L39-L41
Here's the snippet:
<jvmArgs>
<jvmArg>-server</jvmArg>
<jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
<jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
</jvmArgs>
And it worked! Thank you so much! :)
Great! Can you do a final check to see which of the two arguments is the one that's needed? You can either tell us here or make a PR directly.
Thanks!
Thanks guys! This problem also popped up when extracting abstracts.
I'm trying to use 'download.10000.properties' and then 'extraction.default.properties', and I'm receiving the same error.
Message: JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".
I've tried the solutions above; I set extraction-framework/dump/pom.xml's jvmArgs:
<mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
<jvmArgs>
<jvmArg>-server</jvmArg>
<jvmArg>-DtotalEntitySizeLimit=0</jvmArg>
<jvmArg>-Djdk.xml.totalEntitySizeLimit=0</jvmArg>
</jvmArgs>
No luck fixing it. I then tried modifying ../run's script
if [[ $SLACK != false && $SlackUrl == https://hooks.slack.com/services* ]] ;
then
#mvn $MAVEN_DEBUG $BATCH scala:run "-Dlauncher=SlackForwarder" "-DaddArgs=$SlackUrl|$SlackRegexMap|$LogDir|$1|$SLACK" &
sleep 5
PID="$(ps ax | grep java | grep extraction.scripts.SlackForwarder | tail -1 | sed -n -E 's/([0-9]+).*/\1/p' | xargs)"
echo $PID
mvn $MAVEN_DEBUG -B scala:run "-Djdk.xml.totalEntitySizeLimit=0" "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS" &> "/proc/$PID/fd/0"
else
mvn $MAVEN_DEBUG $BATCH scala:run "-Djdk.xml.totalEntitySizeLimit=0" "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS"
fi
You'll notice I added "-Djdk.xml.totalEntitySizeLimit=0". According to the docs, 0 is supposed to set the limit to unlimited. I also tried it with the limits you listed above; that didn't work either.
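One thing worth checking here (a suggestion using standard JDK tooling, not something tried in this thread) is which running JVM actually received the property while the import is in progress; if the extraction process isn't listed, the flag added to the run script or mvn command line never reached it:

# jps -lv lists each Java process together with the -D flags it was started with
jps -lv | grep -i totalEntitySizeLimit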
Interesting. I just ran an import over 130 languages without a hitch. Here is the launcher setting I used:
<launcher>
<id>import</id>
<mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
<jvmArgs>
<jvmArg>-server</jvmArg>
<jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
<jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
</jvmArgs>
<args>
<!-- base folder of downloaded dumps -->
<arg>/data/extraction-data/2016-04</arg>
<!-- location of SQL file containing MediaWiki table definitions -->
<arg>/home/extractor/mediawikiTables.sql</arg>
<!-- JDBC URL of MySQL server. Import creates a new database for
each wiki. -->
<arg>jdbc:mysql://localhost/?characterEncoding=UTF-8&amp;user=root</arg>
<!-- require-download-complete -->
<arg>true</arg>
<!-- file name: pages-articles.xml{,.bz2,.gz} -->
<arg>pages-articles.xml.bz2</arg>
<!-- number of parallel imports; this number depends on the number of processors in use,
the type of hard disk (hdd/ssd), and how many parallel file reads it can support -->
<arg>16</arg>
<!-- languages and article count ranges, comma-separated, e.g. "de,en"
or "@mappings" etc. -->
<arg>@downloaded</arg>
</args>
</launcher>
Was it Java 7 or 8? I was using Java 8, and I read that this limit was added in Java 8. I'm now testing with Java 7.
I'm using Java 8 as well; is this still causing problems for you?
@chile12 I'm still processing all of Wikipedia (2 days later), but it's working with no issues. I switched to Java 7.
Hi guys. Sorry for the late update.
So I just confirmed that this argument is the correct one; the launcher config is below, with an example invocation sketched after it:
<launcher>
<id>import</id>
<mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
<jvmArgs>
<jvmArg>-server</jvmArg>
<jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
</jvmArgs>
<args>
<!-- base folder of downloaded dumps -->
<arg>/data/extraction-data/2016-04</arg>
<!-- location of SQL file containing MediaWiki table definitions -->
<arg>/home/extractor/mediawikiTables.sql</arg>
<!-- JDBC URL of MySQL server. Import creates a new database for
each wiki. -->
<arg>jdbc:mysql://localhost/?characterEncoding=UTF-8&amp;user=root</arg>
<!-- require-download-complete -->
<arg>true</arg>
<!-- file name: pages-articles.xml{,.bz2,.gz} -->
<arg>pages-articles.xml.bz2</arg>
<!-- number of parallel imports; this number depends on the number of processors in use,
the type of hard disk (hdd/ssd), and how many parallel file reads it can support -->
<arg>16</arg>
<!-- languages and article count ranges, comma-separated, e.g. "de,en"
or "@mappings" etc. -->
<arg>@downloaded</arg>
</args>
</launcher>
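For anyone reproducing this: with the launcher above, the import would normally be started through the run script discussed earlier in the thread. The exact invocation below is an assumption based on how that script passes -Dlauncher=$LAUNCHER, so adjust it to your setup:

cd dump
../run import   # "import" selects the <launcher> with <id>import</id> shown above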
I ran into this problem with the Wikidata extractor, with these settings:
<jvmArgs>
<jvmArg>-server</jvmArg>
<jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
<jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
</jvmArgs>
It is possible to remove this limit entirely by setting the value to 0; the extractor runs fine now. See http://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.aix.70.doc/diag/appendixes/cmdline/Djdkxmltotalentitysizelimit.html
Thanks for the update, Roland; it would be best to integrate this into all the POM files (a quick way to find the ones still missing it is sketched below).
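A quick (untested) way to list the pom.xml files that do not yet set the property, so they can all be updated:

grep -rL --include=pom.xml "jdk.xml.totalEntitySizeLimit" .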
Can anyone tell me the final solution? I am still stuck on this and have tried all the solutions above, but no luck. Thanks.
Hi! I followed the step-by-step instructions but encountered this error after running ../clean-install-run. I used enwiki-20161001-pages-articles.xml.bz2, and I know it's a huge file, but how do I get around this error? I tried putting -DentityExpansionLimit=2147480000 in clean-install-run like this: mvn -DentityExpansionLimit=2147480000 ... but I still get the same error. Please help me.