dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING" #487

Open EllysahatNMI opened 8 years ago

EllysahatNMI commented 8 years ago

Hi! I followed the step-by-step instructions but encountered this error after running ../clean-install-run. I used enwiki-20161001-pages-articles.xml.bz2, and I know it's a huge file, but how do I get around this error? I tried putting -DentityExpansionLimit=2147480000 in clean-install-run like this: mvn -DentityExpansionLimit=2147480000 ... but I still get the same error. Please help me.

jimkont commented 8 years ago

I also noticed this problem at some point, but I thought it was a dump problem.

I see a few references around, like:

https://github.com/elastic/stream2es/issues/65
https://github.com/Wikidata/Wikidata-Toolkit/issues/243
https://github.com/Wikidata/Wikidata-Toolkit/pull/244

Can you test the following? Edit the run script, add the argument described in https://github.com/elastic/stream2es/issues/65#issuecomment-250917529, and see if it works.

EllysahatNMI commented 8 years ago

I already tried putting jdk.xml.totalEntitySizeLimit and totalEntitySizeLimit as indicated in elastic/stream2es#65 (comment), but I still get the same error. It stops when the import is at around 600,000+ pages.

I edited the clean-install-run script like this:

mvn -Djdk.xml.totalEntitySizeLimit=2147480000 $BATCH -f ../pom.xml clean && . ../install-run "$@"

I'm out of ideas. Please help.

jimkont commented 8 years ago

Can you edit the run script? clean-install-run calls install-run, which in turn calls run.

EllysahatNMI commented 8 years ago

Hi. Sorry, I'm a little confused about where I should put it in the run script.

This is how I did it: mvn $MAVEN_DEBUG $BATCH scala:run "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS" "-Djdk.xml.totalEntitySizeLimit=2147480000"

Is it correct?

jimkont commented 8 years ago

This is the line: https://github.com/dbpedia/extraction-framework/blob/master/run#L45. Not sure if the jdk.xml package prefix is needed, so maybe add both: -Djdk.xml.totalEntitySizeLimit=2147480000 -DtotalEntitySizeLimit=2147480000

EllysahatNMI commented 8 years ago

I tried doing as you said, but the same error appears.

I edited https://github.com/dbpedia/extraction-framework/blob/master/run#L45 like this: mvn $MAVEN_DEBUG -DtotalEntitySizeLimit=2147480000 -Djdk.xml.totalEntitySizeLimit=2147480000 $BATCH scala:run "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS"

jimkont commented 8 years ago

Sorry again, let's make a final check.

Try putting this here: https://github.com/dbpedia/extraction-framework/blob/master/dump/pom.xml#L64-L71. This is a JVM argument, so we should pass it either from the pom.xml or via MAVEN_OPTS; I'm not sure if passing it directly to Maven would work.
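
To illustrate why (a quick standalone sketch, not framework code; the property name is taken from the links above): the JDK's built-in parser reads jdk.xml.totalEntitySizeLimit from the system properties of the JVM that does the parsing, at parser-creation time. A -D flag on the mvn command line only reaches Maven's own JVM, so if scala:run forks a separate launcher process (which the <jvmArgs> config suggests), that process never sees it.

// Standalone sketch (assumptions: Java 8, the JDK's built-in JAXP/SAX parser).
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.helpers.DefaultHandler
import java.io.ByteArrayInputStream

object EntityLimitCheck {
  def main(args: Array[String]): Unit = {
    // Raise the 50,000,000-byte secure-processing default; 0 would lift the
    // cap entirely. Setting the system property here, before the parser is
    // created, is equivalent to passing the -D flag as a jvmArg of this JVM.
    System.setProperty("jdk.xml.totalEntitySizeLimit", "2147480000")

    // Build a document whose expanded entities total ~51,000,000 bytes, just
    // over the default limit, so it fails with JAXP00010004 unless the
    // property above is set in this JVM.
    val entity = "x" * 1000000                        // a 1 MB general entity
    val xml = s"""<?xml version="1.0"?><!DOCTYPE r [<!ENTITY e "$entity">]>""" +
      "<r>" + ("&e;" * 51) + "</r>"

    SAXParserFactory.newInstance().newSAXParser()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), new DefaultHandler)
    println("parsed ~51 MB of expanded entities without hitting the limit")
  }
}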

EllysahatNMI commented 8 years ago

Still the same error when I put this:

    <jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
    <jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>

EllysahatNMI commented 8 years ago

I put it in this line: https://github.com/dbpedia/extraction-framework/blob/master/dump/pom.xml#L39-L41

Here's the snippet:

<jvmArgs>
    <jvmArg>-server</jvmArg>
    <jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
    <jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
</jvmArgs>

And it worked! Thank you so much! :)

jimkont commented 8 years ago

Great! Can you do a final check to see which of the two arguments is the one needed? You can either tell us here or make a PR directly.

Thanks!

chile12 commented 8 years ago

Thanks guys! This problem also popped up when extracting abstracts.

clanstyles commented 8 years ago

I'm trying to use 'download.10000.properties' and then 'extraction.default.properties'.

I'm receiving the same error.

Message: JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".

I've tried the solutions above; I set extraction-framework/dump/pom.xml's jvmArgs:

<mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
<jvmArgs>
    <jvmArg>-server</jvmArg>
    <jvmArg>-DtotalEntitySizeLimit=0</jvmArg>
    <jvmArg>-Djdk.xml.totalEntitySizeLimit=0</jvmArg>
</jvmArgs>

No luck fixing it. I then tried modifying the ../run script:

if [[ $SLACK != false && $SlackUrl == https://hooks.slack.com/services* ]] ;
then
  #mvn $MAVEN_DEBUG $BATCH scala:run "-Dlauncher=SlackForwarder"  "-DaddArgs=$SlackUrl|$SlackRegexMap|$LogDir|$1|$SLACK" &
  sleep 5
  PID="$(ps ax | grep java | grep extraction.scripts.SlackForwarder | tail -1 | sed -n -E 's/([0-9]+).*/\1/p' | xargs)"
  echo $PID
  mvn $MAVEN_DEBUG -B scala:run "-Djdk.xml.totalEntitySizeLimit=0" "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS" &> "/proc/$PID/fd/0"
else
  mvn $MAVEN_DEBUG $BATCH scala:run "-Djdk.xml.totalEntitySizeLimit=0" "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS"
fi

You'll notice I added "-Djdk.xml.totalEntitySizeLimit=0". According to the docs, 0 is supposed to set it to unlimited. I also tried it with the limits you listed above; that didn't work either.

chile12 commented 8 years ago

Interesting. I just ran an import across 130 languages without a hitch. Here is the launcher setting I used:

                    <launcher>
                        <id>import</id>
                        <mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
                        <jvmArgs>
                            <jvmArg>-server</jvmArg>
                            <jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
                            <jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
                        </jvmArgs>
                        <args>
                            <!-- base folder of downloaded dumps -->
                            <arg>/data/extraction-data/2016-04</arg>
                            <!-- location of SQL file containing MediaWiki table definitions -->
                            <arg>/home/extractor/mediawikiTables.sql</arg>
                            <!-- JDBC URL of MySQL server. Import creates a new database for 
                                each wiki. -->
                            <arg>jdbc:mysql://localhost/?characterEncoding=UTF-8&amp;user=root</arg>
                            <!-- require-download-complete -->
                            <arg>true</arg> 
                            <!-- file name: pages-articles.xml{,.bz2,.gz} -->
                            <arg>pages-articles.xml.bz2</arg>
                            <!-- number of parallel imports; this number depends on the number of processors in use
                                and the type of hard disc (hhd/ssd) and how many parallel file reads it can support -->
                            <arg>16</arg>
                            <!-- languages and article count ranges, comma-separated, e.g. "de,en" 
                                or "@mappings" etc. -->
                            <arg>@downloaded</arg>
                        </args>
                    </launcher>
clanstyles commented 8 years ago

Was that Java 7 or 8? I was using Java 8, and I read that this limit was added in Java 8. I'm now testing with Java 7.

chile12 commented 8 years ago

Using Java 8 as well. Is this still causing problems for you?

clanstyles commented 8 years ago

@chile12 I'm still processing all of Wikipedia (2 days later), but it's working with no issues. I switched to Java 7.

EllysahatNMI commented 8 years ago

Hi guys. Sorry for the late update.

So I just confirmed that this argument is the correct one:

-Djdk.xml.totalEntitySizeLimit=2147480000

                    <launcher>
                        <id>import</id>
                        <mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
                        <jvmArgs>
                            <jvmArg>-server</jvmArg>
                            <jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
                        </jvmArgs>
                        <args>
                            <!-- base folder of downloaded dumps -->
                            <arg>/data/extraction-data/2016-04</arg>
                            <!-- location of SQL file containing MediaWiki table definitions -->
                            <arg>/home/extractor/mediawikiTables.sql</arg>
                            <!-- JDBC URL of MySQL server. Import creates a new database for 
                                each wiki. -->
                            <arg>jdbc:mysql://localhost/?characterEncoding=UTF-8&amp;user=root</arg>
                            <!-- require-download-complete -->
                            <arg>true</arg> 
                            <!-- file name: pages-articles.xml{,.bz2,.gz} -->
                            <arg>pages-articles.xml.bz2</arg>
                            <!-- number of parallel imports; this number depends on the number of processors in use
                                and the type of hard disc (hhd/ssd) and how many parallel file reads it can support -->
                            <arg>16</arg>
                            <!-- languages and article count ranges, comma-separated, e.g. "de,en" 
                                or "@mappings" etc. -->
                            <arg>@downloaded</arg>
                        </args>
                    </launcher>
roland-c commented 7 years ago

I ran into this problem with the Wikidata extractor, with these settings:

<jvmArgs>
    <jvmArg>-server</jvmArg>
    <jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
    <jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
</jvmArgs>

It is possible to remove this limitation entirely by setting the value to 0; the extractor runs fine now. See http://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.aix.70.doc/diag/appendixes/cmdline/Djdkxmltotalentitysizelimit.html
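
For reference, the corresponding launcher setting with the cap lifted (same shape as the snippets above; per EllysahatNMI's check, the jdk.xml-prefixed property should be the only one needed):

<jvmArgs>
    <jvmArg>-server</jvmArg>
    <jvmArg>-Djdk.xml.totalEntitySizeLimit=0</jvmArg>
</jvmArgs>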

chile12 commented 7 years ago

Thanks for the update, Roland. Best to integrate this into all the POM files.

manzoorali29 commented 5 years ago

Can anyone tell me the final solution? I'm still stuck on this and have tried all the solutions above, but no luck. Thanks!