blazegraph / database

Blazegraph High Performance Graph Database
GNU General Public License v2.0

Slow DataLoader Performance #9

Open mstabile75 opened 8 years ago

mstabile75 commented 8 years ago

I've been trying various ways to load a ~677 million triple .nt file, with no luck. My last attempt was to use Linux split to break the file into chunks of 90k triples each and run the DataLoader from the command line. Everything seems fine at the start, ~23k triples per sec loading. Then, several hours later, only a trickle of triples is going through. I saw the same performance problem without splitting the file.
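The chunking step described above can be sketched with GNU split (filenames here are hypothetical stand-ins; a tiny sample file is generated just for illustration). Because N-Triples is line-oriented, splitting on line boundaries leaves every chunk a valid .nt file:

```shell
# Generate a tiny stand-in .nt file (10 triples) for illustration:
printf '<s%d> <p> <o> .\n' $(seq 1 10) > sample.nt
mkdir -p batch
# Split into 3-line chunks; with the real data this would be: split -l 90000 viaf.nt batch/chunk_
split -l 3 sample.nt batch/chunk_
ls batch
```

The DataLoader can then be pointed at the `batch/` directory, as in the command line below.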

I'm using the latest blazegraph.jar file from sourceforge. Here is the command-line, properties file and partial terminal output.

Any help/suggestions would be appreciated. Thanks!

***************** command line *************************
java -cp *:*.jar com.bigdata.rdf.store.DataLoader -verbose -durableQueues /home/stabiledev/viaf/fastload.properties /home/stabiledev/viaf/batch &

****************** properties file **************************

# This configuration turns off incremental inference for load and retract, so
# you must explicitly force these operations if you want to compute the closure
# of the knowledge base.  Forcing the closure requires punching through the SAIL
# layer.  Of course, if you are not using inference then this configuration is
# just the ticket and is quite fast.

# set the journal file
com.bigdata.journal.AbstractJournal.file=/home/stabiledev/viaf/try2/viaf_data.jnl

# set the initial and maximum extent of the journal
com.bigdata.journal.AbstractJournal.initialExtent=209715200
com.bigdata.journal.AbstractJournal.maximumExtent=209715200

# turn off automatic inference in the SAIL
com.bigdata.rdf.sail.truthMaintenance=false

# don't store justification chains, meaning retraction requires full manual 
# re-closure of the database
com.bigdata.rdf.store.AbstractTripleStore.justify=false

# turn off the statement identifiers feature for provenance
com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false

# turn off the free text index
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false

# RWStore (scalable single machine backend)
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW

# Turn closure off
com.bigdata.rdf.store.DataLoader.closure=None
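As a quick sanity check on the extent settings above: both extents are 209715200 bytes, i.e. exactly 200 MiB, which is small relative to a 677M-triple load. The per-triple disk footprint below is a ballpark assumption for illustration, not a measured Blazegraph figure:

```python
# Extent from the properties file, in bytes.
initial_extent = 209_715_200
print(initial_extent / 2**20)   # 200.0 -> exactly 200 MiB

# Rough journal-size estimate for the full load, ASSUMING ~60 bytes
# per stored triple across all indices (a guess, not a measured value):
triples = 677_000_000
print(triples * 60 / 2**30)     # roughly 37.8 GiB
```

The RWStore journal grows well past the initial extent during a load of this size, so the extents mainly affect allocation behavior early on.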

******************* terminal output ************************************

Reading properties: /home/stabiledev/viaf/fastload.properties
Will load from: /home/stabiledev/viaf/batch
Journal file: /home/stabiledev/viaf/try2/viaf_data.jnl
WARN : AbstractBTree.java:2167: Bloom filter disabled - maximum error rate would be exceeded: entryCount=1883228, factory=BloomFilterFactory{ n=1000000, p=0.02, maxP=0.15, maxN=1883227}
loading: 14130000 stmts added in 610.778 secs, rate= 23134, commitLatency=0ms, {failSet=0,goodSet=156}
loading: 22410000 stmts added in 1217.57 secs, rate= 18405, commitLatency=0ms, {failSet=0,goodSet=248}
loading: 28170000 stmts added in 1819.2 secs, rate= 15484, commitLatency=0ms, {failSet=0,goodSet=312}
loading: 34380000 stmts added in 2420.144 secs, rate= 14205, commitLatency=0ms, {failSet=0,goodSet=381}
loading: 38340000 stmts added in 3032.121 secs, rate= 12644, commitLatency=0ms, {failSet=0,goodSet=425}
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.POS, 1 records (#nodes=1, #leaves=0) in 5211ms : addrRoot=-145370130275106556
loading: 43380000 stmts added in 3642.039 secs, rate= 11910, commitLatency=0ms, {failSet=0,goodSet=481}
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.POS, 4 records (#nodes=1, #leaves=3) in 5247ms : addrRoot=-74144699035680380
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.OSP, 1 records (#nodes=1, #leaves=0) in 5248ms : addrRoot=-74144703330647628
loading: 47070000 stmts added in 4261.653 secs, rate= 11045, commitLatency=0ms, {failSet=0,goodSet=522}
loading: 50670000 stmts added in 4866.663 secs, rate= 10411, commitLatency=0ms, {failSet=0,goodSet=562}

...

loading: 96570000 stmts added in 12807.511 secs, rate= 7540, commitLatency=0ms, {failSet=0,goodSet=1072}
loading: 100620000 stmts added in 13421.764 secs, rate= 7496, commitLatency=0ms, {failSet=0,goodSet=1117}
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.POS, 7 records (#nodes=3, #leaves=4) in 5026ms : addrRoot=-620060906650337000
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.OSP, 3 records (#nodes=1, #leaves=2) in 6001ms : addrRoot=-701707543457562350
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.POS, 9 records (#nodes=5, #leaves=4) in 6001ms : addrRoot=-127697761486241280
loading: 105750000 stmts added in 14024.126 secs, rate= 7540, commitLatency=0ms, {failSet=0,goodSet=1174}

...

loading: 144720000 stmts added in 28388.47 secs, rate= 5097, commitLatency=0ms, {failSet=0,goodSet=1607}
loading: 145530000 stmts added in 29031.825 secs, rate= 5012, commitLatency=0ms, {failSet=0,goodSet=1616}

...

loading: 171360000 stmts added in 57245.59 secs, rate= 2993, commitLatency=0ms, {failSet=0,goodSet=1903}
loading: 171810000 stmts added in 57962.026 secs, rate= 2964, commitLatency=0ms, {failSet=0,goodSet=1908}
beebs-systap commented 8 years ago

@mstabile75 One thing to try is to configure the branching factors based on a partial load of the data. It's very possible that the shape of your data is causing some inefficiencies in the underlying storage.

Try loading part of your data, running DumpJournal, and then updating the properties file with the new branching factors and reloading.

https://wiki.blazegraph.com/wiki/index.php/IOOptimization#Branching_Factors
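A sketch of that workflow, assuming `blazegraph.jar` is in the current directory and reusing the journal path from the original report (the sample directory name is hypothetical; see the wiki page above for the authoritative invocation):

```shell
# 1. Load a representative sample, e.g. a few of the split chunk files.
java -cp blazegraph.jar com.bigdata.rdf.store.DataLoader \
    fastload.properties sample_batch/

# 2. Dump per-index page statistics from the resulting journal; the
#    -pages report includes recommended branching factors per index,
#    which can then be copied into the properties file.
java -cp blazegraph.jar com.bigdata.journal.DumpJournal -pages \
    /home/stabiledev/viaf/try2/viaf_data.jnl
```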

mstabile75 commented 8 years ago

Thanks, I'll try this out this week and post the results.

KMax commented 7 years ago

Hi @mstabile75, did the branching factors help you? Thanks!

KMax commented 7 years ago

Okay, I tried to configure the branching factors based on the output of com.bigdata.journal.DumpJournal, but the speed (triples/sec) was even worse compared to another run with only the global com.bigdata.btree.BTree.branchingFactor=256.

I tried to load ~2bln triples using a VM with 4 CPUs, 26 GB RAM, and a 700 GB local SSD.

Did I set the branching factors correctly? Is there anything that could minimize the effect from the custom branching factors?

Here is the properties file:

# changing the axiom model to none essentially disables all inference
com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.NoAxioms
com.bigdata.rdf.store.AbstractTripleStore.quads=true
com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false

com.bigdata.rdf.store.AbstractTripleStore.geoSpatial=false
com.bigdata.rdf.sail.truthMaintenance=false
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
com.bigdata.rdf.store.AbstractTripleStore.justify=false

# RWStore (scalable single machine backend)
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
com.bigdata.journal.AbstractJournal.file=/blazegraph/db/bigdata.jnl
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=2000

# Enable small slot optimization.
com.bigdata.rwstore.RWStore.smallSlotType=1024
# Set the default B+Tree branching factor.
com.bigdata.btree.BTree.branchingFactor=256
com.bigdata.namespace.__globalRowStore.com.bigdata.btree.BTree.branchingFactor=592
com.bigdata.namespace.kb.lex.BLOBS.com.bigdata.btree.BTree.branchingFactor=2109
com.bigdata.namespace.kb.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=903
com.bigdata.namespace.kb.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=367
com.bigdata.namespace.kb.lex.search.com.bigdata.btree.BTree.branchingFactor=517
com.bigdata.namespace.kb.spo.CSPO.com.bigdata.btree.BTree.branchingFactor=731
com.bigdata.namespace.kb.spo.OCSP.com.bigdata.btree.BTree.branchingFactor=667
com.bigdata.namespace.kb.spo.PCSO.com.bigdata.btree.BTree.branchingFactor=864
com.bigdata.namespace.kb.spo.POCS.com.bigdata.btree.BTree.branchingFactor=816
com.bigdata.namespace.kb.spo.SOPC.com.bigdata.btree.BTree.branchingFactor=630
com.bigdata.namespace.kb.spo.SPOC.com.bigdata.btree.BTree.branchingFactor=604
# Set the default B+Tree retention queue capacity.
com.bigdata.btree.writeRetentionQueue.capacity=4000