mstabile75 opened this issue 8 years ago
@mstabile75 One thing to try is to configure the branching factors based on a partial load of the data. It's very possible that the shape of your data is causing some inefficiencies in the underlying storage.
Try loading part of your data, running DumpJournal, and then updating the properties file with the new branching factors before reloading.
https://wiki.blazegraph.com/wiki/index.php/IOOptimization#Branching_Factors
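A rough sketch of the workflow (jar and journal paths are just examples, adjust to your install):

# load a representative sample first, then dump per-index page statistics
java -cp blazegraph.jar com.bigdata.journal.DumpJournal -pages /blazegraph/db/bigdata.jnl

The -pages output reports page size and utilization per index; new branching factors can be estimated from that, per the wiki page above. Put those into your properties file and reload from scratch.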
Thanks, I'll try this out this week and post the results.
Hi @mstabile75, did the branching factors help you? Thanks!
Okay, I tried to configure the branching factors based on the output of com.bigdata.journal.DumpJournal, but the load speed (triples/sec) was even worse compared to another run with only the global com.bigdata.btree.BTree.branchingFactor=256.
I tried to load ~2 billion triples using a VM with 4 vCPUs, 26 GB RAM, and a 700 GB local SSD.
Did I set the branching factors correctly? Is there anything that could be negating the effect of the custom branching factors?
Here is the properties file:
# changing the axiom model to none essentially disables all inference
com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.NoAxioms
com.bigdata.rdf.store.AbstractTripleStore.quads=true
com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false
com.bigdata.rdf.store.AbstractTripleStore.geoSpatial=false
com.bigdata.rdf.sail.truthMaintenance=false
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
com.bigdata.rdf.store.AbstractTripleStore.justify=false
# RWStore (scalable single machine backend)
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
com.bigdata.journal.AbstractJournal.file=/blazegraph/db/bigdata.jnl
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=2000
# Enable small slot optimization.
com.bigdata.rwstore.RWStore.smallSlotType=1024
# Set the default B+Tree branching factor.
com.bigdata.btree.BTree.branchingFactor=256
com.bigdata.namespace.__globalRowStore.com.bigdata.btree.BTree.branchingFactor=592
com.bigdata.namespace.kb.lex.BLOBS.com.bigdata.btree.BTree.branchingFactor=2109
com.bigdata.namespace.kb.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=903
com.bigdata.namespace.kb.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=367
com.bigdata.namespace.kb.lex.search.com.bigdata.btree.BTree.branchingFactor=517
com.bigdata.namespace.kb.spo.CSPO.com.bigdata.btree.BTree.branchingFactor=731
com.bigdata.namespace.kb.spo.OCSP.com.bigdata.btree.BTree.branchingFactor=667
com.bigdata.namespace.kb.spo.PCSO.com.bigdata.btree.BTree.branchingFactor=864
com.bigdata.namespace.kb.spo.POCS.com.bigdata.btree.BTree.branchingFactor=816
com.bigdata.namespace.kb.spo.SOPC.com.bigdata.btree.BTree.branchingFactor=630
com.bigdata.namespace.kb.spo.SPOC.com.bigdata.btree.BTree.branchingFactor=604
# Set the default B+Tree retention queue capacity.
com.bigdata.btree.writeRetentionQueue.capacity=4000
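For reference, the per-index overrides above follow this pattern (namespace kb in my case; I'm inferring the general form from the wiki and my own config):

# <relation> is lex or spo; <INDEX> is the index name as reported by DumpJournal
com.bigdata.namespace.<namespace>.<relation>.<INDEX>.com.bigdata.btree.BTree.branchingFactor=<value>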
I've been trying various ways to load a ~677 million triple .nt file, with no luck. My last attempt was to use Linux split to break the file into chunks of 90k triples each and run the DataLoader from the command line. Everything seems fine at the start, loading at ~23k triples/sec; then, several hours later, only a trickle of triples is going through. I saw the same performance problems without splitting the file.
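The split-and-load step was along these lines (placeholder paths, not my exact command):

# N-Triples is one statement per line, so 90k lines = 90k triples per chunk
split -l 90000 data.nt chunk_
# load every chunk in the directory with the bundled DataLoader
java -cp blazegraph.jar com.bigdata.rdf.store.DataLoader -namespace kb journal.properties /path/to/chunks/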
I'm using the latest blazegraph.jar from SourceForge. The full command line, properties file, and partial terminal output are below.
Any help/suggestions would be appreciated. Thanks!