blazegraph / database

Blazegraph High Performance Graph Database

Data corruption in Blazegraph database #114

Open · smalyshev opened this issue 5 years ago

smalyshev commented 5 years ago

Blazegraph started producing these errors for us today, during a routine update cycle:

22:57:10.414 [com.bigdata.journal.Journal.executorService1571] ERROR com.bigdata.rdf.spo.SPORelation IP: UA: - java.util.concurrent.ExecutionException: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at com.bigdata.io.writecache.WriteCache.write(WriteCache.java:977)
Wrapped by: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at com.bigdata.btree.DefaultEvictionListener.doEviction(DefaultEvictionListener.java:198)
Wrapped by: java.util.concurrent.ExecutionException: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
22:57:12.010 [qtp1321640594-1977] ERROR c.b.r.sail.webapp.BigdataRDFServlet IP:localhost UA:Jetty/9.4.z-SNAPSHOT - cause=java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: org.openrdf.repository.RepositoryException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.bigdata.btree.EvictionError: java.lang.AssertionError: Record exists for offset in cache: offset=2147483616, query=SPARQL-UPDATE: updateStr=# Clear out of date site links
....

The query is a regular (rather big) SPARQL Update which the Wikidata Query Service Updater sends.

The error seems to be persistent - if I restart, I get this:

00:18:48.261 [com.bigdata.rdf.sail.webapp.BigdataRDFContext.queryService25] ERROR com.bigdata.journal.Name2Addr IP: UA: - l.name: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP
java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at com.bigdata.io.writecache.WriteCache.write(WriteCache.java:977)
Wrapped by: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP
        at com.bigdata.journal.Name2Addr$CommitIndexTask.call(Name2Addr.java:578)
Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
00:18:48.262 [com.bigdata.rdf.sail.webapp.BigdataRDFContext.queryService25] ERROR com.bigdata.journal.Name2Addr IP: UA: - l.name: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS
java.lang.AssertionError: Record exists for offset in cache: offset=2147483616
        at com.bigdata.io.writecache.WriteCache.write(WriteCache.java:977)
Wrapped by: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS
        at com.bigdata.journal.Name2Addr$CommitIndexTask.call(Name2Addr.java:578)
Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
00:18:49.365 [qtp1321640594-32] ERROR c.b.r.sail.webapp.BigdataRDFServlet IP:localhost UA:Jetty/9.4.z-SNAPSHOT - cause=java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}, query=SPARQL-UPDATE: updateStr=# Clear out of date site links
... skipped query ....
com.bigdata.util.concurrent.ExecutionExceptions: 2 errors : [java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.OSP, java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not commit index: name=wdq.spo.POS]
        at com.bigdata.journal.Name2Addr.handleCommit(Name2Addr.java:860)
Wrapped by: java.lang.RuntimeException: nerrors=2
        at com.bigdata.journal.Name2Addr.handleCommit(Name2Addr.java:860)
Wrapped by: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}
        at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:3134)                                                            
Wrapped by: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, 
checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}                                       
        at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1080)                                             
Wrapped by: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}                                                                                                                                          
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)                                                                      
Wrapped by: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: nerrors=2: lastRootBlock=rootBlock{ rootBlock=1, challisField=41668719, version=3, nextOffset=52105301987554934, localTime=1546901828460 [Monday, January 7, 2019 10:57:08 PM UTC], firstCommitTime=1510337399915 [Friday, November 10, 2017 6:09:59 PM UTC], 
lastCommitTime=1546901828417 [Monday, January 7, 2019 10:57:08 PM UTC], commitCounter=41668719, commitRecordAddr={off=NATIVE:-2145194496,len=422}, commitRecordIndexAddr={off=NATIVE:-526146533,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=8454737611466036, metaStartAddr=13189238, storeType=RW, uuid=84e0fc40-3d8f-47fb-8e01-c9899e2730f3, offsetBits=42, checksum=952184268, createTime=1510337398696 [Friday, November 10, 2017 6:09:58 PM UTC], closeTime=0}                                                                                                 
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)         

and all write operations seem to fail.

Any ideas what's going on and why it could have happened?

thompsonbry commented 5 years ago

Adding Martyn.

Bryan

thompsonbry commented 5 years ago

Stas,

Here is where the exception is being thrown:

https://github.com/blazegraph/database/blob/64bb8ca92eb063d45d68ec920a34b5321b80e2fe/bigdata-core/bigdata/src/java/com/bigdata/io/writecache/WriteCache.java#L978

Obviously, this is an unexpected condition. Further, I would expect this problem to NOT be restart durable.

I suggest:

(a) Check the logs after the restart. Are any other errors being reported?

(b) The initial exception you reported above is for commitCounter=41668719. Is this still the most recent commit point on the database? (You can use DumpJournal to peek at the root blocks - and in fact I believe that there is a servlet API for this as well. A sketch is given below.)

(c) If the commit counter HAS NOT CHANGED, then there might be a "bad commit" and you MIGHT be able to roll back to the last commit point. This is not an option to be taken lightly, and we should probably talk before you attempt this. It basically discards the most recent commit in favor of the previous commit. There are com.bigdata.journal.Options which will let you effect this rollback. If there is only one bad commit, then this could get you back up and running.

(d) If you are still having commits roll through, but just not some specific update, then the problem might somehow be triggered by that specific update. This would be a best case scenario, since it would imply that the database is still working correctly but is unable to apply a specific update. In this case, you might be able to break down the update and work around the problem.

(e) A possible scenario leading to this situation would be a breakdown in the commit logic such that some data remained buffered and was then flushed through by the next transaction. This sort of thing could arise from a concurrent modification of a B+Tree. However, there are no known bugs around this.
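For reference, here is roughly how one might run DumpJournal for step (b). This is only a sketch: the jar name, journal path, and exact flags are placeholders and should be checked against your installation.

    # Sketch: inspect the root blocks / commit points of a journal file.
    # Run with the server stopped, or against a copy of the journal.
    java -cp blazegraph.jar com.bigdata.journal.DumpJournal /path/to/wikidata.jnl

    # To also walk the index pages (slow on a journal this size), DumpJournal takes
    # further options such as -pages; check its usage output for your release.
    java -cp blazegraph.jar com.bigdata.journal.DumpJournal -pages /path/to/wikidata.jnl

    # There is reportedly also a status servlet variant (e.g. GET /status?dumpJournal
    # on the NanoSparqlServer); verify that against your deployment before relying on it.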

Bryan

smalyshev commented 5 years ago

(a) check the logs after the restart. Are any other errors being reported?

Yes, the second log is after the restart. No writes seem to be successful after that error happened.

the initial exception you reported above is for commitCounter=41668719. Is this still the most recent commit point on the database? (You can use DumpJournal to peek at the root blocks - and in fact I believe that there is a servlet API for this as well.)

dumpJournal says:

There are 2 commit points.
CommitRecord{timestamp=1546901828417, commitCounter=41668719, roots=[-2259782070534405924, -9213540186699333292, -2259782126368980696, -211123412402045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

I assume this means it's still 41668719? Full copy is at https://gist.github.com/smalyshev/b04d1366fdffb0fffc27c4a95901caed

We can copy the DB from another server, but that is time-consuming, so if we could easily restore the DB to working order, it would be preferable. Since we keep the last update timestamp in the database, the updater could then re-apply updates from that point forward.

thompsonbry commented 5 years ago

Let’s talk in 5m. Ok?

thompsonbry commented 5 years ago

What was the last successful transaction?

thompsonbry commented 5 years ago

See https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/journal/Options.java#L433 for the ALTERNATE_ROOT_BLOCK option

smalyshev commented 5 years ago

What was the last successful transaction?

Unfortunately, I don't know the exact content of the last transaction. We do not log update queries, since they are huge and there are a lot of them. It was probably the update before the one that caused the error, but it's very hard now to know exactly what it contained.

smalyshev commented 5 years ago

Got another server crashing with exactly the same symptoms, three hours later... Journal dump is here: https://gist.github.com/smalyshev/bf19825ea984371a9e69666b26f23bd8. Commit counter in error messages matches the one in root block #1, just like in the first server.

thompsonbry commented 5 years ago

You can test for data corruption using DumpJournal, e.g. by having it visit all index pages.

The root blocks are at the head of the file, along with a few magic bytes. I am pretty sure that the file header is fixed size, maybe 512 bytes; see the FileMetadata class and the code paths involving the RWStore there for the actual file header dimensions. As long as you start up in a read-only mode, you could use dd to copy those bytes somewhere and then restore them if you needed to, and see how the data looks with the alternate root block as well (you might also be able to start up in a read-only mode using the alternate root block - I do not recall offhand whether that is disallowed). But if you use the alternate root block and then allow any commits to occur, then you are locked into the new commit history.
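To make the dd idea concrete, a sketch (the journal path is a placeholder, and the amount copied is a deliberately generous guess given the "maybe 512 bytes" above):

    # Back up the journal's file header (magic bytes plus both root blocks) before
    # experimenting. Only do this with Blazegraph stopped or opened read-only.
    dd if=/srv/wdqs/wikidata.jnl of=/srv/wdqs/journal-header.bak bs=4096 count=1

    # Restore the saved header later if needed; conv=notrunc rewrites only the first
    # 4096 bytes and leaves the rest of the file untouched.
    dd if=/srv/wdqs/journal-header.bak of=/srv/wdqs/wikidata.jnl bs=4096 count=1 conv=notrunc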

I strongly encourage you to figure out what updates are causing this issue. You might want to put the servers into a read only mode and suspend updates until you figure that out and we figure out the root cause.

Bryan

thompsonbry commented 5 years ago

Your other server has four commit points available. This means that it has not reclaimed some deleted data yet. This difference is likely due to some read transactions being open while the last updates went through, resulting in deferred recycling. Thus the binary data will be slightly different on the two servers (you are handling logical replication at the application level, not the database level, which means the recycling patterns will be different, so the specific slots in use will not be the same and the number of reserved slots could differ as well). Since the data on disk are not identical, but we are hitting the same issue at the same commit point, this implies that the problematic code path might be specific to some update, which we would really want to identify.

Do both servers see the same sequence of logical updates?

It could also be some internal limit not previously identified.

Bryan

smalyshev commented 5 years ago

Do both servers see the same sequence of logical updates?

Not exactly. The update sequence is the same (it comes from the same event channel) but the starting points may differ, so for example one server can have two batches of updates 1,2,3 and 4,5,6 and another can have 1,2 and 3,4,5,6, etc. The end result should be the same (almost, since we also store update timestamps per entry, which may differ slightly) but the batching may differ.

It could also be some internal limit not previously identified.

File sizes for both "bad" servers are 806G. But we have other servers with the same file size, and they are just fine.

Your other server has four commit points available. This means that it has not reclaimed some deleted data yet. This difference is likely due to some read transactions being open while the last updates went through, resulting in deferred recycling.

When should this recycling be happening? It has been hours and a couple of restarts (so no open transactions are possible) and there are still four commit points there. I checked on other active servers and there is a wide variety of commit point counts there - from 2 to 25. Not sure what that means.

I strongly encourage you to figure out what updates are causing this issue.

That might not be easy, given the above. I can see at which point the failure happened - i.e., I could guess the first item in the update (though since I only have second resolution and there are 5-10 items updated per second, there is already some uncertainty) - but I can't know the exact content: there could be more or fewer items in the update, depending on various random network events. Also, since the failures happened on two servers 5 hours apart, it can't be the same exact update - the probability that the same exact items were updated in the same exact sequence at that interval is vanishingly small. I'd rather suspect some kind of limit or overflow in some storage or cache, or some "magic" value that produces the failure when it is randomly hit.

thompsonbry commented 5 years ago

Can you enable update logging on another server?

Recycling is how the RWStore puts slots from deleted data back into play. I rather doubt there is an issue with the recycling. I was pointing out the number of commit points available mainly in terms of the options they might represent.

Recycling will not occur until a commit goes through. So it will remain at 4 commits until the server gets beyond this update issue.

smalyshev commented 5 years ago

Can you enable update logging on another server?

Yes, but only for a short time (it generates a ton of data) and probably not today (it needs access which I don't have to enable, and the people who have that access are offline now - but we could do it tomorrow morning if needed). What would you like to log, though? The working servers are now way past the point where the failing servers failed (and they didn't fail, obviously), so their data would be useless. I could of course log updates for the failing servers once we restore them, and see if that update reproduces the problem again, but as I said, I think it's very unlikely that the content of the update matters much, since two servers failed on different updates with a 5-hour delta.

thompsonbry commented 5 years ago

Stas,

Here is a summary of our call this morning. First, the underlying root cause is that the maximum number of allocators has been reached. This limit is 256 * 1024 and is based on a 19-bit address as documented at the bottom of [1]. Each allocator manages a number of allocation slots (8k slots). For a given allocator, all those slots are the same size. Based on the dump journal reports, the vast majority of these allocators are for 64 and 128 byte allocation slots. Those allocation slots are being used for the Id2Term index to store the lexical form of the dictionary entry outside of the leaf pages as separate records on the journal. I have outlined several things which you can do to address this problem, but these will require a reload of the data.

  1. Inlining - is there anything else that can be inlined? Anything inlined will not be in the dictionary. A smaller dictionary will improve write performance.

  2. Ensure ID2TERM is NOT using raw records. This is extremely important. See https://github.com/blazegraph/database/blob/BLAZEGRAPH_RELEASE_2_1_5_RC/bigdata-core/bigdata-rdf/src/java/com/bigdata/rdf/lexicon/LexiconRelation.java#L1351 . This will radically reduce the #of 64 and 128 byte allocation slots in use since the data will be stored directly in the leaves of the ID2TERM index instead. This means that you will be able to have significantly more data in the backing RWStore instance since the allocators will be for larger slots corresponding to index pages rather than small slots corresponding to individual lexical items.

  3. Consider using INLINE_TEXT_LITERALS to force the inlining of strings into the statement indices. Inlined strings will not be entered into the dictionary at all (which will improve write performance but will also result in a somewhat larger footprint on the disk). Instead, the inlined strings will appear directly in the statement indices. Note that strings longer than 256 bytes (by default) go onto the BLOBS index rather than the Term2ID and ID2Term indices. See the following options in AbstractTripleStore.Options: {INLINE_TEXT_LITERALS, MAX_INLINE_TEXT_LENGTH, and BLOBS_THRESHOLD}. (A configuration sketch follows this list.)

  4. Separate the two large data sets (wikidata and categories) into different Blazegraph instances (optional, adds flexibility)

  5. Ensure 32-bit pointers in Java (slightly less than 32G heaps) - this will improve the Java memory performance.
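As a hedged illustration of item 3, the knobs would normally go in the properties used to create the namespace (e.g. RWStore.properties). The keys and values below are assumptions based on AbstractTripleStore.Options and should be verified against your release; item 2 (raw records on ID2TERM) is controlled through the index metadata at namespace creation time and is not shown here. Either change requires reloading the data:

    # Sketch only; verify property names in AbstractTripleStore.Options for your release.
    # Inline short literals directly into the statement indices instead of the dictionary.
    com.bigdata.rdf.store.AbstractTripleStore.inlineTextLiterals=true
    com.bigdata.rdf.store.AbstractTripleStore.maxInlineTextLength=256
    # Literals above this size (256 bytes is the documented default) go to the BLOBS index.
    com.bigdata.rdf.store.AbstractTripleStore.blobsThreshold=256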

Thanks,

Bryan

[1] https://wiki.blazegraph.com/wiki/index.php/FixedAllocators

smalyshev commented 5 years ago

I tried to remove the categories namespace, and it doesn't seem to have changed anything for the allocators. Before: https://gist.github.com/smalyshev/19095c87b600900811e7f87d22ea250b After: https://gist.github.com/smalyshev/4d99b5258542e0705a335b52c7805d50

I wonder why that is - does anything else need to be done?

thompsonbry commented 5 years ago

The #of allocators will not change. What will change is that you will have free slots in the existing allocators.

The database needs to roll through a commit before it can free the slots.

I do see the #of recycled slots going up. I suggest trying a small commit or two (insert and then delete a single triple) to ensure that the recycler collects all the newly freed slots.
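For example, a trivial insert-then-delete round trip through the SPARQL endpoint will drive a couple of commits. A sketch only: the endpoint URL and the "wdq" namespace are placeholders for your setup.

    # Force two small commits by inserting and then deleting a dummy triple.
    curl -X POST 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
         --data-urlencode 'update=INSERT DATA { <urn:recycler:probe> <urn:recycler:p> "x" }'

    curl -X POST 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
         --data-urlencode 'update=DELETE DATA { <urn:recycler:probe> <urn:recycler:p> "x" }'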

Bryan

webshark commented 2 years ago

Were you able to find a way to roll back a bad commit? I have a fresh server into which I am importing a Wikidata database, and midway through it stopped with this error:

Caused by: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -668674172594945626: lastRootBlock=rootBlock{ rootBlock=1, challisField=835, version=3, nextOffset=62398251232785478, localTime=1661485728195 [Friday, August 26, 2022 at 3:48:48 AM Coordinated Universal Time], firstCommitTime=1658240159557 [Tuesday, July 19, 2022 at 2:15:59 PM Coordinated Universal Time], lastCommitTime=1661485714508 [Friday, August 26, 2022 at 3:48:34 AM Coordinated Universal Time], commitCounter=835, commitRecordAddr={off=NATIVE:-42932493,len=422}, commitRecordIndexAddr={off=NATIVE:-158345759,len=220}, blockSequence=27784, quorumToken=-1, metaBitsAddr=47863639532635860, metaStartAddr=15961497, storeType=RW, uuid=f05859c7-286f-4e8e-8a91-593893119dcd, offsetBits=42, checksum=-586797382, createTime=1658240158542 [Tuesday, July 19, 2022 at 2:15:58 PM Coordinated Universal Time], closeTime=0}

I am now stuck in this position and can't continue

thompsonbry commented 2 years ago

Look here: https://blazegraph.com/database/apidocs/com/bigdata/journal/Options.html#ALTERNATE_ROOT_BLOCK

There are a few relevant options. You need to roll through a single commit after doing something like this to have both root blocks be valid again. E.g., a small write on the database, a commit, then change the options back to what they were and restart.
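To make that concrete, a sketch of the sequence. The property key shown is an assumption patterned on the other journal options; the authoritative name is the ALTERNATE_ROOT_BLOCK constant in com.bigdata.journal.Options, so verify it there for your release:

    # 1. Stop Blazegraph, then add the option to the journal properties
    #    (key name assumed; check Options.ALTERNATE_ROOT_BLOCK):
    com.bigdata.journal.AbstractJournal.alternateRootBlock=true
    # 2. Start the server and perform ONE small write so a commit goes through and
    #    both root blocks become valid again (e.g. insert and delete a single triple).
    # 3. Stop the server, remove the property, and restart with the original settings.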

You can read about the commit protocol on the wiki under the RWStore page. https://github.com/blazegraph/database/wiki/RWStore

Bryan

webshark commented 2 years ago

So I did the following (maybe it was incorrect):

I stopped the Blazegraph server, added this to the config: com.bigdata.journal.Options=ALTERNATE_ROOT_BLOCK, and started the server.

Then I tried to import one file (I didn't know the syntax for a simple update/commit). This failed in the end, after a few hours, with the same error as before:

Caused by: java.lang.RuntimeException: Problem with entry at -668674172594945626: lastRootBlock=rootBlock{ rootBlock=1, challisField=835, version=3, nextOffset=62398251232785478, localTime=1661485728195 [Friday, August 26, 2022 at 3:48:48 AM Coordinated Universal Time], firstCommitTime=1658240159557 [Tuesday, July 19, 2022 at 2:15:59 PM Coordinated Universal Time], lastCommitTime=1661485714508 [Friday, August 26, 2022 at 3:48:34 AM Coordinated Universal Time], commitCounter=835, commitRecordAddr={off=NATIVE:-42932493,len=422}, commitRecordIndexAddr={off=NATIVE:-158345759,len=220}, blockSequence=27784, quorumToken=-1, metaBitsAddr=47863639532635860, metaStartAddr=15961497, storeType=RW, uuid=f05859c7-286f-4e8e-8a91-593893119dcd, offsetBits=42, checksum=-586797382, createTime=1658240158542 [Tuesday, July 19, 2022 at 2:15:58 PM Coordinated Universal Time], closeTime=0}

Perhaps I misunderstood what to do.

thompsonbry commented 2 years ago
  1. The alternate root block should only be used for a single commit.
  2. I am not sure how the wikidata import breaks down into commits.
  3. There is something pretty odd going on.
  4. Can you provide a full backtrace for those error messages?

Overall, I might suspect flaky hardware. But I would like to see the full backtrace and see if that provides any clues.

Bryan

webshark commented 2 years ago

You can see a larger log: https://pastebin.com/JPCiBEvk

Do you have an example somewhere of a simple update/commit I could run?

With Wikidata, from what I was seeing, it was running all the updates/inserts and at the end tried to commit the changes.

Regarding hardware, I don't think it's that - it's a new setup. This is the 3rd time it has happened to me in an attempt to import the Wikidata dump (it's huge and takes a long time).

thompsonbry commented 2 years ago

Ok. The root cause in that stack trace is a problem when trying to process deferred frees (recycling). I've clipped that part of the trace below. This is not a root block problem (for example, it is not a checksum issue with the root block). A copy of the old root blocks is kept for the recycler, and they are eventually pruned out along with the list of addresses that were recycled in the RWStore.

Your first post was at commit counter 833, if I recall the error message correctly. This one is at 835. But if these are not the root blocks used for the commit, but instead root blocks used by the recycler, then I really can't say anything about whether or not it is making progress.

I cannot say offhand how things got into this state, but I also can't see a way forward. I would advise reinstalling the Wikidata setup from scratch, making sure that you clean out the old files and that there is not a second Blazegraph process running (this should be disallowed by file locks). Then, if you hit any issue again, please make sure to post the first error.

I would also be happy to take a peek at the first error message from the logs if you can find that. We might be able to find some indication of what went wrong. The log you posted clearly indicates an issue within the logged deferred frees and an inability to recycle those allocations.

com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:5411) ... 21 more Caused by: java.lang.IllegalStateException: Error reading from WriteCache addr: 827585986560 length: 12, writeCacheDebug: No WriteCache debug info

Bryan

webshark commented 2 years ago

OK... sad to hear :( It was running smoothly for 5 weeks and then crashed. I will create a fresh server - just to make sure no "junk" or "bad sectors" are left over.

thompsonbry commented 2 years ago

If you do have the earliest error message in the log, I am happy to look at that. I've also copied Martyn Cutcher. He might be able to glean something from the message:

com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:5411) ... 21 more Caused by: java.lang.IllegalStateException: Error reading from WriteCache addr: 827585986560 length: 12, writeCacheDebug: No WriteCache debug info

Bryan

webshark commented 2 years ago

The problem is that the script continues running until you stop it, so it kept on going, failing over and over again.