cznic / exp

github.com/cznic/exp has moved to modernc.org/exp
https://godoc.org/modernc.org/exp
18 stars 2 forks source link

lldb.remove panics #3

Closed tgulacsi closed 10 years ago

tgulacsi commented 10 years ago

Hi, I'm testing camlistore, and after some camput-ting, it panics:

2013/09/20 15:58:48 http: panic serving 127.0.0.1:35186: runtime error: invalid memory address or nil pointer dereference goroutine 1718 [running]: net/http.func·007() /usr/lib/go/src/pkg/net/http/server.go:1022 +0xac camlistore.org/pkg/blobserver/diskpacked.func·001() /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/blobserver/diskpacked/diskpacked.go:339 +0xd7 camlistore.org/third_party/github.com/cznic/exp/lldb.(_lst).remove() /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/falloc.go:1861 +0xa camlistore.org/third_party/github.com/cznic/exp/lldb.(_lst).removeBack(0xc200163ab8, 0x1c) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/falloc.go:1876 +0x2e camlistore.org/third_party/github.com/cznic/exp/lldb.(_Allocator).cadd(0xc2001639a0, 0xc2002c3e7b, 0x22, 0x22, 0x53, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/falloc.go:380 +0x10e camlistore.org/third_party/github.com/cznic/exp/lldb.(_Allocator).Alloc(0xc2001639a0, 0xc2002c3e7b, 0x22, 0x22, 0x53, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/falloc.go:420 +0x1f6 camlistore.org/third_party/github.com/cznic/exp/lldb.btreeDataPage.setContent(0xc200bf90cd, 0x2f8, 0x20151, 0xc200282bc0, 0xc2001639a0, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1272 +0x50a camlistore.org/third_party/github.com/cznic/exp/lldb.btreeDataPage.setKey(0xc200bf9000, 0x3c5, 0x2021e, 0xc200282bc0, 0xc2001639a0, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1308 +0xa9 camlistore.org/third_party/github.com/cznic/exp/lldb.btreeDataPage.insertItem(0xc200bf9000, 0x3c5, 0x2021e, 0xc200282bc0, 0xc2001639a0, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1350 +0x1aa camlistore.org/third_party/github.com/cznic/exp/lldb.btree.put2(0x1, 0xc200f04000, 0x1010f, 0x2021e, 0xc200282bc0, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1888 +0xecf camlistore.org/third_party/github.com/cznic/exp/lldb.btree.put(0x1, 0xc200f04000, 0x1010f, 0x2021e, 0xc200282bc0, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1784 +0x122 camlistore.org/third_party/github.com/cznic/exp/lldb.(_BTree).Set(0xc2001c4000, 0xc2002c3e70, 0x2d, 0x2d, 0xc200776140, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:429 +0x1ca camlistore.org/third_party/github.com/cznic/kv.(_DB).Set(0xc200176a50, 0xc2002c3e70, 0x2d, 0x2d, 0xc200776140, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/kv/kv.go:756 +0xcb camlistore.org/pkg/index/kvfile.(_kvis).Set(0xc20011f790, 0xc2002c3e40, 0x2d, 0xc200776120, 0x11, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/index/kvfile/kvfile.go:84 +0xc3 camlistore.org/pkg/blobserver/diskpacked.(_storage).finishAppend(0xc200181510, 0xc200231940, 0xc2007763c0, 0x12523, 0x12523, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/blobserver/diskpacked/diskpacked.go:342 +0x28e camlistore.org/pkg/blobserver/diskpacked.(_storage).append(0xc2001566c0, 0xc200231940, 0xc2007763c0, 0x12523, 0xc20056b5a0, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/blobserver/diskpacked/diskpacked.go:385 +0x5d2 camlistore.org/pkg/blobserver/diskpacked.(_storage).ReceiveBlob(0xc2001566c0, 0xc200231940, 0xc2007763c0, 0xc20056bbd0, 0xc2003c57e0, ...) /home/tgulacsi/src/camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/blobserver/diskpacked/diskpacked.go:293 +0x1a0

It seems that l.back is nil. Any ideas?

tgulacsi commented 10 years ago

I don't know whether the real issue is in lldb, or in the usage (camlistore), but this skips the panic:

diff --git a/third_party/github.com/cznic/exp/lldb/falloc.go b/third_party/github.com/cznic/exp/lldb/falloc.go
index 103a469..01c13b8 100644                                              
--- a/third_party/github.com/cznic/exp/lldb/falloc.go                      
+++ b/third_party/github.com/cznic/exp/lldb/falloc.go                      
@@ -377,7 +377,10 @@ func (a *Allocator) cadd(b []byte, h int64) {         
    }                                                                      

    // cache full                                                          
-   delete(a.m, a.cache.put(a.lru.removeBack()).h)                         
+   tbd := a.cache.put(a.lru.removeBack())                                 
+   if tbd != nil {                                                        
+       delete(a.m, tbd.h)                                                 
+   }                                                                      
    n := a.cache.get(len(b))                                               
    n.h = h                                                                
    copy(n.b, b)                                                           
@@ -1831,6 +1834,9 @@ func (c *cache) size() (sz int64) {                  
 }                                                                         

 func (c *cache) put(n *node) *node {                                      
+   if n == nil {                                                          
+       return n                                                           
+   }                                                                      
    s := *c                                                                
    n.b = n.b[:cap(n.b)]                                                   
    lenb := len(n.b)                                                       
@@ -1848,6 +1854,9 @@ type lst struct {                                    
 }                                                                         

 func (l *lst) pushFront(n *node) *node {                                  
+   if n == nil {                                                          
+       return n                                                           
+   }                                                                      
    if l.front == nil {                                                    
        l.front, l.back, n.prev, n.next = n, n, nil, nil                   
        return n                                                           
@@ -1858,6 +1867,9 @@ func (l *lst) pushFront(n *node) *node {             
 }                                                                         

 func (l *lst) remove(n *node) *node {                                     
+   if n == nil {                                                          
+       return n                                                           
+   }                                                                      
    if n.prev == nil {                                                     
        l.front = n.next                                                   
    } else {                                                               
cznic commented 10 years ago

I'm bit confused by the stack trace. The panic is (I suppose normally) in the top most line, isn't it? But that's not in 'lldb'.

Wrt the 2nd message, please send me the patch by mail (plain text or as attachment). Github (de)formats it in a way which I'm not sure how to interpret correctly, so if it fixes a bug I cannot see it now.

Also a reproduction case is probably needed for me to investigate more.

I'll try to have a look on this. Probably not before tomorrow evening. Leaving office soon and have some family program. Anyway, I expect that the fix will come hopefully during this weekend.

tgulacsi commented 10 years ago

It is from a (re)panic to log that the writer and reader I use is not nil. I'll send the patch - but is not a big thing, just "if n == nil { return n }" wherever n *node is an argument.

2013/9/20 cznic notifications@github.com

I'm bit confused by the stack trace. The panic is (I suppose normally) in the top most line, isn't it? But that's not in 'lldb'.

Wrt the 2nd message, please send me the patch by mail (plain text or as attachment. Github (de)formats it in a way which I'm not sure how to interpret correctly, so if it fixes a bug I cannot see it now.

Also a reproduction case is probably needed for me to investigate more.

I'll try to have a look on this. Probably not before tomorrow evening. Leaving office soon and had some family program. Anyway, I expect that the fix will come hopefully during this weekend.

— Reply to this email directly or view it on GitHubhttps://github.com/cznic/exp/issues/3#issuecomment-24813263 .

tgulacsi commented 10 years ago

Ok, I've attached a minimal patch which fixes it, and here is the raw panic:

2013/09/20 21:36:25 http: panic serving 127.0.0.1:60043: runtime error: invalid memory address or nil pointer dereference goroutine 1756 [running]: net/http.func·007()

    /usr/lib/go/src/pkg/net/http/server.go:1022

+0xac camlistore.org/third_party/github.com/cznic/exp/lldb.(*lst).remove()

    /home/gthomas/src/

camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/falloc.go:1861+0xa camlistore.org/third_party/github.com/cznic/exp/lldb.(_lst).removeBack(0xc200298118, 0x7d) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/falloc.go:1876+0x2e camlistore.org/third_party/github.com/cznic/exp/lldb.(_Allocator).cadd(0xc200298000, 0xc2003d1b1b, 0x22, 0x22, 0x12e, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/falloc.go:380+0x10e camlistore.org/third_party/github.com/cznic/exp/lldb.(_Allocator).Alloc(0xc200298000, 0xc2003d1b1b, 0x22, 0x22, 0x12e, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/falloc.go:420+0x1f6 camlistore.org/third_party/github.com/cznic/exp/lldb.btreeDataPage.setContent(0xc20088264b, 0x804, 0x1fbd3, 0xc20028b040, 0xc200298000, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1272+0x50a camlistore.org/third_party/github.com/cznic/exp/lldb.btreeDataPage.setKey(0xc200882000, 0xe4f, 0x2021e, 0xc20028b040, 0xc200298000, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1308+0xa9 camlistore.org/third_party/github.com/cznic/exp/lldb.btreeDataPage.insertItem(0xc200882000, 0xe4f, 0x2021e, 0xc20028b040, 0xc200298000, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1350+0x1aa camlistore.org/third_party/github.com/cznic/exp/lldb.btree.put2(0x1, 0xc200f26000, 0x1010f, 0x2021e, 0xc20028b040, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1888+0xecf camlistore.org/third_party/github.com/cznic/exp/lldb.btree.put(0x1, 0xc200f26000, 0x1010f, 0x2021e, 0xc20028b040, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:1784+0x122 camlistore.org/third_party/github.com/cznic/exp/lldb.(_BTree).Set(0xc200294210, 0xc2003d1b10, 0x2d, 0x2d, 0xc2003d4b40, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/exp/lldb/btree.go:429+0x1ca camlistore.org/third_party/github.com/cznic/kv.(_DB).Set(0xc200296000, 0xc2003d1b10, 0x2d, 0x2d, 0xc2003d4b40, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/third_party/github.com/cznic/kv/kv.go:756+0xcb camlistore.org/pkg/index/kvfile.(_kvis).Set(0xc200297040, 0xc2003d1ab0, 0x2d, 0xc2003d4b20, 0x11, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/index/kvfile/kvfile.go:84+0xc3 camlistore.org/pkg/blobserver/diskpacked.(_storage).finishAppend(0xc20028c7e0, 0xc20024a980, 0xc2006b9160, 0x11bb5, 0x11bb5, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/blobserver/diskpacked/diskpacked.go:324+0x31f camlistore.org/pkg/blobserver/diskpacked.(_storage).append(0xc20028c120, 0xc20024a980, 0xc2006b9160, 0x11bb5, 0xc20078e180, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/blobserver/diskpacked/diskpacked.go:373+0x605 camlistore.org/pkg/blobserver/diskpacked.(*storage).ReceiveBlob(0xc20028c120, 0xc20024a980, 0xc2006b9160, 0xc20078e600, 0xc20045a690, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/blobserver/diskpacked/diskpacked.go:280+0x1a0 camlistore.org/pkg/blobserver.receive(0xc20078e8a0, 0xc20028c120, 0xc20024a980, 0xc2006b9160, 0xc20078e780, ...) /home/gthomas/src/ camlistore.org/tmp/build-gopath/src/camlistore.org/pkg/blobserver/receive.go:41+0x131 camlistore.org/pkg/blobserver.Receive(0xc20078e8a0, 0xc20028c120, 0xc20024a980, 0xc2006b9160, 0xc2 2013/09/20 21:36:25 setting [sha1-80dcdd5dd385a31b49eac79bbb14b44e85581f38; 73634 bytes] in /home/gthomas/var/camlistore/packs 2013/09/20 21:36:25 setting [sha1-80dcdd5dd385a31b49eac79bbb14b44e85581f38; 73634 bytes] in /home/gthomas/var/camlistore/packs/index-sqlite

2013/9/20 Tamás Gulácsi gt-dev@gthomas.homelinux.org

It is from a (re)panic to log that the writer and reader I use is not nil. I'll send the patch - but is not a big thing, just "if n == nil { return n }" wherever n *node is an argument.

2013/9/20 cznic notifications@github.com

I'm bit confused by the stack trace. The panic is (I suppose normally) in the top most line, isn't it? But that's not in 'lldb'.

Wrt the 2nd message, please send me the patch by mail (plain text or as attachment. Github (de)formats it in a way which I'm not sure how to interpret correctly, so if it fixes a bug I cannot see it now.

Also a reproduction case is probably needed for me to investigate more.

I'll try to have a look on this. Probably not before tomorrow evening. Leaving office soon and had some family program. Anyway, I expect that the fix will come hopefully during this weekend.

— Reply to this email directly or view it on GitHubhttps://github.com/cznic/exp/issues/3#issuecomment-24813263 .

cznic commented 10 years ago

A reproduction case is needed. I want to write a test triggering the failure even before fixing it. I'll try to figure it out somehow, but if there are "steps to reproduce", it would be fantastic ;-)

tgulacsi commented 10 years ago

I cannot reproduce it... Maybe a false alarm?

How could the a.lru.back be nil? (a.lru.removeBack(), line 380 in falloc.go).

2013/9/21 cznic notifications@github.com

A reproduction case is needed. I want to write a test triggering the failure even before fixing it. I'll try to figure it out somehow, but if there are "steps to reproduce", it would be fantastic ;-)

— Reply to this email directly or view it on GitHubhttps://github.com/cznic/exp/issues/3#issuecomment-24858149 .

cznic commented 10 years ago

I put a lot of nil checking panics and ran many test for most of the weekend and was not able to reproduce it neither. Moreover, I agree that the OP stack trace is a mystery - to me at least. The 'removeBack' is called only when the cache is full, ie nonempty ie back != nil.

There's always some nonzero possibility of a RAM/CPU fault and I cannot figure out any other way how it could have happened.

Of course, the "cannot reproduce" does not mean a bug has been ruled out for sure.

I'll keep this issue open for some timeout (say a week or so). Please let me know if it happens again. Would be a important indicator if that would happen on a diferent machine than in the first observed case.

tgulacsi commented 10 years ago

Thanks for your testing! I'll report iff I can reproduce that panic again.

2013/9/23 cznic notifications@github.com

I put a lot of nil checking panics and ran many test for most of the weekend and was not able to reproduce it neither. Moreover, I agree that the OP stack trace is a mystery - to me at least. The 'removeBack' is called only when the cache is full, ie nonempty ie back != nil.

There's always some nonzero possibility of a RAM/CPU fault and I cannot figure out any other way how it could have happened.

Of course, the "cannot reproduce" does not mean a bug has been ruled out for sure.

I'll keep this issue open for some timeout (say a week or so). Please let me know if it happens again. Would be a important indicator if that would happen on a diferent machine than in the first observed case.

— Reply to this email directly or view it on GitHubhttps://github.com/cznic/exp/issues/3#issuecomment-24903606 .

cznic commented 10 years ago

@tgulacsi Closing for "timeout". However, please note that there was one bug, not known to me if related to this "heisenbug" issue, fixed recently in lldb 0. I suggest to sync your lldb repo if you haven't already done that.

tgulacsi commented 10 years ago

Ok, thanks!

2013/11/5 cznic notifications@github.com

@tgulacsi https://github.com/tgulacsi Closing for "timeout". However, please note that there was one bug, not known to me if related to this "heisenbug" issue, fixed recently in lldb 0https://github.com/cznic/exp/commit/44f1cc3f1eb4626d0ae89dfdb3fe6c7a888a0171. I suggest to sync your lldb repo if you haven't already done that.

— Reply to this email directly or view it on GitHubhttps://github.com/cznic/exp/issues/3#issuecomment-27755206 .

mpl commented 10 years ago

Hi Jan,

as far as I can see, we do have https://github.com/cznic/exp/commit/44f1cc3f1eb4626d0ae89dfdb3fe6c7a888a0171 integrated in our third_party, but we're still seeing this panic.

Please let us know if we can provide more info to help debugging that.

Cheers, Mathieu

cznic commented 10 years ago

@mpl

I spent half of yesterday trying to kill this bug to no avail. Of course, the quest will continue. All I need is a repro case and some coffee ;-)

tgulacsi commented 10 years ago

Use camlistore with diskpacked backend (example config:

{ "listen": "0.0.0.0:3179", "shareHandler": true, "https": false, "identity": "974EA38B", "identitySecretRing": "/home/gthomas/.config/camlistore/identity-secring.gpg", "kvIndexFile": "/home/gthomas/var/camlistore/camli-index.kvdb", "blobPath": "/home/gthomas/var/camlistore/packs", "packBlobs": true, "mysql": "", "mongo": "", "postgres": "", "sqlite": "", "s3": "", "replicateTo": [], }

and camput file --permanode ~/src/camlistore.org

(lot of files, and lldb panics on remove)

cznic commented 10 years ago

Thanks! Will dive into it tomorrow morning (CET), I'm just about to leave the office for today.

mpl commented 10 years ago

oh, I thought you needed a lower-level repro case than that. If you can use camlistore/camput then yes, it's pretty trivial to reproduce. I don't even think I'm using diskpacked.

cznic commented 10 years ago

Yep, I was always trying to trigger the crash by exercising kv in tests to no avail. Never tried camlistore directly (TBH don't even know how).

bradfitz commented 10 years ago

I hit the crash reliably without diskpacked (which is used if you say "packBlobs": true, which I don't have listed).

cznic commented 10 years ago

@all

FYI: Just managed to trigger the crash locally for the first time; using the method provided by @tgulacsi . Thanks!

cznic commented 10 years ago

Reopening, fix will follow.

bradfitz commented 10 years ago

Excellent! I look forward to it.

tgulacsi commented 10 years ago

Great!

But just to understand: where is that cacheAudit (will be) called? I do see it in the test, but will that solve anything?

cznic commented 10 years ago

The purpose of cacheAudit is 1) to enable a reproducible failing case (commit 69a1875), 2) to prove that the fix (commit bb1bc28) really solves the problem. The audit is enabled at 0, ie. only when testing, and called at 1

The fix per se doesn't use cacheAudit. The fix was to get rid of a very stupid premature optimization in Realloc which screwed horribly the cache invariants. It's surprising that it worked at all - even if only sometimes.

mpl commented 10 years ago

Brilliant. Thanks for the quick work Jan.

tgulacsi commented 10 years ago

Thanks Jan, [camlistored] now works without hiccups!