filecoin-project / eudico

lotus, but also other things
Other
19 stars 14 forks source link

Data race in subnet manager's exchange protocol #235

Closed dnkolegov closed 2 years ago

dnkolegov commented 2 years ago

Describe the Bug

Data race: read at eudico/chain/consensus/hierarchical/subnet/submgr/exchange.go:62ref and write at eudico/chain/consensus/hierarchical/subnet/submgr/exchange.go:69 ref

Logging Information

==================
WARNING: DATA RACE
Read at 0x00c0051b1350 by goroutine 1318:
  runtime.mapaccess1_faststr()
      /opt/homebrew/Cellar/go/1.18.2/libexec/src/runtime/map_faststr.go:13 +0x41c
  github.com/filecoin-project/lotus/chain/consensus/hierarchical/subnet/submgr.(*Subnet).helloBack()
      /Users/alpha/Projects/eudico/chain/consensus/hierarchical/subnet/submgr/exchange.go:62 +0x1c8
  github.com/filecoin-project/lotus/chain/consensus/hierarchical/subnet/submgr.(*Subnet).handleHelloStream()
      /Users/alpha/Projects/eudico/chain/consensus/hierarchical/subnet/submgr/exchange.go:193 +0x634
  github.com/filecoin-project/lotus/chain/consensus/hierarchical/subnet/submgr.(*Subnet).handleHelloStream-fm()
      <autogenerated>:1 +0x48
  github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).SetStreamHandler.func1()
      /Users/alpha/go/pkg/mod/github.com/libp2p/go-libp2p@v0.19.3/p2p/host/basic/basic_host.go:573 +0x80
  github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStreamHandler.func1()
      /Users/alpha/go/pkg/mod/github.com/libp2p/go-libp2p@v0.19.3/p2p/host/basic/basic_host.go:416 +0x70

Previous write at 0x00c0051b1350 by goroutine 1335:
  runtime.mapaccess2_faststr()
      /opt/homebrew/Cellar/go/1.18.2/libexec/src/runtime/map_faststr.go:108 +0x43c
  github.com/filecoin-project/lotus/chain/consensus/hierarchical/subnet/submgr.(*Subnet).helloBack()
      /Users/alpha/Projects/eudico/chain/consensus/hierarchical/subnet/submgr/exchange.go:69 +0x230
  github.com/filecoin-project/lotus/chain/consensus/hierarchical/subnet/submgr.(*Subnet).handleHelloStream()
      /Users/alpha/Projects/eudico/chain/consensus/hierarchical/subnet/submgr/exchange.go:193 +0x634
  github.com/filecoin-project/lotus/chain/consensus/hierarchical/subnet/submgr.(*Subnet).handleHelloStream-fm()
      <autogenerated>:1 +0x48
  github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).SetStreamHandler.func1()
      /Users/alpha/go/pkg/mod/github.com/libp2p/go-libp2p@v0.19.3/p2p/host/basic/basic_host.go:573 +0x80
  github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStreamHandler.func1()
      /Users/alpha/go/pkg/mod/github.com/libp2p/go-libp2p@v0.19.3/p2p/host/basic/basic_host.go:416 +0x70

Goroutine 1318 (running) created at:
  github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStreamHandler()
      /Users/alpha/go/pkg/mod/github.com/libp2p/go-libp2p@v0.19.3/p2p/host/basic/basic_host.go:416 +0x6f4
  github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStreamHandler-fm()
      <autogenerated>:1 +0x48
  github.com/libp2p/go-libp2p/p2p/net/mock.(*peernet).handleNewStream.func1()
      /Users/alpha/go/pkg/mod/github.com/libp2p/go-libp2p@v0.19.3/p2p/net/mock/mock_peernet.go:90 +0x58

Goroutine 1335 (running) created at:
  github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStreamHandler()
      /Users/alpha/go/pkg/mod/github.com/libp2p/go-libp2p@v0.19.3/p2p/host/basic/basic_host.go:416 +0x6f4
  github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStreamHandler-fm()
      <autogenerated>:1 +0x48
  github.com/libp2p/go-libp2p/p2p/net/mock.(*peernet).handleNewStream.func1()
      /Users/alpha/go/pkg/mod/github.com/libp2p/go-libp2p@v0.19.3/p2p/net/mock/mock_peernet.go:90 +0x58
==================

Repo Steps

No response

dnkolegov commented 2 years ago

The bug is reproduced when Mir runs in the rootnet and in a subnet on the same machine with 4 eudico full nodes. After my investigation, I found two things.

First, is that there is the case when a user can run the HC in the following setting: /root -> EC /root/A ->Mir /root/A/B ->Mir In this case, peer ID will be the same in both subnets, and a data race can happen.

The second is that I changed the key mutex to the traditional mutex and the bug disappeared. If I change Keymutex to github.com/im7mortal/kmutex then the bug is still reproduced. So despite the fact that keymutex is suitable here, the race detector doesn't know about that semantics and will report a bug.

I would suggest using sync.Map or sync.RWMutex.

P.S. In the Mir reconfiguration draft I have implemented sync.Map.

@adlrocha what do you think?

dnkolegov commented 2 years ago

I am suggesting this fix - https://github.com/filecoin-project/eudico/pull/234/commits/bdd2590ae5258ddd5fddef0f26734b4483a38aed#diff-100f0122618ed5b5f4e40838698391b594f842011d057c947a39e5cdf1ec92b2

@adlrocha what do you think?

adlrocha commented 2 years ago

Yes, I think this definitely fixes it. I completely missed that race. Good catch, thanks! Feel free to close this issue once the PR is merged.

adlrocha commented 2 years ago

Fixed in #234