celo-org / celo-blockchain

Official repository for the golang Celo Blockchain
https://celo.org
GNU Lesser General Public License v3.0
560 stars 198 forks source link

Seg fault / panic syncing mainnet on v1.7.4 #2134

Closed aaronmboyd closed 3 days ago

aaronmboyd commented 1 year ago

Expected Behavior

Sync from genesis without panic

Actual Behavior

Panic and geth terminated. Resumes syncing normally after restart.

Steps to reproduce the behavior

Build celo-blockchain from source tag v1.7.4 Go: go1.19.2.linux-arm64, dependency in custom Dockfile Docker: 24.0.2 Running on: Ubuntu 22.04.2 LTS ARM64

Dockerfile:

FROM arm64v8/ubuntu:jammy as builder

RUN apt update && apt install gcc make musl-dev wget -y

RUN wget https://go.dev/dl/go1.19.2.linux-arm64.tar.gz
RUN rm -rf /usr/local/go && tar -C /usr/local -xzf go1.19.2.linux-arm64.tar.gz
ENV PATH $PATH:/usr/local/go/bin

ADD . /go-ethereum
WORKDIR /go-ethereum
RUN make geth-musl

FROM arm64v8/ubuntu:jammy
ARG COMMIT_SHA

COPY --from=builder /go-ethereum/build/bin/geth /usr/local/bin/
RUN echo $COMMIT_SHA > /version.txt
ADD scripts/run_geth_in_docker.sh /

EXPOSE 8545 8546 30303 30303/udp
ENTRYPOINT ["sh", "/run_geth_in_docker.sh"]

# Add some metadata labels to help programatic image consumption
ARG COMMIT=$COMMIT_SHA
ARG VERSION=""
ARG BUILDNUM=""

LABEL commit="$COMMIT" version="$VERSION" buildnum="$BUILDNUM"

Backtrace

INFO [06-19\|13:17:04.423] Imported new chain segment               blocks=190  txs=1079  mgas=243.352 elapsed=8.336s    mgasps=29.192  number=18,830,021 hash=234037..f78e82 age=2mo2d1h  dirty=74.86MiB
--
INFO [06-19\|13:17:07.629] Sending val enode share msg to proxy     func=sendValEnodeShareMsgs       proxy peer="Peer 16331f68e399c8af 10.192.11.7:30503" valAddresses length=119
INFO [06-19\|13:17:07.630] Skipping sending ValEnodesShareMsg b/c not validating func=sendValEnodesShareMsg
INFO [06-19\|13:17:12.431] Imported new chain segment               blocks=169  txs=889   mgas=206.133 elapsed=8.008s    mgasps=25.740  number=18,830,190 hash=3cd7bb..1c429a age=2mo2d1h  dirty=75.76MiB
INFO [06-19\|13:17:20.441] Imported new chain segment               blocks=221  txs=1171  mgas=267.872 elapsed=8.010s    mgasps=33.439  number=18,830,411 hash=570d7c..095d89 age=2mo2d59m dirty=76.94MiB
INFO [06-19\|13:17:28.969] Imported new chain segment               blocks=199  txs=820   mgas=223.748 elapsed=8.527s    mgasps=26.240  number=18,830,610 hash=63997b..6b01d5 age=2mo2d43m dirty=76.46MiB
INFO [06-19\|13:17:36.971] Imported new chain segment               blocks=145  txs=573   mgas=211.915 elapsed=8.002s    mgasps=26.481  number=18,830,755 hash=e204e2..de3f67 age=2mo2d31m dirty=77.38MiB
INFO [06-19\|13:17:37.634] Sending val enode share msg to proxy     func=sendValEnodeShareMsgs       proxy peer="Peer 16331f68e399c8af 10.192.11.7:30503" valAddresses length=119
INFO [06-19\|13:17:37.634] Skipping sending ValEnodesShareMsg b/c not validating func=sendValEnodesShareMsg
INFO [06-19\|13:17:42.725] Imported new chain segment               blocks=245  txs=1128  mgas=244.124 elapsed=5.753s    mgasps=42.428  number=18,831,000 hash=6af2e0..cfc589 age=2mo2d11m dirty=79.62MiB
INFO [06-19\|13:17:42.726] Downloader queue stats                   receiptTasks=0 blockTasks=33151 itemSize=1.83KiB  throttle=8192
INFO [06-19\|13:17:43.728] Unindexing transactions                  blocks=1727 txs=39667 total=2048 elapsed=1.002s
INFO [06-19\|13:17:44.066] Unindexed transactions                   blocks=2048 txs=47208 tail=16,481,001 elapsed=1.340s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xc44908]
goroutine 574387 [running]:
github.com/celo-org/celo-blockchain/contracts/currency.(*Currency).ToCELO(...)
github.com/celo-org/celo-blockchain/contracts/currency/currency.go:62
github.com/celo-org/celo-blockchain/miner.createConversionFunctions.func2(0x3c831c0a?, 0x4011eba400?)
github.com/celo-org/celo-blockchain/miner/block.go:380 +0x28
github.com/celo-org/celo-blockchain/core/types.NewTxWithMinerFee(0x401328e120, 0x4091e8fbd0, 0x4091e8fbe0)
github.com/celo-org/celo-blockchain/core/types/transaction.go:555 +0x88
github.com/celo-org/celo-blockchain/core/types.NewTransactionsByPriceAndNonce({0x1914680?, 0x405c819bb0}, 0x409b8fe1e0, 0x4091e8fbd0, 0x4091e8fbe0)
github.com/celo-org/celo-blockchain/core/types/transaction.go:609 +0x168
github.com/celo-org/celo-blockchain/miner.(*worker).constructPendingStateBlock(0x40029794a0, {0x1911e58, 0x403d4b1d40}, 0x40034d0a80)
github.com/celo-org/celo-blockchain/miner/worker.go:366 +0x408
github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.2()
github.com/celo-org/celo-blockchain/miner/worker.go:415 +0x38
created by github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1
github.com/celo-org/celo-blockchain/miner/worker.go:414 +0x1f4

Chain/Network: Mainnet

aaronmboyd commented 1 year ago

Possibly related to #1982

karlb commented 1 year ago

Maybe we should not silently discard the error in https://github.com/celo-org/celo-blockchain/blob/85c175776055fc739fb179d9426d7b158e0c3644/miner/block.go#L386

palango commented 1 year ago

This involves this function: https://github.com/celo-org/celo-blockchain/blob/de3f84b741d3472846a05754d49910e1bceb54dd/core/types/transaction.go#L604-L628

karlb commented 1 year ago

From reading the stack trace, I see only one way to cause the error and that is

However, I don't see this being the case for any of the fee currencies around that time:

for curr in 0x765DE816845861e75A25fCA122bb6898B8B1282a 0xD8763CBa276a3738E6DE85b4b3bF5FDed6D6cA73 0xe8537a3d056DA446677B9E9d6c5dB704EaAb4787
    cast call -r https://celo-mainnet.infura.io/v3/$INFURA_API_KEY 0xefB849352
39dAcdecF7c5bA76d8dE40b077B7b33 'function medianRate(address token) external view returns (uint256, uint256)' $curr -b 18831000
end
713715994804191731790000
1000000000000000000000000
653368356238021788240000
1000000000000000000000000
3553065332252728052300000
1000000000000000000000000

Have I missed anything?

karlb commented 1 year ago

I was unable to reproduce the panic with v1.7.4, but I just ran it locally and didn't use the docker container as described in the issue.

We could easily handle the case and avoid a panic (e.g. by not processing a tx when we can't convert the miner fee to CELO). But since it does not happen for all validators, we would get a consensus failure instead of a panic, which would be even harder to debug.

Alternatively, we could keep panicking and add some additional debugging output to show the specific tx that fails when it happens the next time.

aaronmboyd commented 3 days ago

Closing stale, and I believe this was resolved, although I don't think it was specifically reproducable on ARM.