erigontech / erigon

Ethereum implementation on the efficiency frontier https://erigon.gitbook.io

erigon eats 100gb+ of memory when tracing a certain tx #4637

Closed: banteg closed this issue 2 years ago

banteg commented 2 years ago

System information

Erigon version: erigon version 2022.07.1-alpha-09776394

OS & Version: Linux

Commit hash : 0977639431fe520fc77399d03cdeba36526d2d52

Expected behaviour

an rpc call returns a trace

Actual behaviour

erigon gobbles up 100GB+ of memory and gets killed by the system

Steps to reproduce the behaviour

run debug_traceTransaction against any of these txs:

Backtrace

not available, erigon gets killed by the system

mandrigin commented 2 years ago

@banteg is there anything more recent that shows this or similar behaviour? I have a pruned node, so I can't check that far back in history. Or are similar transactions that aren't half a year old tracing just fine?

MysticRyuujin commented 2 years ago

This is also affecting the latest stable/beta/deprecated release.

{"jsonrpc":"2.0","id":1,"method":"debug_traceTransaction","params":["0xb9e6b6f275212824215e8f50818f12b37b7ca4c2e0b943785357c35b23743b94"]}
AlexeyAkhunov commented 2 years ago

This is because of this PR: https://github.com/ledgerwatch/erigon/pull/2779

mandrigin commented 2 years ago

ah, okay, then we probably need to think about adding some kind of pagination/limitation for these traces, or some binary response

mandrigin commented 2 years ago

I also wonder if another JSON serialization lib could work there; encoding/json's Marshal isn't the most frugal code.

banteg commented 2 years ago

@banteg is there anything more recent that shows this or similar behaviour? I have a pruned node, so I can't check that far back in history. Or are similar transactions that aren't half a year old tracing just fine?

no, my dataset consisted of 11,000 transactions and only these three had this behavior

darkhorse-spb commented 2 years ago

I'm also having this issue with stable release, tx 0x42b8205ed4c9d9de39340999c05327543f422b4ca881ae5910d56b3ad62d19c6

mandrigin commented 2 years ago

okay, what we can try is changing the JSON serialization library in an experimental branch, and then @banteg @darkhorse-spb, if you can test it on your machines, we'll see if that helps at all

AskAlexSharov commented 2 years ago

@mandrigin debug_traceTransaction is already using the jsoniter.Stream serialization lib, and it must do streaming (in the no-batch and no-websocket cases); it probably doesn't because I disabled it in ./rpc/handler.go handleMsg to fix the broken JSON format in case of errors.

It's impossible to stream JSON and still return an error if one happens in the middle of streaming, because JSON is not a streaming-friendly format.
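
To illustrate the point, here is a minimal sketch (not Erigon's actual handler; the traceStep struct, step counts, and failure point are made up) of streaming a trace array with jsoniter.Stream: once part of the array has been flushed to the client, a later error can no longer be turned into a clean JSON-RPC error object.

```go
package main

import (
	"errors"
	"fmt"
	"os"

	jsoniter "github.com/json-iterator/go"
)

// traceStep stands in for one opcode-level trace entry; the real struct
// is much larger, which is what makes these responses so big.
type traceStep struct {
	Op  string `json:"op"`
	Gas uint64 `json:"gas"`
}

// writeTrace streams a JSON array of steps, flushing periodically so RAM
// stays flat. An error part-way through leaves a truncated array behind.
func writeTrace(stream *jsoniter.Stream, steps int) error {
	stream.WriteArrayStart()
	for i := 0; i < steps; i++ {
		if i > 0 {
			stream.WriteMore()
		}
		stream.WriteVal(traceStep{Op: "PUSH1", Gas: 3})
		if i%1024 == 0 {
			// After this flush the bytes are on the wire and cannot be recalled.
			if err := stream.Flush(); err != nil {
				return err
			}
		}
		if i == steps/2 {
			// Simulated mid-stream failure: the client already holds half an array.
			return errors.New("state read failed mid-trace")
		}
	}
	stream.WriteArrayEnd()
	return stream.Flush()
}

func main() {
	stream := jsoniter.NewStream(jsoniter.ConfigDefault, os.Stdout, 4096)
	if err := writeTrace(stream, 4096); err != nil {
		// Too late to send a well-formed JSON-RPC error object instead.
		fmt.Fprintln(os.Stderr, "trace failed:", err)
	}
}
```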

mandrigin commented 2 years ago

I also have a weird idea of using ETL to first dump everything to binary files, check for errors, and then stream the results.
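
A rough sketch of that spool-first idea (using a plain temp file rather than Erigon's actual ETL package, so the helpers below are made up for illustration): render the whole response to disk, and only start streaming to the client if no error occurred, so a failure can still become a normal JSON-RPC error.

```go
package main

import (
	"io"
	"os"
)

// renderToSpool writes the full response to a temporary file instead of the
// client connection, so a failure can still abort the request cleanly.
func renderToSpool(render func(io.Writer) error) (*os.File, error) {
	spool, err := os.CreateTemp("", "trace-*.json")
	if err != nil {
		return nil, err
	}
	if err := render(spool); err != nil {
		spool.Close()
		os.Remove(spool.Name())
		return nil, err // the client still gets a normal JSON-RPC error
	}
	if _, err := spool.Seek(0, io.SeekStart); err != nil {
		spool.Close()
		os.Remove(spool.Name())
		return nil, err
	}
	return spool, nil
}

// serveTrace only starts writing to the client once the spool is complete,
// trading disk space for bounded RAM and a well-formed response.
func serveTrace(w io.Writer, render func(io.Writer) error) error {
	spool, err := renderToSpool(render)
	if err != nil {
		return err
	}
	defer func() {
		spool.Close()
		os.Remove(spool.Name())
	}()
	_, err = io.Copy(w, spool)
	return err
}

func main() {
	_ = serveTrace(os.Stdout, func(w io.Writer) error {
		_, err := io.WriteString(w, `{"result":[]}`)
		return err
	})
}
```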

mandrigin commented 2 years ago

but the question is also, what eats all this RAM?

@banteg can I ask you to run Erigon with the built-in rpc daemon and with --pprof and then when it begins eating RAM, maybe at 60 or 80 GB, do curl http://127.0.0.1:6060/debug/pprof/heap > heap.out and attach this file here? then I can look at the profiler too

AskAlexSharov commented 2 years ago

We decided to enable back streaming feature by default:

https://github.com/ledgerwatch/erigon/pull/4647 Erigon has enabled JSON streaming for some heavy endpoints (like trace_*). It's a trade-off: it greatly reduces the amount of RAM (in some cases from 30GB to 30MB), but it produces invalid JSON if an error happens in the middle of streaming (because JSON is not a streaming-friendly format).

We decided that the value of this streaming outweighs handling the rare "error happens in the middle" corner case. But we added a flag, --rpc.streaming.disable, for users who prefer to pay (in RAM) for correctness or compatibility.

mandrigin commented 2 years ago

@banteg @darkhorse-spb can you check in the current devel version and see if it helped?

tjayrush commented 2 years ago

but it produces invalid JSON if an error happens in the middle of streaming (because JSON is not a streaming-friendly format)

Is it Go code? We ran into the same issue with TrueBlocks. We stream our data too.

We were able to get around it using a defer call that closes any open JSON objects or arrays. It's not perfect -- it doesn't work that well with nested objects, but it works for simple arrays and simple objects, for example. If any sub-routine returns an error, the defer simply closes the array.

If the program crashes and a subroutine never returns, it doesn't work, but the program crashed, so something isn't working anyway.

AskAlexSharov commented 2 years ago

Then the user will not see the error message at all

tjayrush commented 2 years ago

We attach the error as another field in the object in the defer method. Not perfectly compliant JSON, but it works. (Perfectly compliant JSON, if it returns an error, should return empty data -- but that's not possible since you've already streamed the data.)
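
A minimal sketch of the pattern described above (illustrative only, not TrueBlocks' actual code; the function and field names are made up): the deferred footer always closes the open array and object, and appends the error as an extra field if one was set part-way through.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"strconv"
)

// streamItems writes a JSON object with a "data" array. The deferred footer
// always closes the array and the object, and appends an "error" field when
// a sub-routine failed part-way through the stream.
func streamItems(w io.Writer, items []int, fail bool) (err error) {
	fmt.Fprint(w, `{"data":[`)
	defer func() {
		fmt.Fprint(w, "]")
		if err != nil {
			fmt.Fprintf(w, `,"error":%q`, err.Error())
		}
		fmt.Fprintln(w, "}")
	}()

	for i, v := range items {
		if i > 0 {
			fmt.Fprint(w, ",")
		}
		fmt.Fprint(w, strconv.Itoa(v))
		if fail && i == len(items)/2 {
			// Simulated failure after part of the data has already gone out.
			return errors.New("backend failed mid-stream")
		}
	}
	return nil
}

func main() {
	// Prints {"data":[1,2,3],"error":"backend failed mid-stream"}
	_ = streamItems(os.Stdout, []int{1, 2, 3, 4}, true)
}
```

In this sketch the truncated output still parses as JSON, though the data is incomplete; whether a given client library tolerates that is another question.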

AskAlexSharov commented 2 years ago

@tjayrush it may even work in many client libs. Do you have an open-source example?

tjayrush commented 2 years ago

I'm almost embarrassed to show it. It's super hacky, but here's an example: https://github.com/TrueBlocks/trueblocks-core/blob/feature/new-unchained-index-2.0/src/apps/chifra/internal/chunks/handle_addresses.go#L66. The RenderFooter routine (which closes an array and an object; everything our API delivers has the same shape) gets called even if an error happens. We deliver the error on standard error many levels above this code, so it just closes the JSON object and returns the error (or nil if there is no error).

AskAlexSharov commented 2 years ago

thanks, will try tomorrow

mandrigin commented 2 years ago

@AskAlexSharov do you want to keep this one around?

AskAlexSharov commented 2 years ago

It's fixed: streaming is enabled. But we also need to add this approach: https://github.com/ledgerwatch/erigon/issues/4637#issuecomment-1176407488

mandrigin commented 2 years ago

okay @nanevardanyan will take a look at the error handling then

banteg commented 2 years ago

seems fixed on erigon's side, but clients would need to consider streaming too. one of the traces i reported yields a 66.5GB response. here is a small script which will show both compressed and uncompressed size of the response.

https://gist.github.com/banteg/98dbccbf6e2a3f997199a1b16eb93c5a
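
For reference, a hedged Go equivalent of that kind of measurement (the gist itself is the authoritative version; the local rpcdaemon URL below is an assumption, and the tx hash is one from this thread): stream the debug_traceTransaction response and count raw and gzip-compressed bytes instead of buffering the whole thing in memory.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
)

// countingWriter counts bytes written through it and discards them.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

func main() {
	// Assumes a local Erigon rpcdaemon listening on the default HTTP port.
	const rpcURL = "http://127.0.0.1:8545"
	const txHash = "0xb9e6b6f275212824215e8f50818f12b37b7ca4c2e0b943785357c35b23743b94"

	body := fmt.Sprintf(`{"jsonrpc":"2.0","id":1,"method":"debug_traceTransaction","params":[%q]}`, txHash)
	resp, err := http.Post(rpcURL, "application/json", bytes.NewBufferString(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	raw := &countingWriter{}
	compressed := &countingWriter{}
	gz := gzip.NewWriter(compressed)

	// Tee the streamed response into the gzip writer while counting raw bytes.
	if _, err := io.Copy(gz, io.TeeReader(resp.Body, raw)); err != nil {
		panic(err)
	}
	gz.Close()

	fmt.Printf("raw: %.1f GB, gzip: %.1f GB\n",
		float64(raw.n)/1e9, float64(compressed.n)/1e9)
}
```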

banteg commented 2 years ago

reran with my dataset. you can clearly see the outliers i found earlier.

[scatter plots attached: trace-size, elapsed-size, gas-size]

here are response sizes:

0x9ef7a35012286fef17da12624aa124ebc785d9e7621e1fd538550d1209eb9f7d = 41.4 GB (2.2 GB compressed)
0xd770356649f1e60e7342713d483bd8946f967e544db639bd056dfccc8d534d8e = 43.9 GB (2.4 GB compressed)
0x2428a69601105c365b9fe9d2f30688b91710b6a43bc6d2026344674ae7ffcac3 = 50.4 GB (2.9 GB compressed)
0xb9e6b6f275212824215e8f50818f12b37b7ca4c2e0b943785357c35b23743b94 = 71.5 GB (3.5 GB compressed)

all other traces are under 4 GB.