ChainSafe / lodestar

🌟 TypeScript Implementation of Ethereum Consensus
https://lodestar.chainsafe.io
Apache License 2.0

Document getting flamechart of live node #5341

Closed dapplion closed 1 year ago

dapplion commented 1 year ago

To understand what's affecting Lodestar performance, our current strategy is to attach a Chromium DevTools instance to a node running with `node --inspect`.

Those DevTools can render a stack chart over time, but not a regular flamechart aggregated purely by stack occurrences. The information exposed by `node --inspect` should be enough to produce a flamechart.

CC: @matthewkeil @nflaig @tuyennhv

nflaig commented 1 year ago

Could give node-clinic a try, specifically node-clinic-flame.

The flame graph output looks like this:

[screenshot]

dapplion commented 1 year ago

Can it attach to a live process after N time? The docs I've seen appear to capture flamecharts over the entire process lifetime.

nflaig commented 1 year ago

Based on the clinic flame flags I don't see an option to do that.

It can be used programmatically as well; then we could just call `flame.collect` after N time, see the node-clinic-flame docs.

We might have to use 0x directly to get more fine-grained control, as it seems to have a `--collect-delay` flag ("Specify a delay (ms) before collecting data").
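As a sketch, assuming the flag behaves as its help text describes (delay in milliseconds before collection starts; `./app.js` stands in for the real entry point):

```shell
# Run the target under 0x, but wait 5s (5000 ms) before it starts collecting stacks
npx 0x --collect-delay 5000 -- node ./app.js
```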

matthewkeil commented 1 year ago

All of the libraries I found seem to want to run the node process directly and cannot be turned on and then off again programmatically. After a bit of googling I found this blog and noted the flags. I looked at the 0x code and found a few things that helped me better understand how that library works.

It looks like they run the process here and here, depending on whether they are profiling v8 or Linux.

For v8 they output the isolate data to a log. They then parse the data into ticks here to build the graph data.

For Linux they launch via a system-level `perf` command with `--perf-basic-prof`, and then turn the trace into ticks here.

Sadly, in both situations they use a flag to output perf data. I am doing a bit more digging into how it is implemented to understand what the performance implications will be.

There are 4 perf flags available, and the node flamegraph docs cover a few of the details.

I am reading the Chrome Debugging Protocol docs that were referenced in the debugger section of the node docs to see how the protocol works, so we can potentially leverage the flag.

Updated: I am guessing there will be some degradation, so it might not be ideal for prod, but I will add another comment below when I get further.

I researched the performance implications of the `--inspect` flag, and there is no overhead when a debugger is not attached. However, when one is attached, the slowdown is reported as 100x to 300x according to this thread on SO.

This post elaborates a bit on the security risks, but it also mentions that debugging can be flipped on with `kill -usr1 ${PID}`, though I have not tested that.
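The `kill -usr1` claim is easy to check without touching production code. A minimal sketch (the inline child script and timings are illustrative): spawn a throwaway node child, send it `SIGUSR1`, and watch its stderr for the inspector banner. This relies on Node's documented behavior of starting the inspector on `SIGUSR1` (Linux/macOS; Windows has no `SIGUSR1`):

```typescript
import {spawn} from "node:child_process";

// Spawn a throwaway node process that just stays alive (inline script is illustrative)
const child = spawn(process.execPath, ["-e", "setInterval(() => {}, 1000);"], {
  stdio: ["ignore", "ignore", "pipe"],
});

let stderr = "";
child.stderr?.on("data", (chunk) => (stderr += chunk.toString()));

// Give the child a moment to boot, then flip the inspector on (same effect as `kill -usr1 $PID`)
setTimeout(() => child.kill("SIGUSR1"), 500);

// Shortly after, the child's stderr should contain "Debugger listening on ws://..."
setTimeout(() => {
  console.log(stderr.includes("Debugger listening") ? "inspector enabled" : "no banner seen");
  child.kill();
}, 1500);
```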

matthewkeil commented 1 year ago

During my journey I found an interesting video... https://www.youtube.com/watch?v=Xb_0awoShR8&t=570s

Updated: the speaker talks about `process._debugProcess(pid)` to turn on debugging from outside of a running node instance. Incidentally, it is what node uses under the hood to turn on debugging. Both it and `process._debugEnd` are available.

This is another highlight https://youtu.be/Xb_0awoShR8?t=682

Updated: the speaker talks about the core debugging protocol and using it for profiling. See these links for more detail:
https://nodejs.org/dist/latest-v18.x/docs/api/inspector.html#cpu-profiler
https://chromedevtools.github.io/devtools-protocol/v8/Profiler/
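To make the Profiler domain from those docs concrete, here is a minimal sketch using only the built-in `node:inspector` module: an in-process session that starts the CPU profiler, waits, and stops it. The returned profile is the same `.cpuprofile` JSON shape that DevTools and flamegraph tooling consume (the duration and the promisified wrapper are my own; `session.post` in core is callback-based):

```typescript
import {Session} from "node:inspector";
import type {Profiler} from "node:inspector";

// Small promise wrapper around the callback-style Session.post
function post<T>(session: Session, method: string, params?: object): Promise<T> {
  return new Promise((resolve, reject) => {
    session.post(method, params, (err, result) => (err ? reject(err) : resolve(result as T)));
  });
}

export async function captureCpuProfile(durationMs: number): Promise<Profiler.Profile> {
  const session = new Session();
  session.connect(); // in-process session; no --inspect flag required
  await post(session, "Profiler.enable");
  await post(session, "Profiler.start");
  await new Promise((resolve) => setTimeout(resolve, durationMs));
  const {profile} = await post<{profile: Profiler.Profile}>(session, "Profiler.stop");
  session.disconnect();
  return profile; // JSON-serializable; save as *.cpuprofile to open in DevTools or speedscope
}
```

Since this runs in-process, something like it could be triggered on demand (e.g. behind an API endpoint or signal handler) after the node has been running for N time.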

starting to get somewhere notable I think 😄

matthewkeil commented 1 year ago

@dapplion I also found the reference I mentioned on standup; it is a YouTube video, not a blog, which is why I couldn't find it with a Google search. It is a Netflix engineer talking about flame graphs on a running node process in prod.

The speaker talks about the `--perf` flags mentioned above in this video here, in particular the use of `--perf-basic-prof-only-functions` to generate the flamegraph. He mentions that it has very low impact on the running process.

They are using the Linux tool `perf` and describe how they implement it here. It is the same tool that 0x uses for Linux.

Netflix uses brendangregg/FlameGraph to generate the flamegraphs. 0x uses a custom implementation that renders and bundles an HTML page that is in its source.
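For reference, the classic `perf` + FlameGraph pipeline looks roughly like this (a sketch from the brendangregg/FlameGraph README rather than the video; the entry point, sample rate, and duration are placeholders):

```shell
# 1. Run the node with the low-overhead map flag so perf can symbolize JIT'd JS frames
node --perf-basic-prof-only-functions ./app.js &
PID=$!

# 2. Sample the live process: 99 Hz, with call graphs, for 30 seconds
perf record -F 99 -p "$PID" -g -- sleep 30

# 3. Fold the stacks and render an SVG using the FlameGraph scripts
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg
```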

As a note there was some GREAT other stuff in that video about post mortem debugging with core-dumps that was very interesting. The whole video is def worth watching.

matthewkeil commented 1 year ago

@dapplion check out a branch diff here that has an idea for how to generate the stack traces. I talked with @Faithtosin about strategies to collect the flamegraph data when not running locally. Please tell me what you think.

@Faithtosin was asking me about the scope: where and when would you like to run this? Is it just something contributors will want to run on the cloud nodes, or should it also run locally? I have had challenges running on Mac, though I have only tried to capture OS-level (not v8-level) stacks there, to mimic what happens on Linux.

dapplion commented 1 year ago

@tuyennhv captures CPU profiles regularly to ensure Lodestar's performance profile is good.

So we always run on demand on specific machines. The goal here is to make it easier so we do it more often. But I'm not sure we should bake it into production code; it sounds more like a job for external tooling.

It should be run on our test nodes in the cloud.

matthewkeil commented 1 year ago

@dapplion I added as many references as I could here. It has most of the breadcrumbs I found and the libraries I seriously considered (and whose code I looked through). The rest that came up in searches were either not widely used (fewer than 100 installs weekly) or very old, without a commit since 2017.