TheThingsArchive / packet_forwarder

Packet forwarder for Linux-based gateways
MIT License

High cpu usage #39

Closed cambierr closed 6 years ago

cambierr commented 7 years ago

This is a bug report for the Kerlink packet forwarder.

From what I can see on a kerlink gateway with the newly installed TTN packet forwarder, the CPU usage is very high:

[Screenshot: `top` output, captured 2017-06-01 21:25]

`top` shows a CPU usage of at least 50% by the TTN forwarder. Any ideas on this?

egourlao commented 7 years ago

This is definitely an issue that needs to be addressed.

To investigate this, we can use a hidden flag for the packet forwarder, `--run-trace`, that lets us create a Go trace; for example, after running the packet forwarder with `packet-forwarder start --trace trace.out`, we can read the result with `go tool trace trace.out`. The trace option can also be activated by adding `trace: <trace-filepath>` to the configuration file. I don't have access to a Kerlink right now, but if you want to look into it, that's where I would start.

cambierr commented 7 years ago

I'll try to look into this. If needed, I can give you SSH access to some Kerlinks.

egourlao commented 7 years ago

The gateway I'm using for testing should be back soon, but thanks for the proposal anyway! I'll let you know what the tests tell me.

cambierr commented 7 years ago

Small update: the high CPU usage is still present in 2.0.2 ;-)

[Screenshot: CPU load graph, captured 2017-06-07 18:44] The green line is a Kerlink using the "default" packet forwarder ;-)

egourlao commented 7 years ago

Thanks for the report! @h0l0gram also reported a high CPU usage on Raspberry Pis (~15% on average):

[Screenshot: packet forwarder CPU load graph]

cambierr commented 7 years ago

0.15 is nice, compared to my loadavg of 1.5-1.6 ;-)

egourlao commented 7 years ago

This is what a CPU profile of a 4-minute run of the packet forwarder returns, on a Multitech Conduit build (built on the feature/benchmark branch, commit ba8ede1e4bf71ee204540d57ae15848412dc836d):

[Screenshot: CPU profile, captured 2017-06-19 14:03]

Most of the functions are Go runtime functions. Before looking into them, I looked into the only resource-heavy application function that appeared in this profile, wrapper.Receive:

[Screenshot: wrapper.Receive profile, captured 2017-06-19 14:07]

It appears that the most expensive calls are the Cgo allocations and Cgo function calls: C.lgw_receive takes 2.18 minutes of processor time within a 4-minute run, and the memory allocation for the packets takes 4.67 seconds. Although there isn't a lot of literature on Cgo, the few existing blog posts and exchanges point in the direction of Cgo having a large overhead: the first blog post notably gives an example of Cgo sleeps taking 100x the duration of pure Go sleeps.
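One way to attack the allocation cost specifically is to allocate the packet buffer once and reuse it across calls, instead of allocating per call. A pure-Go sketch of the pattern; `receiveInto` and the `Packet` struct are hypothetical stand-ins for `wrapper.Receive` and the HAL's RX packet type, not the forwarder's real API:

```go
package main

import "fmt"

// Packet is a stand-in for the HAL's RX packet struct.
type Packet struct {
	Payload [256]byte
	Size    int
}

// receiveInto is a hypothetical stand-in for the Cgo receive call:
// it fills the caller-provided buffer and returns the packet count,
// so no allocation happens per call.
func receiveInto(buf []Packet) int {
	if len(buf) == 0 {
		return 0
	}
	// Simulate one received packet.
	buf[0].Size = 12
	return 1
}

func main() {
	// Allocate once, outside the receive loop.
	buf := make([]Packet, 8)
	total := 0
	for i := 0; i < 1000; i++ {
		total += receiveInto(buf)
	}
	fmt.Println("packets received:", total)
}
```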

I can see two possible ways to explore this problem:

- reimplementing the HAL logic in pure Go, dropping the Cgo dependency;
- keeping the Cgo HAL, but reducing the frequency and overhead of the Cgo calls.

I'd be in favor of investigating the first solution. It could help us solve a lot of other issues within the packet forwarder: it would, for example, let us build tests for the packet forwarder (currently hard because of the tight linking between the PF logic and the HAL, which decoupling would remove), give us a leaner and more Go-ish build process, and give us more insight into what's happening in the HAL (for example, for this high-priority reliability issue). After taking a quick look at the HAL (Receive TX packets function, Send TX packets function), the logic doesn't seem excessively complicated.

cc @johanstokking @htdvisser

cambierr commented 7 years ago

Any update on this one?

egourlao commented 7 years ago

Sorry, I haven't had much time until now to look into the issue!

I've been spending a bit of time running benchmarks of the lora_gateway library, comparing C and Go calls. Given the profiling results above, I thought Cgo (the C interface for Go) was causing the performance issues because of its computing overhead. However, I'm not so convinced now: after running the benchmarks, calling lgw_receive 100,000 times from a pure C benchmark takes ~4 minutes, whereas calling it from a Go program takes ~6 minutes, which isn't that much of a difference. So the performance issues might not come from Cgo, but from another part of the program. I've only run my benchmarks on a Multitech Conduit though, so the performance issues might be specific to the Kerlink binaries.
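The Go side of such a comparison can be a simple timing harness around the call under test. A sketch, with a stub closure standing in for the actual (Cgo) lgw_receive call:

```go
package main

import (
	"fmt"
	"time"
)

// timeCalls measures the total wall time of n consecutive calls to fn.
func timeCalls(n int, fn func()) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		fn()
	}
	return time.Since(start)
}

func main() {
	// stub stands in for the Cgo call being benchmarked.
	stub := func() { time.Sleep(time.Microsecond) }
	n := 1000
	d := timeCalls(n, stub)
	fmt.Printf("%d calls took %v (%v per call)\n", n, d, d/time.Duration(n))
}
```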

However, as raised by @johanstokking, the issue isn't only the computing time, but also the memory allocations the Go runtime makes for every Cgo call (see this thread). This makes sense given the high frequency of Cgo calls (1 call every 100µs at the beginning, up to 1 call every 5ms after startup) and the hardware used by the IoT Station. One way to mitigate this could be to block execution inside the Cgo code rather than inside the Go code when there is no RX packet to process, so that the Go runtime doesn't have to cross the Go/Cgo boundary every few milliseconds.

cambierr commented 7 years ago

I'm not familiar enough with that technology to investigate this myself, but if there's anything you want me to try on the Kerlinks I have (two different models, plus a few iBTS gateways), feel free to ask.