chirpstack / chirpstack-gateway-bridge

ChirpStack Gateway Bridge abstracts Packet Forwarder protocols into Protobuf or JSON over MQTT.
https://www.chirpstack.io
MIT License

memory issue #86

Closed gillespilloudkerlink closed 6 years ago

gillespilloudkerlink commented 6 years ago

Hi all,

I'm currently using lora-gateway-bridge in a production environment with around 7500 uplinks / 750 downlinks per day.

Sometimes (3 times in 3 months) the available memory on the gateway suddenly decreases.

After I restarted the lora-gateway-bridge component, the memory was released.

image

before:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 29286 2.7 86.7 796328 220144 ? Sl Jun20 87:22 /opt/lora-gateway-bridge/bin/lora-gateway-bridge --log-level 4

after:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 5405 11.6 1.5 794984 3900 pts/0 Sl 11:56 0:00 /opt/lora-gateway-bridge/bin/lora-gateway-bridge --log-level 4

I haven't found anything wrong in the logs.

I'm using the following mLinux version on the Multitech:

root@mtgateway0036:/etc# more mlinux-version
mLinux 3.3.9
Built from branch: (detachedfrom0947d21)
Revision: 0947d212fc4c10871c66b4d83c97c5634dc1b8c7

Do you have an idea how the root cause can be found?

brocaar commented 6 years ago

Hi @gillespilloudwyres, that is odd. I have never seen this. If there were a memory leak in LoRa Gateway Bridge, I would expect the usage to increase gradually over time, not suddenly.

One thing that will probably be provided in the near future is Prometheus integration for the LoRa Server components. This will make it possible to monitor the health of each LoRa Server service (instance) and will also expose information about the Go runtime. I think this would help you debug your issue.

Do you think that would help you?

gillespilloudkerlink commented 6 years ago

Hello @brocaar, thanks for your reply. That is odd, I can confirm.

Alongside the Prometheus integration, could these statistics also be made available through an MQTT subscription, so that they do not depend on Prometheus? On my production platform I want to integrate all of my metrics into my Zabbix server.

Regards,

brocaar commented 6 years ago

Yes and no ;)

The Prometheus integration exposes an HTTP endpoint from which all the metrics can be fetched, e.g. at an interval of 10, 20, 30, ... seconds, so LoRa Gateway Bridge will not send these over MQTT.

However, it is really easy to make a service sitting next to LoRa Gateway Bridge that does a periodic GET, formats the metrics in a format usable by your Zabbix server and pushes them over MQTT :-)
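
A minimal sketch of such a side service, written in Python for brevity. It assumes the metrics are served in the Prometheus text format at a URL the caller supplies (the thread does not name the address), and it leaves the MQTT publish to a caller-supplied `publish` function, since neither the client library nor the topic is specified here. The function names are hypothetical, and the simple pattern does not handle label values that contain a closing brace.

```python
import json
import re
import urllib.request

# One sample line of the Prometheus text format: name, optional {labels}, value.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Parse Prometheus text-format metrics into {name_with_labels: float}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        try:
            metrics[name + (labels or "")] = float(value)
        except ValueError:
            continue  # ignore values that are not numeric
    return metrics


def fetch_metrics(url, timeout=5.0):
    """GET the metrics endpoint and return the parsed samples."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


def to_payload(metrics, wanted):
    """Keep only the wanted metric names and encode them as a JSON string."""
    selected = {name: metrics[name] for name in wanted if name in metrics}
    return json.dumps(selected, sort_keys=True)


def run_once(url, wanted, publish):
    """One polling cycle; `publish` is any callable that sends a string over MQTT."""
    publish(to_payload(fetch_metrics(url), wanted))
```

Calling `run_once` from a loop with a sleep, or from cron, gives the periodic GET. Names such as `go_goroutines` and `process_resident_memory_bytes` are what the Go Prometheus client usually exports by default, but the exact set depends on how the integration is built.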

gillespilloudkerlink commented 6 years ago

As I observed today, the gateway stops transmitting (TX) as soon as the available memory decreases. Sometimes the lack of memory seems to result in a reboot of the gateway or in the process being killed.

I am thinking of installing a watchdog to restart the process when the problem occurs, which is now 2 times per day.
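
A rough sketch of what such a watchdog could look like, in Python. It assumes a `ps aux` that prints the column layout shown in the first post (RSS in KiB in the sixth column) and that Python 3.7 or later is available on the gateway. The function names, the memory limit and the restart command are all hypothetical and must be adapted to the actual init system.

```python
import subprocess


def rss_kib(ps_output, command_substring):
    """Return the RSS (KiB) of the first `ps aux` row whose command matches, else None."""
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # 11 columns; COMMAND keeps its spaces
        if (len(fields) == 11 and fields[5].isdigit()
                and command_substring in fields[10]):
            return int(fields[5])  # sixth column is RSS; the header row is skipped
    return None


def needs_restart(ps_output, command_substring, limit_kib):
    """True when the matching process exists and its RSS exceeds the limit."""
    rss = rss_kib(ps_output, command_substring)
    return rss is not None and rss > limit_kib


def check_and_restart(command_substring, limit_kib, restart_cmd):
    """Run one check; restart_cmd is a hypothetical argv list for the restart."""
    ps_output = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    if needs_restart(ps_output, command_substring, limit_kib):
        subprocess.run(restart_cmd, check=True)
        return True
    return False
```

Run from cron every minute or so, this restarts the bridge only when its resident memory crosses the limit. The substring should be specific enough (for example the full binary path) that it does not match the watchdog's own command line.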

brocaar commented 6 years ago

Thanks for the feedback. In the meantime, I did spend some time on Prometheus integration. I will see if I can add this asap to LoRa Gateway Bridge so we have a bit more information about what could be the issue (e.g. it will expose information about the memory usage, number of goroutines + I can add my own counters, timers etc. too).

gillespilloudkerlink commented 6 years ago

What I continue to observe is that when the network goes down and comes back, the gateway-bridge has difficulty restoring communication with the Mosquitto server.

brocaar commented 6 years ago

I'm now running a test server which is processing ~ 90k uplinks per day. This dashboard is just an example, it is tracking more metrics:

image

brocaar commented 6 years ago

So far things are pretty stable

image

I will also do some more testing to see what happens when the MQTT connection becomes unavailable for a longer time.

gillespilloudkerlink commented 6 years ago

"I will also do some more testing to see what happens when the MQTT connection becomes unavailable for a longer time."

Yes, when the MQTT connection shuts down.

I think that the memory issue starts around these log lines, and a full restart of the gateway bridge is necessary to get the messages back:

time="2018-07-13T11:14:06+02:00" level=error msg="backend: mqtt connection error: write tcp a.b.c.d:34609->e.f.g.h:1883: write: connection reset by peer"

time="2018-07-13T11:14:07+02:00" level=info msg="backend: connected to mqtt broker"

time="2018-07-13T11:14:07+02:00" level=info msg="backend: re-registering to gateway topics" topic_count=1

brocaar commented 6 years ago

Quite recently a fix was added to the MQTT library I'm using for a deadlock that would cause an (MQTT) token to never return, meaning the call would hang forever (multiplied by the number of received UDP messages, this could increase the memory usage). Maybe this solves your issue? I have updated the vendored dependencies in the latest commit on the master branch. Could you re-test with that version?

If needed I could provide pre-compiled binaries. In case you compile it yourself, don't forget to run dep ensure after the git pull (to update the vendor directory).
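
To illustrate why a token that never returns can grow memory: assuming each received UDP message is handled concurrently and waits on its publish token (an assumption about the bridge's structure, not something stated in this thread), every hung wait keeps one handler alive forever. The Python analogue below uses threads and a hypothetical `Token` class, not the real MQTT library, to show how a bounded wait lets stuck handlers exit. It is not what the upstream fix does; that fix removes the deadlock itself.

```python
import threading


class Token:
    """Hypothetical stand-in for an MQTT publish token."""

    def __init__(self):
        self._done = threading.Event()

    def complete(self):
        """Called by the client once the publish has finished."""
        self._done.set()

    def wait(self, timeout=None):
        """Block until completed; with a timeout, return False instead of hanging."""
        return self._done.wait(timeout)


def handle_packet(token, timeout_s, stuck):
    """Handle one packet: wait for the publish, but give up after timeout_s seconds."""
    if not token.wait(timeout_s):
        stuck.append(token)  # record the failure and return, releasing the handler
        return False
    return True


if __name__ == "__main__":
    stuck = []
    # Twenty packets whose tokens are never completed, as in the deadlock case.
    handlers = [
        threading.Thread(target=handle_packet, args=(Token(), 0.1, stuck))
        for _ in range(20)
    ]
    for handler in handlers:
        handler.start()
    for handler in handlers:
        handler.join()
    print("handlers still alive:", sum(h.is_alive() for h in handlers))
    print("timed-out publishes:", len(stuck))
```

With `timeout=None` the same twenty handlers would never return, which is the accumulation described above.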

gillespilloudkerlink commented 6 years ago

Wow! Thanks a lot for this update! This problem seems related. Would it be possible for you to easily build it for a Multitech target?

Thanks

brocaar commented 6 years ago

@gillespilloudwyres I've generated these pre-compiled binaries for you: https://www.dropbox.com/sh/izdjggor4doytqd/AADMdKVctil54cpX7CCK4LILa?dl=0 (for the Conduit I believe you need the arm v6 binary).

gillespilloudkerlink commented 6 years ago

I have used the armv5 binary 👍

$ uname -a
Linux mtgateway0043 3.12.27r15 #1 PREEMPT Tue Feb 20 12:12:48 CST 2018 armv5tejl GNU/Linux

The binary is being tested at our office, and we plan to deploy it on the impacted environment as soon as we can.

gillespilloudkerlink commented 6 years ago

Deployed on the impacted environment. Is a specific log message expected when the MQTT deadlock occurs, or should the root cause be resolved by this version? I will update this ticket tomorrow after 24 hours of operation and around 350 000 RX / 2000 TX.

brocaar commented 6 years ago

The root cause of the deadlock should be resolved by this version (as it uses the updated mqtt client). There won't be anything in the logs that will indicate this.

brocaar commented 6 years ago

@gillespilloudwyres did this solve your issue?

gillespilloudkerlink commented 6 years ago

hello,

I have doubts about that. To have a working solution, a functional check has been developed to verify whether messages from a specific gateway arrive at the loraserver. If no messages arrive, a restart of the gateway bridge is triggered.

How should I integrate the Prometheus checks to verify the number of goroutines?

Regards,

brocaar commented 6 years ago

@gillespilloudwyres I'm not sure if I understand your comment. Does this version still have the memory issue that you had before?

gillespilloudkerlink commented 6 years ago

Hi Orne,

I still have memory glitches, but I have put a watchdog in place to handle them. In my case, my client's network behaves strangely (at the ICMP level, for example, ping packets are sometimes held for 30 seconds...).

I prefer to close this issue and work with my client to improve his network stability.

thanks