QuantumEntangledAndy / neolink

An RTSP bridge to Reolink IP cameras
GNU Affero General Public License v3.0

Potential memory leak #202

Open Dinth opened 9 months ago

Dinth commented 9 months ago

Describe the bug
Hi. I might have found a memory leak, but apologies - I am not really able to provide more details; I have been dealing with an IT disaster at home and my only access to the Docker machine is via a Java client connected to iDRAC. Last night my pfSense box died, together with the DHCP server.

I can only provide some screenshots, as my Docker machine lost its DHCP lease and is offline. Screenshot from 2023-12-10 12-06-57

Neolink logs - hundreds of screens of this: image

Versions
NVR software: Frigate
Neolink software: (apologies, currently my Docker machine is down, but it's the latest Docker image as of 9/12/2023, kept updated by Watchtower)

bwthor commented 9 months ago

I'm running mine on Windows, and resorted to just using Task Scheduler to kill and restart the neolink service every 3 hours. Not sure the memory issues have been worked out, and unfortunately I haven't seen the maintainer around here for a few weeks.

zeluisg commented 8 months ago

I'm running it in an LXC container on a Proxmox server. I still have a steady memory leak that ends up crashing the container (4GB).

QuantumEntangledAndy commented 8 months ago

I'm aware. I just don't have the time to deal with this at the moment. It will be a while before my other commitments clear up and I can get back to coding on this one.

I suspect this is happening on the gstreamer side of things, and I want to look into using a gstreamer buffer pool instead of constantly creating new buffers for each frame. But I can't handle it yet.
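
For context, a rough sketch of what a gstreamer buffer pool could look like with the gstreamer-rs bindings, as opposed to allocating a fresh gst::Buffer per frame. This is only an illustration under assumed buffer sizes and pool limits, not neolink's actual pipeline code:

// Cargo.toml (illustrative): gstreamer = "0.22"
use gstreamer as gst;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // Configure a pool of fixed-size buffers that get recycled instead of
    // allocating a brand-new gst::Buffer for every frame.
    let pool = gst::BufferPool::new();
    let mut config = pool.config();
    config.set_params(None, 4096, 2, 16); // buffer size, min buffers, max buffers (placeholders)
    pool.set_config(config)?;
    pool.set_active(true)?;

    // Acquire a buffer from the pool, fill it, and let it return to the pool
    // when the last reference is dropped instead of hitting the allocator.
    let mut buffer = pool
        .acquire_buffer(None)
        .map_err(|e| format!("failed to acquire pooled buffer: {e:?}"))?;
    {
        let buffer = buffer.get_mut().expect("sole owner of the pooled buffer");
        let mut map = buffer.map_writable()?;
        map.as_mut_slice().fill(0); // the frame payload would be copied in here
    }
    drop(buffer); // returns to the pool, ready for the next frame

    pool.set_active(false)?;
    Ok(())
}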

QuantumEntangledAndy commented 5 months ago

I've posted this on the other issue too, but thought I'd post here in case you aren't subscribed there. I set up valgrind in some docker containers to test things:

Camera and Client Setup

Debian Bookworm gstreamer1.0-x (1.22.0-3+deb12u1)

Screenshot 2024-04-08 at 15 16 17

Debian Sid gstreamer1.0-x (1.24.1-1 and others)

Screenshot 2024-04-08 at 15 16 31

It seems to be stable at 4MB. Are there any more setup details that could help me find out what is causing this? How are you connecting, etc.?

Dinth commented 5 months ago

Hey. I have just restarted the neolink container and noticed that it's not working again and is taking over 1GB of RAM. After the restart it was using 60MB of RAM but had already grown to 100MB after only a couple of minutes.

I've got one Argus 2 camera and I'm hooking neolink up to the Frigate NVR.

image

Dinth commented 5 months ago

After around an hour: image

QuantumEntangledAndy commented 5 months ago

Perhaps this is frigate creating many client connections. There was a similar issue elsewhere with the connections from frigate not closing fully and frigate just opening more and more connections. Can't remember what we did to fix that.

I might spin up a frigate docker to test this against.

QuantumEntangledAndy commented 5 months ago

As another thing to try, I could write a modified Docker image that dumps the valgrind info. Maybe you could run it?

Dinth commented 5 months ago

I am happy to help with testing, but I would greatly appreciate it if you could upload it as a new branch (so I can just replace the image source in Portainer), as I'm on a course this week.

Dinth commented 5 months ago

After 10h: image

fleaz commented 5 months ago

I can also easily recreate this problem. I currently have a 3G memory limit on the container, and the container gets killed roughly every ~2-3 hours.

If you need any help to collect more information on this, I'm more than happy to help.

QuantumEntangledAndy commented 5 months ago

@Dinth What is the architecture of your Portainer machine? Can I build just the x86_64 or am I going to need to go the extra mile and build arm as well?

Dinth commented 5 months ago

@Dinth What is the architecture of your Portainer machine? Can I build just the x86_64 or am I going to need to go the extra mile and build arm as well?

Im on x86_64, many thanks!

QuantumEntangledAndy commented 5 months ago

OK, so the docker image will be here (in half an hour):

docker pull quantumentangledandy/neolink:test-valgrind

The bindings of the config and ports are the same as usual.

BUT there is an extra volume you should mount, /valgrind/; the valgrind output goes in there (specifically /valgrind/massif.out, but binding the whole dir is fine too).

Valgrind output is only created when the app exits normally, not when it is killed by docker stop (or other forceful stopping methods). To help with that I have added a timeout of 30 mins, after which it will stop itself and write the file I need, which will be massif.out. If you really need to stop it early, send a SIGINT (not sure how Portainer can do that).
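
For reference, a hedged example of what the run command might look like (the host-side paths and the RTSP port mapping are placeholders for your own setup):

docker run --rm \
  -v /path/to/neolink.toml:/etc/neolink.toml \
  -v /path/to/valgrind-out:/valgrind \
  -p 8554:8554 \
  quantumentangledandy/neolink:test-valgrind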

P.S. The docker image is still building here: https://github.com/QuantumEntangledAndy/neolink/actions/runs/8611052761. Please wait half an hour or so until it is ready before you pull it.

QuantumEntangledAndy commented 5 months ago

P.S. The docker build succeeded. Please run it when you can and post the massif.out.

Dinth commented 5 months ago

Here's the generated file: massif.out.zip

QuantumEntangledAndy commented 5 months ago
Screenshot 2024-04-09 at 19 17 39

This is the memory profile of that massif.out

Dinth commented 5 months ago

Hey. It's a really weird thing, but it seems that since I moved to the test-valgrind branch, my neolink RAM usage has stopped growing uncontrollably. I have restarted the container twice since, and it stops growing at 400MB (as shown by Portainer). I think I still need some more time to test that (since the RAM usage was not growing immediately but only after some time).

Dinth commented 5 months ago

Ahh, that's because it SIGTERMs:

2024-04-09T12:58:42.005180861Z ==9== 

2024-04-09T12:58:42.005248333Z ==9== Process terminating with default action of signal 15 (SIGTERM)

2024-04-09T12:58:42.005593091Z ==9==    at 0x558D719: syscall (syscall.S:38)

2024-04-09T12:58:42.005701720Z ==9==    by 0x1D40836: parking_lot_core::thread_parker::imp::ThreadParker::futex_wait (linux.rs:112)

2024-04-09T12:58:42.005734066Z ==9==    by 0x1D405E3: <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park (linux.rs:66)

2024-04-09T12:58:42.005793619Z ==9==    by 0x1D49C9D: parking_lot_core::parking_lot::park::{{closure}} (parking_lot.rs:635)

2024-04-09T12:58:42.005846198Z ==9==    by 0x1D4882E: with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#0}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#1}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#2}>> (parking_lot.rs:207)

2024-04-09T12:58:42.005869718Z ==9==    by 0x1D4882E: parking_lot_core::parking_lot::park (parking_lot.rs:600)

2024-04-09T12:58:42.005889943Z ==9==    by 0x1D4EFD8: parking_lot::condvar::Condvar::wait_until_internal (condvar.rs:333)

2024-04-09T12:58:42.005942304Z ==9==    by 0x1C3484D: parking_lot::condvar::Condvar::wait (condvar.rs:256)

2024-04-09T12:58:42.006016975Z ==9==    by 0x1CED5BB: tokio::loom::std::parking_lot::Condvar::wait (parking_lot.rs:149)

2024-04-09T12:58:42.006041703Z ==9==    by 0x1C636F4: tokio::runtime::park::Inner::park (park.rs:116)

2024-04-09T12:58:42.006061069Z ==9==    by 0x1C642D6: tokio::runtime::park::CachedParkThread::park::{{closure}} (park.rs:254)

2024-04-09T12:58:42.006077276Z ==9==    by 0x1C64475: tokio::runtime::park::CachedParkThread::with_current::{{closure}} (park.rs:268)

2024-04-09T12:58:42.006175963Z ==9==    by 0x1C21327: std::thread::local::LocalKey<T>::try_with (local.rs:286)

2024-04-09T12:58:42.006196923Z ==9==    by 0x1C64415: tokio::runtime::park::CachedParkThread::with_current (park.rs:268)

2024-04-09T12:58:42.006212884Z ==9==    by 0x1C6425D: tokio::runtime::park::CachedParkThread::park (park.rs:254)

2024-04-09T12:58:42.006268175Z ==9==    by 0x976218: tokio::runtime::park::CachedParkThread::block_on (park.rs:285)

2024-04-09T12:58:42.006436287Z ==9==    by 0xA9C604: tokio::runtime::context::blocking::BlockingRegionGuard::block_on (blocking.rs:66)

2024-04-09T12:58:42.006534954Z ==9==    by 0x397195: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}} (mod.rs:87)

2024-04-09T12:58:42.006628761Z ==9==    by 0xBD5663: tokio::runtime::context::runtime::enter_runtime (runtime.rs:65)

2024-04-09T12:58:42.006654879Z ==9==    by 0x397150: tokio::runtime::scheduler::multi_thread::MultiThread::block_on (mod.rs:86)

2024-04-09T12:58:42.006696927Z ==9==    by 0xA9FE94: tokio::runtime::runtime::Runtime::block_on (runtime.rs:351)

2024-04-09T12:58:42.006766288Z ==9==    by 0x909AAF: neolink::main (main.rs:85)

2024-04-09T12:58:42.006823569Z ==9==    by 0x98384A: core::ops::function::FnOnce::call_once (function.rs:250)

2024-04-09T12:58:42.006968811Z ==9==    by 0x64D28D: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:155)

2024-04-09T12:58:42.007054233Z ==9==    by 0x449190: std::rt::lang_start::{{closure}} (rt.rs:166)

2024-04-09T12:58:42.007226028Z ==9==    by 0x24E21F0: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284)

2024-04-09T12:58:42.007281819Z ==9==    by 0x24E21F0: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:554)

2024-04-09T12:58:42.007343138Z ==9==    by 0x24E21F0: try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:518)

2024-04-09T12:58:42.007402426Z ==9==    by 0x24E21F0: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:142)

2024-04-09T12:58:42.007460927Z ==9==    by 0x24E21F0: {closure#2} (rt.rs:148)

2024-04-09T12:58:42.007525706Z ==9==    by 0x24E21F0: do_call<std::rt::lang_start_internal::{closure_env#2}, isize> (panicking.rs:554)

2024-04-09T12:58:42.007574276Z ==9==    by 0x24E21F0: try<isize, std::rt::lang_start_internal::{closure_env#2}> (panicking.rs:518)

2024-04-09T12:58:42.007616162Z ==9==    by 0x24E21F0: catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> (panic.rs:142)

2024-04-09T12:58:42.007658013Z ==9==    by 0x24E21F0: std::rt::lang_start_internal (rt.rs:148)

2024-04-09T12:58:42.007672256Z ==9==    by 0x449169: std::rt::lang_start (rt.rs:165)

2024-04-09T12:58:42.007696468Z ==9==    by 0x909B5D: main (in /usr/local/bin/neolink)

2024-04-09T12:58:42.009966184Z ==9== 

2024-04-09T12:59:03.310544775Z ==9== Massif, a heap profiler

2024-04-09T12:59:03.310618694Z ==9== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote

2024-04-09T12:59:03.310633571Z ==9== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info

2024-04-09T12:59:03.310646373Z ==9== Command: /usr/local/bin/neolink rtsp --config /etc/neolink.toml

2024-04-09T12:59:03.310658339Z ==9== 
QuantumEntangledAndy commented 5 months ago

That SIGTERM is expected. It's the way the timeout command signals the app to stop. Nothing wrong with it at all.

QuantumEntangledAndy commented 5 months ago

Can you try pulling the latest master branch of the docker image? Perhaps something very recent fixed it.

Dinth commented 5 months ago

On another note, I can see that you're measuring the memory usage of a neolink process, but is it possible that:

I will try to check it, but I have already updated to Docker 26 and something is broken with the exec command. I will get back to you on this.

Regarding the master branch, I have been on :latest with Watchtower automatically updating neolink.

QuantumEntangledAndy commented 5 months ago

Nope, it will be a single process. Even if children were spawned, valgrind would track them too.

There are two main changes here

It's possible the release build breaks something with its code optimisations.

QuantumEntangledAndy commented 5 months ago

I could try to run valgrind on the release build. It will just make figuring out what is wrong much harder without the debug symbols.

QuantumEntangledAndy commented 5 months ago

Anyway, I'm off to sleep now, so I'll do that tomorrow.

Dinth commented 5 months ago

I've just been running the :latest release for 40 minutes and top shows: image

Looks like it's actually the neolink process using that memory.

smaps dump: https://paste.ubuntu.com/p/jd7W6rGDw2/

QuantumEntangledAndy commented 5 months ago

I'm building the release version of the valgrind docker here: https://github.com/QuantumEntangledAndy/neolink/actions/runs/8624926076. It should be ready soon.

QuantumEntangledAndy commented 5 months ago

Alright, the valgrind docker is ready. I will test it too.

QuantumEntangledAndy commented 5 months ago

This is my memory profile on the release build.

Screenshot 2024-04-10 at 10 24 49

Shame I can't replicate this

Dinth commented 5 months ago

Could you modify the Valgrind build to run for, let's say, 2 hours before SIGTERMing?

QuantumEntangledAndy commented 5 months ago

I could, but that would mean a full rebuild. It would be faster if you just changed the CMD in Portainer to something like this:

timeout 7200 valgrind --tool=massif --massif-out-file=/valgrind/massif.out /usr/local/bin/neolink rtsp --config /etc/neolink.toml
Dinth commented 5 months ago

Ahh, that's simple, thanks. I am still on a course but I will try to generate a new valgrind log today.

QuantumEntangledAndy commented 5 months ago

Given the symptoms we are seeing, I suspect we have memory fragmentation rather than a memory leak. Fragmentation differs depending on the OS's memory allocator, and valgrind replaces the default allocator with its own. This is why it is not showing up in valgrind.
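
One hedged way to probe this (purely an illustration of how allocator choice affects fragmentation, not something neolink currently does) would be to swap Rust's global allocator, e.g. for jemalloc via the tikv-jemallocator crate, and compare resident memory:

// Cargo.toml (illustrative): tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Replace the system allocator for the whole binary. If the growth is
// fragmentation rather than a true leak, a different allocator can change
// the resident-memory behaviour without any other code changes.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // ...existing neolink startup would go here...
}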

fleaz commented 5 months ago

Here is a valgrind dump from the latest image you built today: massiv.out

But even with the valgrind image I still see the growth in memory. The first two spikes were from a freshly built container from yesterday's master branch; the small peak at 14:00 was the 30min run of the valgrind container: 2024-04-10T14:49:59,535882416+02:00

Dinth commented 5 months ago

Here is my massif.out: massif.out.zip. I have been running it for 2 hours and the memory usage grew to 800MB during that time - it seems to grow more slowly with valgrind running, but it is growing. Anyhow, I have opened the file in massif-visualizer already and it is not showing any excessive memory usage at all. Maybe it is just fragmentation and not real memory usage as you mentioned, but I haven't noticed similar behaviour in any other of my containers; even a large PostgreSQL database docker sticks to 6GB. Not to mention the crashes when the memory usage grows too much.

QuantumEntangledAndy commented 5 months ago

I'm going to run some optimisations based on valgrind dhat and see if I can reduce allocations with some buffer pools. That should help a bit with the fragmentation.

The only trouble with this plan is that buffer pools work best with constant sizes, and the h264 frames from the camera are not constant. I've had an idea to create and insert filler NAL units to pad up to 4kb blocks; it seems to work with h264 but I can't test h265.
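
A minimal sketch of that padding idea, assuming Annex B style h264 where a filler-data NAL unit (type 12) can be appended after a start code. The block size and function name are illustrative, not neolink's actual code:

/// Pad an encoded frame up to the next 4KiB boundary by appending a
/// filler-data NAL unit (type 12), so every buffer handed to the pool has a
/// constant, block-aligned size.
fn pad_to_block(frame: &mut Vec<u8>, block: usize) {
    let start_code = [0u8, 0, 0, 1];
    // Overhead we append: start code + the NAL header byte for filler data
    // (forbidden_zero_bit=0, nal_ref_idc=0, nal_unit_type=12 => 0x0C).
    let overhead = start_code.len() + 1;
    let remainder = frame.len() % block;
    if remainder == 0 {
        return;
    }
    let mut pad = block - remainder;
    if pad < overhead {
        // Not enough room left in this block for a filler NAL; spill into the next one.
        pad += block;
    }
    frame.extend_from_slice(&start_code);
    frame.push(0x0C);
    // Filler payload is 0xFF bytes (a fully spec-compliant filler RBSP also
    // ends with RBSP trailing bits).
    frame.extend(std::iter::repeat(0xFF).take(pad - overhead));
}

fn main() {
    let mut frame = vec![0u8; 3000]; // stand-in for an encoded h264 frame
    pad_to_block(&mut frame, 4096);
    assert_eq!(frame.len() % 4096, 0);
}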

QuantumEntangledAndy commented 5 months ago

@fleaz this is your massif.out put through the visualiser

Screenshot 2024-04-11 at 11 11 30

Not seeing any spiking in there

QuantumEntangledAndy commented 5 months ago

I've pushed my buffer pool onto the test/valgrind branch and I will try to see what other things I could do to reduce fragmentation. It can be difficult to address though

QuantumEntangledAndy commented 5 months ago

It seems allocations of the gstreamer buffers are now reduced. There's still a lot in libcrypto, which is either the AES or the push notifications.

QuantumEntangledAndy commented 5 months ago

Seems that the libcrypto allocs are coming from the push notifications. Not sure why there are so many of them that they take more blocks than the gstreamer stuff.

If you can, please try the latest test/valgrind with:

[[cameras]]
# Other usual things
push_notifications = false

That should turn it off, and we can test whether this still has memory issues.

Dinth commented 5 months ago

It's still happening on the latest test-valgrind: image. So far I have only tested WITHOUT push_notifications = false. I will disable push notifications in the afternoon and try again.

Dinth commented 5 months ago

Same thing with push_notifications = false: image

fleaz commented 5 months ago

Apparently the problem is sometimes way worse than on other runs 🤔 I have created tooling to move the massif file after every run so it doesn't get overwritten each time, and I will try to provide you with a file from one of those bad runs. Maybe they are more helpful to you :)

The screenshot shows the 30min runs of the Valgrind container over the last 12h: 2024-04-11T22:19:35,833406447+02:00
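
A hedged sketch of such rotation tooling, as mentioned above (the paths and naming scheme are assumptions, not fleaz's actual script):

#!/bin/sh
# After each 30min valgrind run finishes, stash the output with a timestamp
# so the next run does not overwrite it.
mv /valgrind/massif.out "/valgrind/massif_$(date +%F_%H%M).out"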

QuantumEntangledAndy commented 5 months ago

I'll need a massif.out of it. The graph by itself is not helpful, but the massif tool tracks the allocations and deallocations and shows me what part of the code is allocating the memory.

QuantumEntangledAndy commented 5 months ago

Might have found something. I added some prints to debug the length of all the various buffers that are used.

One of the buffers went up to a length of 87808 overnight, which is much too large. The code is only meant to hold 15s of buffered time, and it does this by filtering out all packets that are >15s behind the last received frame.

I think the issue, though, is how Reolink timestamps things; I suspect it gave me lots of frames in the future that aren't being pruned. I'll see if I can buffer this a different way.
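
A minimal sketch of that pruning idea, assuming a plain Vec of timestamped packets; the point is to drop frames that are too far in the future as well as too far in the past. The names and the future tolerance are illustrative, not neolink's actual buffering code:

use std::time::Duration;

struct Buffered {
    ts: Duration, // camera-reported timestamp of the packet
    #[allow(dead_code)]
    data: Vec<u8>, // frame payload (unused in this sketch)
}

/// Keep roughly 15s of history around the most recently received timestamp.
/// Pruning only "too old" packets lets bogus future timestamps pile up, so
/// packets far ahead of `latest` are dropped too.
fn prune(buffer: &mut Vec<Buffered>, latest: Duration) {
    let history = Duration::from_secs(15);
    let future_tolerance = Duration::from_secs(15); // illustrative value
    buffer.retain(|p| {
        let too_old = latest.checked_sub(p.ts).map_or(false, |age| age > history);
        let too_new = p.ts.checked_sub(latest).map_or(false, |ahead| ahead > future_tolerance);
        !too_old && !too_new
    });
}

fn main() {
    let mut buf = vec![
        Buffered { ts: Duration::from_secs(100), data: vec![] },
        Buffered { ts: Duration::from_secs(10), data: vec![] },  // far in the past
        Buffered { ts: Duration::from_secs(500), data: vec![] }, // bogus future timestamp
    ];
    prune(&mut buf, Duration::from_secs(100));
    assert_eq!(buf.len(), 1); // only the in-window packet survives
}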

QuantumEntangledAndy commented 5 months ago

If you could test the next build, that would be good. It has a lot of debug prints to show all the buffer lengths, but hopefully the adjustment to the history length will help.

fleaz commented 5 months ago

Thanks for diving into this and trying to find the problem!

Here is one of the "bad" runs according to my memory graph in Grafana. Hope it helps: massif_2024-04-17_2025.txt

I also updated the container and will report if I see any change

QuantumEntangledAndy commented 5 months ago

Any updates since the last build? I think it's addressed, and if so I'd like to push a release.

Dinth commented 5 months ago

Much better on my side. After 4h: image

fleaz commented 5 months ago

Pulled a new image (quantumentangledandy/neolink:latest) yesterday, and after ~20h the container is at 400MB memory usage. Still feels a bit high for what it does, but the actual leak appears to be gone. Great work, and thanks for investing so much time and effort into this 🥳