containers / netavark

Container network stack
Apache License 2.0

Thread leak in netavark-dhcp-proxy #811

Open jsonn opened 1 year ago

jsonn commented 1 year ago

Using SuSE MicroOS with a bunch of macvlan-using containers, I see netavark-dhcp-proxy hanging every few days. From journalctl:

netavark[14606]: thread 'tokio-runtime-worker' panicked at 'failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }', /home/abuild/rpmbuild/BUILD/rustc-1.71.1-src/library/std/src/thread/mod.rs:686:29

Even with RUST_BACKTRACE=1 set, it doesn't give a backtrace. Last time this happened, ps reported over 4000 threads for the PID.

Luap99 commented 1 year ago

How many macvlan containers are we talking about? Do you know how long your DHCP lease time is?

jsonn commented 1 year ago

16 containers at the moment, with a lease time of 10 minutes.

Luap99 commented 1 year ago

OK, I think that explains why it leaks so fast. I think we spawn a new thread for each lease, but somehow the code does not clean up the old one, so the old thread leaks. I will take a look.
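To illustrate the suspected pattern, here is a minimal sketch of per-lease workers that are stopped before a replacement is spawned. It is not netavark's code: the proxy runs on tokio, while this uses plain `std::thread`, and `LeaseWorkers`, `renew`, and `live` are hypothetical names chosen for illustration. If `renew` skipped the stop-and-join step, every renewal would leave one more thread behind.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical registry holding one worker thread per container lease.
struct LeaseWorkers {
    workers: HashMap<String, (Sender<()>, JoinHandle<()>)>,
}

impl LeaseWorkers {
    fn new() -> Self {
        Self { workers: HashMap::new() }
    }

    /// Start a worker for `id`. Any previous worker for the same id is
    /// signalled to stop and joined first, so it cannot leak.
    fn renew(&mut self, id: &str) {
        if let Some((stop, handle)) = self.workers.remove(id) {
            let _ = stop.send(());
            let _ = handle.join();
        }
        let (tx, rx) = mpsc::channel::<()>();
        let handle = thread::spawn(move || {
            // Stand-in for lease handling: block until told to stop
            // (or until the sender is dropped).
            let _ = rx.recv();
        });
        self.workers.insert(id.to_string(), (tx, handle));
    }

    /// Number of workers currently tracked.
    fn live(&self) -> usize {
        self.workers.len()
    }
}

fn main() {
    let mut workers = LeaseWorkers::new();
    // Simulate 100 renewals for a single container.
    for _ in 0..100 {
        workers.renew("container-a");
    }
    println!("live workers: {}", workers.live());
}
```

With the cleanup in place, the number of tracked workers stays at one per container no matter how often the lease is renewed.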

jsonn commented 6 months ago

Any news?

Luap99 commented 6 months ago

No, I haven't found the time to reproduce this issue.

Jackbaude commented 5 months ago

I can take a look at this issue. Can someone point me in the right direction to reproduce this?

jsonn commented 5 months ago

Use macvlan and a DHCP server with as short a lease as is reasonable, e.g. one minute, then observe the number of threads over time.

Luap99 commented 5 months ago

Yes, checking ls /proc/$pidOfProxy/task/ over time should show the leak, I guess.
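The same check can be scripted. This is a minimal sketch, assuming a Linux /proc filesystem; `count_threads` is a hypothetical helper and not part of netavark. It counts the entries under /proc/<pid>/task, one per thread.

```rust
use std::fs;
use std::io;

/// Count the threads of a process by listing /proc/<pid>/task (Linux only).
/// Hypothetical helper for illustration, not part of netavark.
fn count_threads(pid: u32) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(format!("/proc/{}/task", pid))? {
        entry?; // propagate errors from reading individual entries
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // Uses this process's own PID as a demonstration;
    // substitute the PID of the dhcp-proxy to watch for the leak.
    let pid = std::process::id();
    println!("pid {}: {} threads", pid, count_threads(pid)?);
    Ok(())
}
```

Running it periodically against the proxy's PID and comparing the counts should show whether the number keeps growing.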

baude commented 3 months ago

I am now able to replicate. I started 10 containers on a network where the lease is only 60 seconds. In my case, the nv dhcp-proxy PID is 6808 and after a short while:

Threads:    552

jjzazuet commented 2 months ago

Ah, just noticed this issue. Could this be related? My DHCP lease time is 30 mins.

https://github.com/containers/netavark/issues/1024

Thanks!

thecubic commented 2 months ago

I definitely have this thread leak: there were 13708 threads for ~15 containers after 3 days of running, and I was also seeing #618 as a symptom (of thread starvation, I assume). I have the underlying pattern (IPv6 multicast on an IPv4 network).

I updated past the fix for that specific symptom, and I'm watching how many threads it creates long-term.

thecubic commented 2 months ago

My thread leak seems "better, but not totally fixed". I have 1497 threads after 6 days (post #1022) versus the 13708 after 3 days.

Importantly, the dhcp-proxy is not spinning CPU right now, and my core symptom (restarting containers sometimes caused DHCP task aborts) is gone.
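For a rough sense of the difference, the two observations can be turned into average leak rates. This sketch assumes the thread count started near zero and grew roughly linearly, which the numbers above do not confirm; `threads_per_day` is a hypothetical helper.

```rust
/// Average number of threads accumulated per day.
/// Assumes the count started near zero and grew linearly.
fn threads_per_day(threads: f64, days: f64) -> f64 {
    threads / days
}

fn main() {
    let before = threads_per_day(13708.0, 3.0); // before the update
    let after = threads_per_day(1497.0, 6.0); // after #1022
    println!("before: {:.1} threads/day", before); // 4569.3
    println!("after:  {:.1} threads/day", after); // 249.5
    println!("ratio:  {:.1}x", before / after); // 18.3
}
```

Under those assumptions the leak rate dropped by roughly a factor of 18, which fits "better, but not totally fixed".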