Closed bchavez closed 5 years ago
I tried only using passphrases and the problem persists on Linux.
I added a few more nodes to the network, now 7 windows PC nodes connected 1 Linux server.
If you bring up each node one by one. The problem seems to begin around adding the 5th or 6th node to the Freelan network in a star topology...
One way I've found to reliably reproduce the memory leak is to:
Also the CPU tends to jam at 100% sometimes. All very strange. Let me know if there's anything else I can do to help.
Seems like windows is not immune.
I swapped the center linux hub node with a windows node, and problem persists with a windows node/hub:
830M+ memory usage and continues to climb.
Below is a dump of the center hub (windows) when memory explodes:
C:\Program Files\FreeLAN>bin\freelan -t 1 -d -c config\freelan.cfg
2015-05-10T03:48:13.104035 [INFORMATION] Reading configuration file at: "C:\Program Files\FreeLAN\config\freelan.cfg"
2015-05-10T03:48:13.140037 [INFORMATION] Deriving pre-shared key from passphrase...
2015-05-10T03:48:13.158038 [DEBUG] Opening core...
2015-05-10T03:48:13.169038 [INFORMATION] Enabling pre-shared key authentication.
2015-05-10T03:48:13.186039 [INFORMATION] Starting FSCP server...
2015-05-10T03:48:13.203040 [IMPORTANT] Core set to listen on: 0.0.0.0:12000
2015-05-10T03:48:13.220041 [INFORMATION] Building CA store...
2015-05-10T03:48:13.234042 [INFORMATION] FSCP server started.
2015-05-10T03:48:13.251043 [IMPORTANT] Tap adapter "FREELAN" opened in mode tap with a MTU set to: 1446
2015-05-10T03:48:13.269044 [IMPORTANT] MSS override enabled with a value of: 1392
2015-05-10T03:48:13.286045 [INFORMATION] IPv4 address: 25.0.0.3/8
2015-05-10T03:48:13.303046 [INFORMATION] No IPv6 address configured.
2015-05-10T03:48:14.623122 [INFORMATION] Setting interface metric to: 3
2015-05-10T03:48:14.633122 [INFORMATION] Putting interface into the connected state.
2015-05-10T03:48:14.651123 [INFORMATION] The DHCP proxy is enabled.
2015-05-10T03:48:14.667124 [INFORMATION] Not advertising any routes.
2015-05-10T03:48:14.682125 [INFORMATION] Advertising the following DNS servers: 25.0.0.5
2015-05-10T03:48:14.701126 [DEBUG] Core opened.
2015-05-10T03:48:14.710127 [INFORMATION] Using 1 thread(s).
2015-05-10T03:48:14.723127 [IMPORTANT] Execution started.
2015-05-10T03:48:14.701126 [INFORMATION] Starting tap adapter's thread...
2015-05-10T03:48:14.735128 [DEBUG] Thread #0 started.
2015-05-10T03:48:15.028145 [TRACE] Received a data message from 192.168.0.100:12000 but no session exists. Ignoring.
2015-05-10T03:48:18.063318 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:18.165324 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:18.363336 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:19.112378 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:19.464399 [TRACE] Received a data message from AAA.BB.CC.53:12000 but no session exists. Ignoring.
2015-05-10T03:48:19.864421 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:20.633465 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:20.735471 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:20.934483 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:21.684526 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:29.320962 [DEBUG] Received HELLO_REQUEST from AAA.BB.CC.54:12000.
2015-05-10T03:48:29.354964 [TRACE] Received a SESSION_REQUEST from AAA.BB.CC.54:12000 but no presentation is available. Ignoring.
2015-05-10T03:48:29.373965 [DEBUG] Received PRESENTATION from AAA.BB.CC.54:12000 using pre-shared key authentication.
2015-05-10T03:48:29.396967 [INFORMATION] Accepting PRESENTATION from AAA.BB.CC.54:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:29.420968 [DEBUG] Sending SESSION_REQUEST to AAA.BB.CC.54:12000.
2015-05-10T03:48:29.436969 [TRACE] Sending session request message to AAA.BB.CC.54:12000 (next_session_number: 0, local_host_identifier: f870c8a2f771e1782f794434afd5ce31340a52e7007554eecf340365435d7848)
2015-05-10T03:48:29.505973 [DEBUG] Received SESSION from AAA.BB.CC.54:12000 (default: accept).
2015-05-10T03:48:29.521974 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:29.539975 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:29.558976 [TRACE] Received a SESSION from AAA.BB.CC.54:12000 with session number 0 but no session was prepared yet. Preparing a new one.
2015-05-10T03:48:29.612979 [TRACE] Session established with AAA.BB.CC.54:12000. Sending acknowledgement session message back.
2015-05-10T03:48:29.633980 [TRACE] Sending session message to AAA.BB.CC.54:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:29.665982 [IMPORTANT] Session established with AAA.BB.CC.54:12000.
2015-05-10T03:48:29.682983 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:29.698984 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:29.711985 [DEBUG] Sending routes request to AAA.BB.CC.54:12000.
2015-05-10T03:48:29.727986 [INFORMATION] Added system route: Network interface #14 - AAA.BB.CC.54/32 => 192.168.0.1 - metric 0
2015-05-10T03:48:29.786989 [TRACE] Received a SESSION from AAA.BB.CC.54:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:29.818991 [DEBUG] Received routes request from AAA.BB.CC.54:12000. Replying with version 0:
2015-05-10T03:48:29.841992 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.54:12000: version 0 (), (25.0.0.5).
2015-05-10T03:48:37.350422 [DEBUG] Received HELLO_REQUEST from AAA.BB.CC.55:12000.
2015-05-10T03:48:37.378423 [DEBUG] Received PRESENTATION from AAA.BB.CC.55:12000 using pre-shared key authentication.
2015-05-10T03:48:37.398424 [INFORMATION] Accepting PRESENTATION from AAA.BB.CC.55:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:37.421426 [DEBUG] Sending SESSION_REQUEST to AAA.BB.CC.55:12000.
2015-05-10T03:48:37.437427 [TRACE] Sending session request message to AAA.BB.CC.55:12000 (next_session_number: 0, local_host_identifier: 76f611ccf832b4d8cb7c94a9a45289a9b55b383601d60a2a8d397d765cabd696)
2015-05-10T03:48:37.466428 [DEBUG] Received SESSION_REQUEST from AAA.BB.CC.55:12000 (default: accept).
2015-05-10T03:48:37.484429 [DEBUG] Cipher suites capabilities: ecdhe_rsa_aes256_gcm_sha384 ecdhe_rsa_aes128_gcm_sha256
2015-05-10T03:48:37.508431 [DEBUG] Elliptic curve capabilities: sect571k1 secp384r1
2015-05-10T03:48:37.522431 [TRACE] Received a SESSION_REQUEST from AAA.BB.CC.55:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384_sect571k1. No current session exist: preparing one and sending it.
2015-05-10T03:48:37.569434 [TRACE] Sending session message to AAA.BB.CC.55:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:37.646438 [DEBUG] Received SESSION from AAA.BB.CC.55:12000 (default: accept).
2015-05-10T03:48:37.661439 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:37.677440 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:37.699442 [TRACE] Session established with AAA.BB.CC.55:12000. Sending acknowledgement session message back.
2015-05-10T03:48:37.718443 [TRACE] Sending session message to AAA.BB.CC.55:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:37.751445 [IMPORTANT] Session established with AAA.BB.CC.55:12000.
2015-05-10T03:48:37.767445 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:37.785446 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:37.799447 [DEBUG] Sending routes request to AAA.BB.CC.55:12000.
2015-05-10T03:48:37.814448 [INFORMATION] Added system route: Network interface #14 - AAA.BB.CC.55/32 => 192.168.0.1 - metric 0
2015-05-10T03:48:37.835449 [TRACE] Received a SESSION from AAA.BB.CC.55:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:37.872451 [DEBUG] Received routes request from AAA.BB.CC.55:12000. Replying with version 0:
2015-05-10T03:48:37.892453 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.55:12000: version 0 (), (25.0.0.5).
2015-05-10T03:48:39.508545 [DEBUG] Received HELLO_REQUEST from AAA.BB.CC.53:12000.
2015-05-10T03:48:39.538547 [TRACE] Received a SESSION_REQUEST from AAA.BB.CC.53:12000 but no presentation is available. Ignoring.
2015-05-10T03:48:39.560548 [DEBUG] Received PRESENTATION from AAA.BB.CC.53:12000 using pre-shared key authentication.
2015-05-10T03:48:39.580549 [INFORMATION] Accepting PRESENTATION from AAA.BB.CC.53:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:39.604550 [DEBUG] Sending SESSION_REQUEST to AAA.BB.CC.53:12000.
2015-05-10T03:48:39.620551 [TRACE] Sending session request message to AAA.BB.CC.53:12000 (next_session_number: 0, local_host_identifier: 232637cbc0fa3e41d911adbb86e82454bc4f873d8b1910b8261f9c58425038c2)
2015-05-10T03:48:39.692556 [DEBUG] Received SESSION from AAA.BB.CC.53:12000 (default: accept).
2015-05-10T03:48:39.706556 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:39.720557 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:39.731558 [TRACE] Received a SESSION from AAA.BB.CC.53:12000 with session number 0 but no session was prepared yet. Preparing a new one.
2015-05-10T03:48:39.774560 [TRACE] Session established with AAA.BB.CC.53:12000. Sending acknowledgement session message back.
2015-05-10T03:48:39.795561 [TRACE] Sending session message to AAA.BB.CC.53:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:39.827563 [IMPORTANT] Session established with AAA.BB.CC.53:12000.
2015-05-10T03:48:39.855565 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:39.875566 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:39.890567 [DEBUG] Sending routes request to AAA.BB.CC.53:12000.
2015-05-10T03:48:39.907568 [INFORMATION] Added system route: Network interface #14 - AAA.BB.CC.53/32 => 192.168.0.1 - metric 0
2015-05-10T03:48:39.975572 [TRACE] Received a SESSION from AAA.BB.CC.53:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:40.007574 [DEBUG] Received routes request from AAA.BB.CC.53:12000. Replying with version 0:
2015-05-10T03:48:40.026575 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.53:12000: version 0 (), (25.0.0.5).
2015-05-10T03:48:41.442656 [DEBUG] Received HELLO_REQUEST from AAA.BB.CC.35:12000.
2015-05-10T03:48:41.485658 [TRACE] Received a SESSION_REQUEST from AAA.BB.CC.35:12000 but no presentation is available. Ignoring.
2015-05-10T03:48:41.504659 [DEBUG] Received PRESENTATION from AAA.BB.CC.35:12000 using pre-shared key authentication.
2015-05-10T03:48:41.522660 [INFORMATION] Accepting PRESENTATION from AAA.BB.CC.35:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:41.544661 [DEBUG] Sending SESSION_REQUEST to AAA.BB.CC.35:12000.
2015-05-10T03:48:41.558662 [TRACE] Sending session request message to AAA.BB.CC.35:12000 (next_session_number: 0, local_host_identifier: 0182b8d34a6a2775708a3c0876f854e9e3d45a8dab8bf37242417d367376f937)
2015-05-10T03:48:41.627666 [DEBUG] Received SESSION from AAA.BB.CC.35:12000 (default: accept).
2015-05-10T03:48:41.641667 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:41.656668 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:41.668669 [TRACE] Received a SESSION from AAA.BB.CC.35:12000 with session number 0 but no session was prepared yet. Preparing a new one.
2015-05-10T03:48:41.714671 [TRACE] Session established with AAA.BB.CC.35:12000. Sending acknowledgement session message back.
2015-05-10T03:48:41.733672 [TRACE] Sending session message to AAA.BB.CC.35:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:41.760674 [IMPORTANT] Session established with AAA.BB.CC.35:12000.
2015-05-10T03:48:41.777675 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:41.793676 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:41.808677 [DEBUG] Sending routes request to AAA.BB.CC.35:12000.
2015-05-10T03:48:41.825678 [INFORMATION] Added system route: Network interface #14 - AAA.BB.CC.35/32 => 192.168.0.1 - metric 0
2015-05-10T03:48:41.890681 [TRACE] Received a SESSION from AAA.BB.CC.35:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:41.924683 [DEBUG] Received routes request from AAA.BB.CC.35:12000. Replying with version 0:
2015-05-10T03:48:41.947685 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.35:12000: version 0 (), (25.0.0.5).
2015-05-10T03:48:45.352879 [DEBUG] Received HELLO_REQUEST from 192.168.0.100:12000.
2015-05-10T03:48:45.366880 [DEBUG] Received PRESENTATION from 192.168.0.100:12000 using pre-shared key authentication.
2015-05-10T03:48:45.378881 [INFORMATION] Accepting PRESENTATION from 192.168.0.100:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:45.396882 [DEBUG] Sending SESSION_REQUEST to 192.168.0.100:12000.
2015-05-10T03:48:45.413883 [TRACE] Sending session request message to 192.168.0.100:12000 (next_session_number: 0, local_host_identifier: 60463ef586abf360f9b8294fb7b6a5991cbcf5d2875a78c54bad6570549706d7)
2015-05-10T03:48:45.446885 [DEBUG] Received SESSION_REQUEST from 192.168.0.100:12000 (default: accept).
2015-05-10T03:48:45.464886 [DEBUG] Cipher suites capabilities: ecdhe_rsa_aes256_gcm_sha384 ecdhe_rsa_aes128_gcm_sha256
2015-05-10T03:48:45.485887 [DEBUG] Elliptic curve capabilities: sect571k1 secp384r1
2015-05-10T03:48:45.503888 [TRACE] Received a SESSION_REQUEST from 192.168.0.100:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384_sect571k1. No current session exist: preparing one and sending it.
2015-05-10T03:48:45.542890 [TRACE] Sending session message to 192.168.0.100:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:45.577892 [DEBUG] Received SESSION from 192.168.0.100:12000 (default: accept).
2015-05-10T03:48:45.593893 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:45.609894 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:45.630895 [TRACE] Session established with 192.168.0.100:12000. Sending acknowledgement session message back.
2015-05-10T03:48:45.691899 [TRACE] Sending session message to 192.168.0.100:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:45.722900 [IMPORTANT] Session established with 192.168.0.100:12000.
2015-05-10T03:48:45.739901 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:45.755902 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:45.769903 [DEBUG] Sending routes request to 192.168.0.100:12000.
2015-05-10T03:48:45.785904 [INFORMATION] Added system route: Network interface #14 - 192.168.0.100/32 - metric 0
2015-05-10T03:48:45.803905 [TRACE] Received a SESSION from 192.168.0.100:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:45.836907 [DEBUG] Received routes request from 192.168.0.100:12000. Replying with version 0:
2015-05-10T03:48:45.856908 [DEBUG] Sending routes and DNS servers to 192.168.0.100:12000: version 0 (), (25.0.0.5).
2015-05-10T03:49:04.794991 [DEBUG] Received routes request from AAA.BB.CC.53:12000. Replying with version 0:
2015-05-10T03:49:04.858995 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.53:12000: version 0 (), (25.0.0.5).
2015-05-10T03:49:29.098381 [DEBUG] Received routes request from AAA.BB.CC.35:12000. Replying with version 0:
2015-05-10T03:49:29.118383 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.35:12000: version 0 (), (25.0.0.5).
2015-05-10T03:49:29.372397 [DEBUG] Received routes request from AAA.BB.CC.54:12000. Replying with version 0:
2015-05-10T03:49:29.382398 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.54:12000: version 0 (), (25.0.0.5).
2015-05-10T03:49:37.265849 [DEBUG] Received routes request from AAA.BB.CC.55:12000. Replying with version 0:
2015-05-10T03:49:37.281849 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.55:12000: version 0 (), (25.0.0.5).
2015-05-10T03:50:45.347743 [DEBUG] Received routes request from 192.168.0.100:12000. Replying with version 0:
2015-05-10T03:50:45.357743 [DEBUG] Sending routes and DNS servers to 192.168.0.100:12000: version 0 (), (25.0.0.5).
2015-05-10T03:51:13.178334 [DEBUG] Sending routes request to all hosts.
2015-05-10T03:52:04.652279 [WARNING] Signal caught (2): exiting...
2015-05-10T03:52:04.798287 [DEBUG] Closing core...
2015-05-10T03:52:05.099304 [INFORMATION] Tap adapter's thread is now stopped.
2015-05-10T03:52:06.191367 [INFORMATION] Closing FSCP server...
2015-05-10T03:52:06.632392 [INFORMATION] FSCP server closed.
2015-05-10T03:52:06.640392 [DEBUG] Core closed.
2015-05-10T03:52:08.147478 [INFORMATION] Removed system route: Network interface #14 - AAA.BB.CC.55/32 => 192.168.0.1 - metric 0
2015-05-10T03:52:08.172480 [INFORMATION] Removed system route: Network interface #14 - AAA.BB.CC.54/32 => 192.168.0.1 - metric 0
2015-05-10T03:52:08.194481 [INFORMATION] Removed system route: Network interface #14 - AAA.BB.CC.53/32 => 192.168.0.1 - metric 0
2015-05-10T03:52:08.221483 [INFORMATION] Removed system route: Network interface #14 - AAA.BB.CC.35/32 => 192.168.0.1 - metric 0
2015-05-10T03:52:08.488498 [INFORMATION] Removed system route: Network interface #14 - 192.168.0.100/32 - metric 0
@bchavez Thanks for the thorough bug report.
I'll give a serious look at this today and keep you posted.
My hunch is that this has something to do with the new memory allocation scheme. Memory chunks are pre-allocated in a pool but if the pool is exhausted, then FreeLAN falls back to dynamic allocation. Dynamic blocks are added to the pool but never removed, resulting in a high usage of memory that never goes down.
@bchavez Could please detail the setup you used to reproduce this ?
I started 6 instances on the same Linux box (one central - 5 clients) and ping flooded the whole mesh but couldn't notice a memory increase.
I also tried with an OS X instance and Windows instances both connected to the same mesh I described, but couldn't notice a memory increase either.
I'm convinced there is something to improve there, but without any way of reproducing, it's going to be hard to do.
@bchavez I still couldn't reproduce the issue, but made an attempt to fix it anyway.
Could you please try the memory_usage
branch ?
Hi @ereOn,
Thanks so much for your help. Indeed, very strange. I'll try everything to help.
The setup topology is a star topology, specifically no nodes are allowed to talk to each other, only through a centralized hub node.
The overall topology is similar to:
[5 win nodes]
| (internet)
(firewall with port forwarding: 12000)
[Linux Hub]
/ \ (local LAN)
[win node] [win node]
(please excuse my weak attempt at ASCII art)
My configuration files:
Linux Hub:
[client]
public_endpoint=0.0.0.0
[fscp]
accept_contacts=no
[tap_adapter]
ipv4_address_prefix_length=25.0.0.5/8
[switch]
relay_mode_enabled=yes
[router]
local_dns_server=25.0.0.5
[security]
passphrase=hello
passphrase_salt=world
#Notes: Specifically removing ipv6, all freelan nodes have IPs that begin with 25.*.*.*
#The rest of the settings are default values left commented and not overridden
Windows node:
[client]
public_endpoint=0.0.0.0
[fscp]
contact=linux.hub.com:12000
accept_contact_requests=no
accept_contacts=no
[tap_adapter]
ipv4_address_prefix_length=25.0.0.201/8
[security]
passphrase=hello
passphrase_salt=world
# Below was used when testing with certificates:
#signature_certificate_file="C:\Program Files\FreeLAN\config\node01.crt"
#signature_private_key_file="C:\Program Files\FreeLAN\config\node01.key"
#authority_certificate_file="C:\Program Files\FreeLAN\config\ca.crt"
Other windows nodes are various increments to the IP and certificate path.
Also, I do have a few other windows nodes that I didn't update yet. So, maybe a possibility that the Freelan 1.1 nodes are interfering (DoS-ing) the Linux Hob through incompatible protocol verbiage? I'm not really sure, just trying to get you every detail I have.
Sure thing, I'll try recompiling the memory_branch
and let you know how it goes.
Edit: The Linux Hub is Ubuntu 14.04.2 The Windows nodes are a various mix of Windows Server and Windows 7.
@bchavez Don't bother right now: this commit introduces a bug that I'm trying to fix.
@ereOn Sure thing, whenever you're ready I'll be happy to help. Totally understandable, it's quite frustrating not being able to reproduce the issue. :(
@bchavez Okay, I hardcoded some limits in the memory allocation.
Could you test it ?
Yes sir, will test right now. BTW, I'm on IRC channel if needed.
Was this issue resolved?
@sbeckwithiii It might have been. There is actually no confirmation that the issue was real nor that the "fix" changed anything.
Hi! I haven't seen memory blows of freelan processes since i've discovered problem of selft-contact. Can you try to block local ip from contact list? "ipv4_address_prefix_length=10.18.0.10/23" - here you assign local ip "never_contact=10.18.0.10" - here you deny contacting local ip. also can you check #83
hi! I got this memory leak! I had freelan crashing, so reading this topic I've increased my swapfile by 50 GBs, but I do not remember what I've changed prior to problem appearing. I've got this leak several times before deciding to report it. I'm not changing my configurations for easier analysis.
How can I collect and send you debug data? I'm not a developer, but admin. If you will provide me step by step instruction, I will send it to you.
Does it happens with latest code from master repository or with some official version / Debian package ?
@s-vincent , for me the last time I checked, this happened with the Freelan 2.0 official release. The problem existed in both Windows and Linux Debian release apt
package. IIRC I also tested some versions built by @ereOn that had memory modifications presumably built from /master
at the time and the problem still existed.
Since the time of these tests, I've downgraded to Freelan 1.1 for now and everything has been somewhat stable. IIRC, the network throughput in Freelan 1.0 about an order of magnitude faster than 1.1; however, there was no way for me to get back to Freelan 1.0 Linux wise because the binaries in the apt-get
release 1.0 site were wiped and there's no GitHub source tag for 1.0.
@bchavez Any chance for you to test the latest git revision (master branch) and see if memory leak still happens ?
clearing 2.1 milestone until this can be successfully reproduced
Hi @s-vincent , sorry for the late reply. I'll be happy to test this if there are some windows binaries available. IIRC, it was painful last time I built freelan
from source to produce its associated binaries. It's been a while and I don't remember all the intricacies.
I have not binaries right now. Please see one of my comment of issue #93, I added step-by-step instructions for Windows some days ago. Only addition is to install cmake package.
as I mentioned in comment https://github.com/freelan-developers/freelan/issues/93#issuecomment-321739161 I think memory leak always caused by same windows machine.
To configure freelan I use sometimes WinSCP and was surprised by some troubles I've encountered long ago. There is an invisible symbol at the end of the line or something like that. Presence of this symbol at the end of the line ipv4_address_prefix_length= causes error but I do not see any instant effects of this symbol on lines signature_certificate_file= or signature_private_key_file= Recently I've used command "unix2dos freelan.cfg" on the windows machine causing problem and do not have leakage since.
Can you check effect of that symbol in ssl library?
What @richman1000000 is describing sounds like an incompatibility in CRLF to denote end-of-line for the config file in Windows. @ereOn and @s-vincent Can freelan or one of the libraries be thrown off the rails with an improper line-ending? @bchavez Do you write your config files in Windows or Ubuntu? Can you check the format of the line-ending (file will do that on Unix).
https://stackoverflow.com/questions/1552749/difference-between-cr-lf-lf-and-cr-line-break-types
@leggewie, for Windows the ipv4/6_address_prefix_length parameter goes to a netsh.exe command. On Linux it goes to netlink socket.
hello! Do you think freelan is affected by this parameter? /sys/kernel/mm/transparent_hugepage/enabled
@bchavez @richman1000000 Do you still experiment memory leak in latest 2.1 / master branch ?
current version I use is freelan 2.1-1 I did not had memory leak for half of year, but I always specify on all node like this
ipv4_address_prefix_length=VPNself_ip/prefix never_contact=VPNself_ip
TBH, I haven't had time to compile and test 2.1. :( I might revisit this next time I plan an upgrade. For now, I haven't moved off 2.0 stable (albeit with degraded performance compared to 1.0). I'm currently in a situation where I "can't/don't touch it" right now.
I will close this issue as fixed, then. Please ping us to reopen if you experience this again. Thank you for bringing this to our attention.
Unfortunately, I tested the latest branch. The memory leak still exists.
I'd appreciate if this issue was reopened.
Here is a video screen capture of the issue as it happened:
A little more about the topology where this happens. I have a setup similar to this:
The memory leak is happening on Linux A
. Sometimes high CPU loads on Windows F
.
Basically,
192.168.0.0
contact=Linux A
.contact=Linux A
, Windows C,D,E,F
.Linux E,F,G
don't have Internet, so, they need to be relayed by contact=Windows F
.Windows F
in turn, contact=Linux A
for connectivity.One way I've been able to reproduce the issue is:
/usr/bin/freelan
or freelan.exe
).Linux A
.Linux A
. When Linux A
comes online, it is hammered by everyone for connection requests. Debug log messages on Linux A
like this are common to see:
2018-10-20T20:44:15.576079 [TRACE] Received a data message from WindowsA:12021 but no session exists. Ignoring.
Linux A
becomes unresponsive. Memory leak.Kinda of feel there's a threading/concurrency/network state issue going on someplace. Soon as Linux A
comes online; it's hammered with connection requests from other nodes, but the other nodes on the network still "remember" previous network state/sessions before Linux A
went down and are trying to push data down a non-existent session on Linux A
. When Linux A
comes back with a fresh slate it grinds to a halt.
The network is no longer recoverable. All nodes need to be shut down and brought back up one-by-one, starting with Linux A
.
Please let me know if there's anything I can do to help provide more information. I'll try @richman1000000 's suggestion by adding never_connect=self_ip
and see if that helps.
Also, I'm thinking the Windows TAP9 adapter isn't properly cleared before starting up a new freelan.exe
. Soon as I started Windows C
's FreeLAN service, Linux A
received:
2018-10-20T22:19:52.227225 [TRACE] Received a data message from WindowsC:12021 but its sequence number is outdated (received: 3869, expecting: 4091). Ignoring.
2018-10-20T22:19:52.227474 [TRACE] Received a data message from WindowsC:12021 but its sequence number is outdated (received: 3870, expecting: 4091). Ignoring.
2018-10-20T22:19:52.227843 [TRACE] Received a data message from WindowsC:12021 but its sequence number is outdated (received: 3876, expecting: 4091). Ignoring.
Soon as Linux A
received this message, a memory leak occurred.
I tried @richman1000000 's suggestion of using never_contact=self_ip
and no luck.
I also tried perf
CPU profiling tools right at the moment when high CPU usage was experienced w/ memory leak. Nothing really sticks out to me... here's a dump of freelan-debug
executable (compiled with -g
debug compiler switch) executed with:
perf record -g ./freelan-debug -f -d -c freelan.cfg
Samples: 299K of event 'cpu-clock', Event count (approx.): 74841000000
+ 4.59% freelan-debug [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 2.72% freelan-debug libc-2.19.so [.] _int_malloc
+ 2.34% freelan-debug libc-2.19.so [.] _int_free
+ 2.27% freelan-debug libc-2.19.so [.] malloc
+ 1.92% freelan-debug [kernel.kallsyms] [k] __do_softirq
+ 1.60% freelan-debug [kernel.kallsyms] [k] e1000_xmit_frame
+ 1.26% freelan-debug [kernel.kallsyms] [k] finish_task_switch
+ 1.22% freelan-debug freelan-debug [.] boost::detail::atomic_increment(int*)
+ 1.19% freelan-debug libcrypto.so.1.0.0 [.] 0x00000000000e90e5
+ 1.07% freelan-debug freelan-debug [.] boost::detail::atomic_exchange_and_add(int*, int)
+ 1.00% freelan-debug freelan-debug [.] boost::detail::shared_count::shared_count(boost::detail::shared_count const&)
+ 0.92% freelan-debug freelan-debug [.] boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::vo
+ 0.89% freelan-debug libc-2.19.so [.] free
+ 0.84% freelan-debug freelan-debug [.] boost::detail::function::functor_manager<boost::function<void (boost::system::error_code)> >::manager(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail
+ 0.81% freelan-debug [kernel.kallsyms] [k] sys_epoll_ctl
+ 0.76% freelan-debug freelan-debug [.] boost::detail::shared_count::~shared_count()
+ 0.72% freelan-debug freelan-debug [.] boost::detail::function::functor_manager<boost::function<void (boost::system::error_code)> >::manage(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail:
+ 0.67% freelan-debug freelan-debug [.] boost::function1<void, boost::system::error_code const&>::clear()
+ 0.63% freelan-debug freelan-debug [.] boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_po
+ 0.62% freelan-debug freelan-debug [.] boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_po
+ 0.57% freelan-debug freelan-debug [.] boost::function1<void, boost::system::error_code>::clear()
+ 0.53% freelan-debug freelan-debug [.] boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::vo
+ 0.50% freelan-debug freelan-debug [.] boost::function1<void, boost::system::error_code>::assign_to_own(boost::function1<void, boost::system::error_code> const&)
Samples: 299K of event 'cpu-clock', Event count (approx.): 74841000000
- 4.59% freelan-debug [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore ?
- _raw_spin_unlock_irqrestore ¦
- 89.23% mptspi_qcmd ¦
scsi_dispatch_cmd ¦
scsi_request_fn ¦
+ __blk_run_queue ¦
+ 2.93% try_to_wake_up ¦
+ 1.97% __wake_up_sync_key ¦
+ 1.75% pagevec_lru_move_fn ¦
+ 1.49% ep_poll ¦
+ 0.76% ep_scan_ready_list.isra.9 ¦
+ 0.71% __skb_recv_datagram ¦
- 2.72% freelan-debug libc-2.19.so [.] _int_malloc ¦
_int_malloc ¦
- 2.34% freelan-debug libc-2.19.so [.] _int_free ¦
- _int_free ¦
+ 43.64% boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type¦
+ 26.27% boost::detail::function::functor_manager<boost::function<void (boost::system::error_code)> >::manager(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail::function::functor_manager_operation_type, boos¦
+ 6.93% boost::detail::function::functor_manager<fscp::server::async_send_to(fscp::SharedBuffer const&, unsigned long, boost::asio::ip::basic_endpoint<boost::asio::ip::udp> const&, boost::function<void (boost::system::error_code const&)>)::{lambda()#1}>::manager(b¦
+ 5.27% boost::detail::function::functor_manager<boost::asio::detail::wrapped_handler<boost::asio::io_service&, boost::function<void (boost::system::error_code const&)>, boost::asio::detail::is_continuation_delegated> >::manager(boost::detail::function::function_b¦
+ 4.24% std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_deallocate(unsigned char*, unsigned long) ¦
+ 2.38% boost::asio::asio_handler_deallocate(void*, unsigned long, ...) ¦
+ 1.76% boost::detail::sp_counted_base::destroy() ¦
+ 1.65% boost::detail::function::functor_manager<fscp::SharedBufferHandler<void (*)(std::map<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::varian¦
+ 1.45% std::_Rb_tree<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,¦
+ 1.34% fscp::data_message::get_cleartext(void*, unsigned long, cryptoplus::cipher::cipher_algorithm, void const*, unsigned long, void const*, unsigned long) const ¦
+ 1.34% boost::asio::asio_handler_allocate(unsigned long, ...) ¦
+ 1.03% std::_List_base<fscp::SharedBuffer, std::allocator<fscp::SharedBuffer> >::_M_put_node(std::_List_node<fscp::SharedBuffer>*) ¦
+ 0.93% boost::detail::sp_counted_impl_pd<fscp::SharedBuffer*, fscp::SharedBuffer::SharedBuffer<fscp::server::do_async_receive_from()::{lambda(fscp::SharedBuffer const&)#2}>(fscp::SharedBuffer const&, fscp::server::do_async_receive_from()::{lambda(fscp::SharedBuff¦
+ 0.72% fscp::data_message::raw_write(void*, unsigned long, unsigned int, cryptoplus::cipher::cipher_algorithm, void const*, unsigned long, void const*, unsigned long, void const*, unsigned long, fscp::message_type) ¦
+ 0.52% std::_Rb_tree<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,¦
- 2.27% freelan-debug libc-2.19.so [.] malloc ¦
- malloc ¦
+ 17.20% boost::detail::function::functor_manager<boost::asio::detail::wrapped_handler<boost::asio::io_service&, boost::function<void (boost::system::error_code const&)>, boost::asio::detail::is_continuation_delegated> >::manager(boost::detail::function::function_¦
+ 16.53% boost::detail::function::functor_manager<boost::function<void (boost::system::error_code)> >::manager(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail::function::functor_manager_operation_type, boos¦
+ 16.19% boost::asio::asio_handler_allocate(unsigned long, ...) ¦
+ 11.80% boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type¦
+ 8.60% boost::detail::function::functor_manager<fscp::server::async_send_to(fscp::SharedBuffer const&, unsigned long, boost::asio::ip::basic_endpoint<boost::asio::ip::udp> const&, boost::function<void (boost::system::error_code const&)>)::{lambda()#1}>::manager(b¦
+ 6.07% std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_allocate(unsigned long) ¦
+ 3.54% std::_List_base<fscp::SharedBuffer, std::allocator<fscp::SharedBuffer> >::_M_get_node() ¦
+ 3.37% bool boost::detail::function::basic_vtable1<void, boost::system::error_code>::assign_to<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port¦
+ 3.20% bool boost::detail::function::basic_vtable1<void, std::map<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::v¦
2.70% 0x203d3c206c620063 ¦
+ 2.70% boost::shared_array<unsigned char>::shared_array<unsigned char>(unsigned char*) ¦
+ 2.36% fscp::server::do_handle_data(fscp::identity_store const&, boost::asio::ip::basic_endpoint<boost::asio::ip::udp> const&, fscp::data_message const&) ¦
+ 2.02% std::_Rb_tree<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,¦
+ 1.18% std::_Rb_tree<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,¦
+ 0.51% fscp::server::do_async_receive_from() ¦
+ 1.92% freelan-debug [kernel.kallsyms] [k] __do_softirq ¦
+ 1.60% freelan-debug [kernel.kallsyms] [k] e1000_xmit_frame ¦
+ 1.26% freelan-debug [kernel.kallsyms] [k] finish_task_switch ¦
- 1.22% freelan-debug freelan-debug [.] boost::detail::atomic_increment(int*) ¦
- boost::detail::atomic_increment(int*) ¦
- 98.38% boost::detail::sp_counted_base::add_ref_copy() ¦
- boost::detail::shared_count::shared_count(boost::detail::shared_count const&) ¦
- 37.13% boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, b¦
- 96.76% boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::deta¦
- 93.08% boost::_bi::storage1<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::var¦
boost::_bi::storage2<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant¦
boost::_bi::storage3<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant¦
+ boost::_bi::list3<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::v¦
+ 2.02% boost::_bi::storage1<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::vari¦
+ 1.79% boost::_bi::storage2<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::vari¦
+ 1.79% boost::_bi::list3<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant¦
+ 1.32% boost::_bi::storage3<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::vari¦
+ 1.88% boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detai¦
+ 1.35% freelan::switch_::async_write(boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost¦
+ 22.46% boost::shared_ptr<fscp::SharedBuffer>::shared_ptr(boost::shared_ptr<fscp::SharedBuffer> const&) ¦
+ 13.64% boost::shared_array<unsigned char>::shared_array(boost::shared_array<unsigned char> const&) ¦
+ 13.61% boost::shared_ptr<asiotap::posix_tap_adapter>::shared_ptr(boost::shared_ptr<asiotap::posix_tap_adapter> const&) ¦
+ 5.50% boost::shared_ptr<x509_st>::shared_ptr(boost::shared_ptr<x509_st> const&) ¦
+ 4.39% boost::shared_ptr<evp_pkey_st>::shared_ptr(boost::shared_ptr<evp_pkey_st> const&) ¦
+ 2.96% boost::shared_ptr<boost::asio::ip::basic_endpoint<boost::asio::ip::udp> >::shared_ptr(boost::shared_ptr<boost::asio::ip::basic_endpoint<boost::asio::ip::udp> > const&) ¦
+ 1.62% boost::detail::shared_count::shared_count(boost::detail::shared_count const&)
However, I do see a significant amount of traffic hit the VPN network at the moment high CPU usage and memory leak occur:
Hello! also on linux you can try echo never > /sys/kernel/mm/transparent_hugepage/enabled
Quick update, I've been making some headway tracking down this memory leak.
I have a setup that can finally reproduce the issue with full IDE debugging support on Windows. Basically, I swapped Linux A
with Windows A
.
Below is a heap allocation snapshot from 0 seconds to 74 seconds with four heap snapshots taken along the way. You'll notice a massive heap allocation between Snapshot 3
and Snapshot 4
. In a few seconds, we go from a nominal memory usage to +200MB total bytes (+166MB from Snapshot 3
to Snapshot 4
).
Next, let's take a closer look at where the bulk of the +166MB of allocations of unsigned char[]
came from between Snapshot 3
and Snapshot 4
. Below is a sample instance of such an allocation and its associated call stack:
The vast majority of allocations instances are coming from the call stack shown above. Prime suspects are fscp:do_send_data
-> fscp:do_send_data_to_session
using the SharedBuffer
in some kind of attempt to send data:
https://github.com/freelan-developers/freelan/blob/787f0a45b19e33cbc53d718faa6b90b7f25a8c5f/libs/fscp/src/server.cpp#L2071-L2077
As I mentioned earlier, the memory leak (or high CPU usage) usually manifests itself when FreeLAN is configured in a star topology with about 10 nodes connected through 1 central node.
freelan
service is restarted, upon restart, the new freelan
service process on the central node will exhibit either high memory or high CPU utilization.
My gut feeling right now, after the central node restarts sessions with the other nodes on the overlay some kind of ambient state from the edge nodes interferes with the central node's ability to fscp:do_send_data_to_session
. The sends are indefinitely queued inside the central node, but never cleared, and thus, resulting in a memory leak.
If anyone has more thoughts on the subject, please let me know. :sunglasses:
Thanks, Brian
:dash: :walking: "Bubbles of gas in my brain... Send me off balance, it's not enough"
Hi @bchavez, thanks you very much for deep tests and investigation.
After quick look at the code, I also agree it may be related to SharedBuffer. I will try to work on that this week-end.
Hello,
I did more testing and was able to find a reliable way to reproduce the memory leak with just 2 nodes.
However, in researching the issue, the instructions (and code) to reproduce the issue the might potentially expose a Denial of Service vulnerability in FreeLAN. Please let me know the best way to disclose the issue. I can do so publicly (by creating a new GitHub issue) or privately.
If privately:
Here is a demo of the memory leak and Denial of Service on demand and in action:
/cc @ereOn @s-vincent
Thanks, Brian
:beach_umbrella: :trumpet: Beach Boys - Good Vibrations (Nick Warren bootleg)
@bchavez I finally have some time to look at this issue. I already reproduce memory leaks with your scenario. Now time to investigate in source code.
Here is what I see:
My thought about that:
I decrease a bit the size of packet buffer and there is way less memory consumption.
We have two main solutions to handle this:
@ereOn is it normal to have such a high buffer size (65536) ? UDP protocol allows it but we are still limited by MTU of the interface to send it in one piece or else use IP fragmentation. But IP fragmentation is not allowed by router on the internet AFAIK (and performance is not good).
@bchavez Can you test with a SharedBuffer size of 4096 (instead of 65536) in libs/fscp/src/server.cpp ?
I would personnaly go for second proposition. At least we can let people test it in real use-cases and see if they have problems with packet size.
What do you think ?
Hi @s-vincent ,
I'm happy you were able to reproduce the issue. Also, thanks very much for looking into the issue. You bring up some great points.
I did some testing with 4096 SharedBuffer
size, and the situation seems to be greatly improved. However, IMHO, we still have a problem with monotonically increasing memory usage. In the image below, I ran three flood events.
Initially, when the first flood of traffic occurs, we instantly get an increase in memory usage. The 4096 buffer memory ramp up is much less that before; which is great. However, if we sustain this level of traffic against the host, the host will continue consuming memory albeit at a slower rate. It should be noted, that we do have a reliable way to consume all memory on the target host. Additionally, the consumed memory is never reclaimed by the operating system after the floods stop.
I think we need an upper-bound limit or have some mechanism that frees these SharedBuffer
s after a period of time.
Also, as a quick side note, during a flood event, the processing of packets on the FreeLAN generally stops anyway. So, the consequences of blocking processing in solution No.1 may not matter much?
I will continue to do more testing this weekend.
Thanks again, Brian
:fallen_leaf: :leaves: Distance - Falling (ft Alys Be)
PS. If anyone is interested in help test the 4069 SharedBuffer
changes, I have build artifacts available here:
debian/unstable
branch here:
https://github.com/bchavez/freelan/tree/debian/unstableTo build .deb
package, make sure this commit is present in your checkout: https://github.com/bchavez/freelan/commit/81601fad7c2cf441c921b52f3941943fd7f24c06. The commit fixes the broken debian packaging system in FreeLAN.
In detail:
debian/unstable
branch.
git checkout origin/debian/unstable
dpkg-buildpackage -b -rfakeroot -us -uc
(or dh binary
) in freelan
root checkout folder. Wait a long time, freelan will build. Then out comes a cute .deb
file! hey! :)(tested on Ubuntu 14.04 and 18.04)
2018-10-20T20:44:15.576079 [TRACE] Received a data message from WindowsA:12021 but no session exists. Ignoring
https://github.com/freelan-developers/freelan/issues/158#issuecomment-325555490 Previously I was writing that Freelan must not Ignore incorrect session connection, and send answer (like session ID:invalid or session ID:terminate) to client - the client will know that session is not valid anymore and terminate session data in buffer.
I think we have a circular references issue for the recycling mechanism.
The SharedBuffer holds a shared_ptr which have a deleter based on lambda expressions. During a time lambda expression capture a reference of the SharedBuffer. Even with limiting buffer list and shrinking when necessary memory never be reclaimed.
This link is interresting on the shared_ptr/lambda issue: https://floating.io/2017/07/lambda-shared_ptr-memory-leak/.
In 82830f2900110bf855014bf2ca3176db2c50c539 I disabled the recycling mechanism.
@bchavez Can you please test this one ?
@ereOn This stuff seems difficult to solve it with current SharedBuffer architecture. Also the effective recycle action happened in a (possibly) another strand so asynchronically which adds complexity. I will try to think about a solution but if you have something in your mind, let's share it :).
Thanks @s-vincent , I'll test the new changes with disabled recycling.
SharedBuffer
Just wanted to add an extra note on testing the 4096 SharedBuffer
changes. Under a real-world deployment as described in the comment here: https://github.com/freelan-developers/freelan/issues/74#issuecomment-431638703, I bootstrapped the FreeLAN network bringing up each node one by one starting with Linux A
After the network was stable, I rebooted Linux A. Linux A experienced very high CPU usage after reboot. I could not recover network communication between any nodes. I still think there's something fundamentally incorrect under the hood as @richman1000000 alluded to. So I'm glad @s-vincent was able to identify something more fundamental that could be causing issues.
@richman1000000 , I think you have a good idea but we should also keep in mind that any host that experiences issues like this should be able to handle invalid sessions without the help or aid of other hosts resetting their buffers. With networking code, we can't assume all nodes are trustworthy. A malicious attacker can choose to ignore any "reset" messages and still cause harm to the target host.
I'll report back with results on disabled recycling shortly.
Hi @s-vincent ,
I tested the disabled recycling mechanism and we have very good results. Excellent work Sebastien. We now have a sawtooth pattern in memory usage when flood attacks occur. Memory is now reclaimed by the operating system:
The freelan.exe
recovers to a respectable 13 MB. Peak memory usage during a flood attack reached up to 237 MB (unbounded) and memory consumption only lasted for the duration of the flood attack.
Overall, this is a very good step in the right direction. I will continue to do more testing.
Thanks, Brian
:car: :blue_car: "Let the good times roll..."
Okay, I did more testing with disabled recycling mechanism, but this time, with a real-world setup with 13 VPN nodes:
If you restart Linux A, the freelan
process consumes 50%-90% CPU and will experience unbounded high memory usage. Sometimes its only high CPU usage, other times, it's high CPU usage & memory leak:
So, we still have a memory leak somewhere. The VPN network cannot recover after Linux A restarts. The freelan
process on Linux A gets wildly out of control and Linux A becomes unresponsive to terminal input.
SharedBuffer
using the lambda for disposal? If so, we might consider removing them also.I'll try to run a profiler to see if I can pick up where these exhaustive CPU cycles are being spent.
-Brian
Hi @bchavez,
Thanks for testing!
In f499f9a3eae3318ff38ffcb69978c74322026440 I disable recycling for socket receive buffers. This is the last place where we use recycle mechanism.
Thanks @s-vincent,
Results are roughly the same as last recycling tests.
In the real-world test topology with 13 nodes, when Linux A reboots, we still have either: 1) High CPU usage on Linux A or 2) High CPU usage & memory leak on Linux A.
After Linux A reboot... when we have high CPU usage on Linux A, typically there is a burst of traffic on the VPN network seen on almost all nodes. However, there's usually one node on the 13 node VPN network that will sustain high "echo"/chatter traffic with Linux A. The selection of the node is non-deterministic and appears to be randomly chosen. The image below is a sample of the this VPN "echo" traffic pattern between both VPN hosts (the random node and Linux A):
The log for the random node engaged with high echo traffic with Linux A (192.168.0.5) is shown below:
2018-12-15T12:28:45.695044 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 143037, expecting: 143039). Ignoring.
2018-12-15T12:28:46.114068 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 144912, expecting: 144913). Ignoring.
2018-12-15T12:28:46.162071 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 149536, expecting: 149538). Ignoring.
2018-12-15T12:28:46.163071 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 149630, expecting: 149632). Ignoring.
2018-12-15T12:28:46.229075 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 156292, expecting: 156298). Ignoring.
2018-12-15T12:28:46.239075 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 157383, expecting: 157384). Ignoring.
2018-12-15T12:28:46.274077 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 161073, expecting: 161074). Ignoring.
2018-12-15T12:28:46.289078 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 162609, expecting: 162611). Ignoring.
2018-12-15T12:28:46.481089 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 180586, expecting: 180592). Ignoring.
2018-12-15T12:28:46.687101 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 199147, expecting: 199151). Ignoring.
2018-12-15T12:28:46.755105 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 205321, expecting: 205332). Ignoring.
2018-12-15T12:28:46.773106 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 206956, expecting: 206962). Ignoring.
2018-12-15T12:28:46.809108 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 209948, expecting: 209952). Ignoring.
2018-12-15T12:28:46.935115 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 220466, expecting: 220471). Ignoring.
2018-12-15T12:28:50.471317 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-15T12:28:50.471317 [DEBUG] A session already exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Not contacting again.
2018-12-15T12:29:17.517864 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 540960, expecting: 540961). Ignoring.
2018-12-15T12:29:20.472033 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-15T12:29:20.472033 [DEBUG] A session already exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Not contacting again.
2018-12-15T12:29:37.425003 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 746037, expecting: 746038). Ignoring.
2018-12-15T12:29:50.472749 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-15T12:29:50.472749 [DEBUG] A session already exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Not contacting again.
2018-12-15T12:30:02.699449 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 1009699, expecting: 1009700). Ignoring.
2018-12-15T12:30:20.473465 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-15T12:30:23.629646 [DEBUG] A session already exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Not contacting again.
Not much more to discern in this test topology
Case 2) is harder to test because high memory usage usually slows down Linux A terminal beyond usability. Also, I have some fear of losing control over Linux A host.
I also have some interesting results in the following 4 node test topology. In the following test scenario, I inject several flood events on Node A (Windows) which starves the other nodes (Node 1, 2, 3) from their network packets from being processed on Node A (Windows).
After several flood events (and when flooding has stopped), I immediately restart the freelan.exe
process on Node A (Windows) in the image above. Occasionally, I'm able to reproduce high-CPU usage on Node A (Windows). Below is a CPU process profile during persistent 40-50% CPU usage.
To clarify:
freelan.exe
process on Node A (Windows)./usr/bin/freelan
process is NOT restarted on Nodes 1, 2, 3 (Ubuntu). They are left as is during this testing process.The results are reminiscent of the real-world 13-node test, case 1) when Linux A undergoes high-CPU usage only.
(CPU-Profiling, click image below for larger screen and zoom)
Occasionally, in this 4-node topology setup, I can reproduce high-CPU usage & memory leak. Again, eerily similar to the real-world 13-node case 2 that Linux A undergoes when the terminal becomes unresponsive.
In the following image, almost 2GB of memory is consumed. This occurred with NO FLOODING whatsoever. Only a simple process restart on Node A (Windows). This memory ramp up happened on its own. The only thing I did was restart the freelan.exe
process on Node A (Windows) with the debugger.
I paused the debugger at the very peak of the last ramp to see what was going on under the hood. My debugger stopped in this section of code deep inside boost::bind()
. The call stack was several layers deep in boost:bind() -> boost:list() -> boost:storage()
. The callstack had to be unwound before I finally ended up here:
Notice, the this->m_write_queue { size = 20556 }
. Is this of any particular interest? Perhaps, is it possible for this write_queue
to get backed up and grow unbounded in the event that some IO operation that never completes?
I'm going to try to limit the write_queue
size and see if this helps control the unbounded memory growth.
-Brian
I found a strange case when we try to relay packet from one host to itself. At that time CPU usage was a bit high.
I fix that in 57423405f69da850d10a6fbd4b4cade9b422e886.
Hey @s-vincent , 5742340 has a build failure?
https://travis-ci.org/freelan-developers/freelan/jobs/468685610#L1232
I also have the same build failure on Ubuntu 14.04 build box.
root@test:~/freelan22# g++ --version
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
build/release/libs/freelan/src/switch.cpp: In member function ‘void freelan::switch_::async_write(freelan::port_index_type, boost::asio::const_buffer, freelan::switch_::multi_write_handler_type)’:
build/release/libs/freelan/src/switch.cpp:142:13: error: ambiguous overload for ‘operator!=’ (operand types are ‘freelan::port_index_type {aka boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type>}’ and ‘const boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type>’)
if(index != target)
^
Compiles okay without problems on Ubuntu 18.04.
g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
It should be good now with latest revision. Build 445 passed on TravisCI.
Hi @s-vincent,
We have VERY GOOD :tada: test results with the latest changes at /master
780d8ef49c1c8c9c3d8ad286b1cf6cc245c2601e. I think 5742340 fixed many issues, including the high-CPU usage and ambient memory leak after process reboots.
I tested both 13-node real-world and 4-node topology setups.
:heavy_check_mark: The "echo"/chatter traffic on VPN after process reboots is gone. :heavy_check_mark: The high-CPU usage after process reboots appears to be gone. :heavy_check_mark: The high-CPU usage & memory leak after process reboots appear to be gone. :heavy_check_mark: The VPN network seems to fully recover after a flood attack. :heavy_check_mark: The VPN network seems to fully recover after process reboots. :heavy_check_mark: Monotonically increasing memory usage is no longer observed. (We have sawtooth pattern now, see image below). :warning: During a flood attack, VPN network traffic between all nodes still stops. :warning: During a flood attack, memory usage continues to increase without limit or bound. If the flood can be sustained against a target, available memory can be depleted. (see image below)
I think the last two warnings can be resolved by some extra threading changes or buffer architecture changes, but overall, we're in a very good spot. This is possibly the most stable source in FreeLAN history.
Great work Sebastien! I will continue to monitor and test the stability of the latest changes in the coming weeks.
For others who wish to test: Windows Binaries: https://ci.appveyor.com/project/bchavez/freelan/build/job/qa7belfqbo90hfd8/artifacts
Ubuntu (build from source)
https://github.com/bchavez/freelan/tree/debian/unstable
Brian
:walking: :sunglasses: "So don't delay... act now supplies are running out..."
Hi,
I downloaded Freelan 2.0 from the releases tab here on Github.
I've installed Freelan 2.0 on Windows (via x64 installer on release tab) and memory seems stable at about 95MB.
I've compiled, built, and installed Freelan 2.0 (via source on relase tab) on
Ubuntu 14.04.2 LTS
, all compiles fine and runs OK.I have a 5 windows nodes and 1 Linux node setup in a star topology. The Linux node is acting as a switch/relay for the 5 windows nodes.
After about 5 minutes, the VIRT memory on Linux seems to explode into several GB. Here's an
htop
screen shot of the memory usage:After about 10 minutes the system begins to become non-responsive. :(
Any idea on how to begin debugging this kind of problem?
The network is using certificate based authentication on all nodes.
Thanks, Brian
I guess I will try using passphrase auth and see if that helps .... hmm.