freelan-developers / freelan

The main freelan repository.
http://www.freelan.org
Other
1.35k stars 200 forks source link

Possible Memory Leak in Freelan 2.0 #74

Closed bchavez closed 5 years ago

bchavez commented 9 years ago

Hi,

I downloaded Freelan 2.0 from the releases tab here on Github.

I've installed Freelan 2.0 on Windows (via x64 installer on release tab) and memory seems stable at about 95MB.

I've compiled, built, and installed Freelan 2.0 (via source on relase tab) on Ubuntu 14.04.2 LTS, all compiles fine and runs OK.

I have a 5 windows nodes and 1 Linux node setup in a star topology. The Linux node is acting as a switch/relay for the 5 windows nodes.

After about 5 minutes, the VIRT memory on Linux seems to explode into several GB. Here's an htop screen shot of the memory usage:

screen400

After about 10 minutes the system begins to become non-responsive. :(

Any idea on how to begin debugging this kind of problem?

The network is using certificate based authentication on all nodes.

Thanks, Brian

I guess I will try using passphrase auth and see if that helps .... hmm.

bchavez commented 9 years ago

I tried only using passphrases and the problem persists on Linux.

I added a few more nodes to the network, now 7 windows PC nodes connected 1 Linux server.

If you bring up each node one by one. The problem seems to begin around adding the 5th or 6th node to the Freelan network in a star topology...

One way I've found to reliably reproduce the memory leak is to:

  1. Turn on all nodes in the network except the Linux hub/node.
  2. Turn on the Linux hub/node and all other nodes will hammer the Linux node.

Also the CPU tends to jam at 100% sometimes. All very strange. Let me know if there's anything else I can do to help.

bchavez commented 9 years ago

Seems like windows is not immune.

I swapped the center linux hub node with a windows node, and problem persists with a windows node/hub:

screen402

830M+ memory usage and continues to climb.

Below is a dump of the center hub (windows) when memory explodes:

C:\Program Files\FreeLAN>bin\freelan -t 1 -d -c config\freelan.cfg
2015-05-10T03:48:13.104035 [INFORMATION] Reading configuration file at: "C:\Program Files\FreeLAN\config\freelan.cfg"
2015-05-10T03:48:13.140037 [INFORMATION] Deriving pre-shared key from passphrase...
2015-05-10T03:48:13.158038 [DEBUG] Opening core...
2015-05-10T03:48:13.169038 [INFORMATION] Enabling pre-shared key authentication.
2015-05-10T03:48:13.186039 [INFORMATION] Starting FSCP server...
2015-05-10T03:48:13.203040 [IMPORTANT] Core set to listen on: 0.0.0.0:12000
2015-05-10T03:48:13.220041 [INFORMATION] Building CA store...
2015-05-10T03:48:13.234042 [INFORMATION] FSCP server started.
2015-05-10T03:48:13.251043 [IMPORTANT] Tap adapter "FREELAN" opened in mode tap with a MTU set to: 1446
2015-05-10T03:48:13.269044 [IMPORTANT] MSS override enabled with a value of: 1392
2015-05-10T03:48:13.286045 [INFORMATION] IPv4 address: 25.0.0.3/8
2015-05-10T03:48:13.303046 [INFORMATION] No IPv6 address configured.
2015-05-10T03:48:14.623122 [INFORMATION] Setting interface metric to: 3
2015-05-10T03:48:14.633122 [INFORMATION] Putting interface into the connected state.
2015-05-10T03:48:14.651123 [INFORMATION] The DHCP proxy is enabled.
2015-05-10T03:48:14.667124 [INFORMATION] Not advertising any routes.
2015-05-10T03:48:14.682125 [INFORMATION] Advertising the following DNS servers: 25.0.0.5
2015-05-10T03:48:14.701126 [DEBUG] Core opened.
2015-05-10T03:48:14.710127 [INFORMATION] Using 1 thread(s).
2015-05-10T03:48:14.723127 [IMPORTANT] Execution started.
2015-05-10T03:48:14.701126 [INFORMATION] Starting tap adapter's thread...
2015-05-10T03:48:14.735128 [DEBUG] Thread #0 started.
2015-05-10T03:48:15.028145 [TRACE] Received a data message from 192.168.0.100:12000 but no session exists. Ignoring.
2015-05-10T03:48:18.063318 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:18.165324 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:18.363336 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:19.112378 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:19.464399 [TRACE] Received a data message from AAA.BB.CC.53:12000 but no session exists. Ignoring.
2015-05-10T03:48:19.864421 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:20.633465 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:20.735471 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:20.934483 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:21.684526 [TRACE] Received a data message from AAA.BB.CC.54:12000 but no session exists. Ignoring.
2015-05-10T03:48:29.320962 [DEBUG] Received HELLO_REQUEST from AAA.BB.CC.54:12000.
2015-05-10T03:48:29.354964 [TRACE] Received a SESSION_REQUEST from AAA.BB.CC.54:12000 but no presentation is available. Ignoring.
2015-05-10T03:48:29.373965 [DEBUG] Received PRESENTATION from AAA.BB.CC.54:12000 using pre-shared key authentication.
2015-05-10T03:48:29.396967 [INFORMATION] Accepting PRESENTATION from AAA.BB.CC.54:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:29.420968 [DEBUG] Sending SESSION_REQUEST to AAA.BB.CC.54:12000.
2015-05-10T03:48:29.436969 [TRACE] Sending session request message to AAA.BB.CC.54:12000 (next_session_number: 0, local_host_identifier: f870c8a2f771e1782f794434afd5ce31340a52e7007554eecf340365435d7848)
2015-05-10T03:48:29.505973 [DEBUG] Received SESSION from AAA.BB.CC.54:12000 (default: accept).
2015-05-10T03:48:29.521974 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:29.539975 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:29.558976 [TRACE] Received a SESSION from AAA.BB.CC.54:12000 with session number 0 but no session was prepared yet. Preparing a new one.
2015-05-10T03:48:29.612979 [TRACE] Session established with AAA.BB.CC.54:12000. Sending acknowledgement session message back.
2015-05-10T03:48:29.633980 [TRACE] Sending session message to AAA.BB.CC.54:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:29.665982 [IMPORTANT] Session established with AAA.BB.CC.54:12000.
2015-05-10T03:48:29.682983 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:29.698984 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:29.711985 [DEBUG] Sending routes request to AAA.BB.CC.54:12000.
2015-05-10T03:48:29.727986 [INFORMATION] Added system route: Network interface #14 - AAA.BB.CC.54/32 => 192.168.0.1 - metric 0
2015-05-10T03:48:29.786989 [TRACE] Received a SESSION from AAA.BB.CC.54:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:29.818991 [DEBUG] Received routes request from AAA.BB.CC.54:12000. Replying with version 0:
2015-05-10T03:48:29.841992 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.54:12000: version 0 (), (25.0.0.5).
2015-05-10T03:48:37.350422 [DEBUG] Received HELLO_REQUEST from AAA.BB.CC.55:12000.
2015-05-10T03:48:37.378423 [DEBUG] Received PRESENTATION from AAA.BB.CC.55:12000 using pre-shared key authentication.
2015-05-10T03:48:37.398424 [INFORMATION] Accepting PRESENTATION from AAA.BB.CC.55:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:37.421426 [DEBUG] Sending SESSION_REQUEST to AAA.BB.CC.55:12000.
2015-05-10T03:48:37.437427 [TRACE] Sending session request message to AAA.BB.CC.55:12000 (next_session_number: 0, local_host_identifier: 76f611ccf832b4d8cb7c94a9a45289a9b55b383601d60a2a8d397d765cabd696)
2015-05-10T03:48:37.466428 [DEBUG] Received SESSION_REQUEST from AAA.BB.CC.55:12000 (default: accept).
2015-05-10T03:48:37.484429 [DEBUG] Cipher suites capabilities: ecdhe_rsa_aes256_gcm_sha384 ecdhe_rsa_aes128_gcm_sha256
2015-05-10T03:48:37.508431 [DEBUG] Elliptic curve capabilities: sect571k1 secp384r1
2015-05-10T03:48:37.522431 [TRACE] Received a SESSION_REQUEST from AAA.BB.CC.55:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384_sect571k1. No current session exist: preparing one and sending it.
2015-05-10T03:48:37.569434 [TRACE] Sending session message to AAA.BB.CC.55:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:37.646438 [DEBUG] Received SESSION from AAA.BB.CC.55:12000 (default: accept).
2015-05-10T03:48:37.661439 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:37.677440 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:37.699442 [TRACE] Session established with AAA.BB.CC.55:12000. Sending acknowledgement session message back.
2015-05-10T03:48:37.718443 [TRACE] Sending session message to AAA.BB.CC.55:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:37.751445 [IMPORTANT] Session established with AAA.BB.CC.55:12000.
2015-05-10T03:48:37.767445 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:37.785446 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:37.799447 [DEBUG] Sending routes request to AAA.BB.CC.55:12000.
2015-05-10T03:48:37.814448 [INFORMATION] Added system route: Network interface #14 - AAA.BB.CC.55/32 => 192.168.0.1 - metric 0
2015-05-10T03:48:37.835449 [TRACE] Received a SESSION from AAA.BB.CC.55:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:37.872451 [DEBUG] Received routes request from AAA.BB.CC.55:12000. Replying with version 0:
2015-05-10T03:48:37.892453 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.55:12000: version 0 (), (25.0.0.5).
2015-05-10T03:48:39.508545 [DEBUG] Received HELLO_REQUEST from AAA.BB.CC.53:12000.
2015-05-10T03:48:39.538547 [TRACE] Received a SESSION_REQUEST from AAA.BB.CC.53:12000 but no presentation is available. Ignoring.
2015-05-10T03:48:39.560548 [DEBUG] Received PRESENTATION from AAA.BB.CC.53:12000 using pre-shared key authentication.
2015-05-10T03:48:39.580549 [INFORMATION] Accepting PRESENTATION from AAA.BB.CC.53:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:39.604550 [DEBUG] Sending SESSION_REQUEST to AAA.BB.CC.53:12000.
2015-05-10T03:48:39.620551 [TRACE] Sending session request message to AAA.BB.CC.53:12000 (next_session_number: 0, local_host_identifier: 232637cbc0fa3e41d911adbb86e82454bc4f873d8b1910b8261f9c58425038c2)
2015-05-10T03:48:39.692556 [DEBUG] Received SESSION from AAA.BB.CC.53:12000 (default: accept).
2015-05-10T03:48:39.706556 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:39.720557 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:39.731558 [TRACE] Received a SESSION from AAA.BB.CC.53:12000 with session number 0 but no session was prepared yet. Preparing a new one.
2015-05-10T03:48:39.774560 [TRACE] Session established with AAA.BB.CC.53:12000. Sending acknowledgement session message back.
2015-05-10T03:48:39.795561 [TRACE] Sending session message to AAA.BB.CC.53:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:39.827563 [IMPORTANT] Session established with AAA.BB.CC.53:12000.
2015-05-10T03:48:39.855565 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:39.875566 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:39.890567 [DEBUG] Sending routes request to AAA.BB.CC.53:12000.
2015-05-10T03:48:39.907568 [INFORMATION] Added system route: Network interface #14 - AAA.BB.CC.53/32 => 192.168.0.1 - metric 0
2015-05-10T03:48:39.975572 [TRACE] Received a SESSION from AAA.BB.CC.53:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:40.007574 [DEBUG] Received routes request from AAA.BB.CC.53:12000. Replying with version 0:
2015-05-10T03:48:40.026575 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.53:12000: version 0 (), (25.0.0.5).
2015-05-10T03:48:41.442656 [DEBUG] Received HELLO_REQUEST from AAA.BB.CC.35:12000.
2015-05-10T03:48:41.485658 [TRACE] Received a SESSION_REQUEST from AAA.BB.CC.35:12000 but no presentation is available. Ignoring.
2015-05-10T03:48:41.504659 [DEBUG] Received PRESENTATION from AAA.BB.CC.35:12000 using pre-shared key authentication.
2015-05-10T03:48:41.522660 [INFORMATION] Accepting PRESENTATION from AAA.BB.CC.35:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:41.544661 [DEBUG] Sending SESSION_REQUEST to AAA.BB.CC.35:12000.
2015-05-10T03:48:41.558662 [TRACE] Sending session request message to AAA.BB.CC.35:12000 (next_session_number: 0, local_host_identifier: 0182b8d34a6a2775708a3c0876f854e9e3d45a8dab8bf37242417d367376f937)
2015-05-10T03:48:41.627666 [DEBUG] Received SESSION from AAA.BB.CC.35:12000 (default: accept).
2015-05-10T03:48:41.641667 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:41.656668 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:41.668669 [TRACE] Received a SESSION from AAA.BB.CC.35:12000 with session number 0 but no session was prepared yet. Preparing a new one.
2015-05-10T03:48:41.714671 [TRACE] Session established with AAA.BB.CC.35:12000. Sending acknowledgement session message back.
2015-05-10T03:48:41.733672 [TRACE] Sending session message to AAA.BB.CC.35:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:41.760674 [IMPORTANT] Session established with AAA.BB.CC.35:12000.
2015-05-10T03:48:41.777675 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:41.793676 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:41.808677 [DEBUG] Sending routes request to AAA.BB.CC.35:12000.
2015-05-10T03:48:41.825678 [INFORMATION] Added system route: Network interface #14 - AAA.BB.CC.35/32 => 192.168.0.1 - metric 0
2015-05-10T03:48:41.890681 [TRACE] Received a SESSION from AAA.BB.CC.35:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:41.924683 [DEBUG] Received routes request from AAA.BB.CC.35:12000. Replying with version 0:
2015-05-10T03:48:41.947685 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.35:12000: version 0 (), (25.0.0.5).
2015-05-10T03:48:45.352879 [DEBUG] Received HELLO_REQUEST from 192.168.0.100:12000.
2015-05-10T03:48:45.366880 [DEBUG] Received PRESENTATION from 192.168.0.100:12000 using pre-shared key authentication.
2015-05-10T03:48:45.378881 [INFORMATION] Accepting PRESENTATION from 192.168.0.100:12000 for pre-shared key authentication: first presentation.
2015-05-10T03:48:45.396882 [DEBUG] Sending SESSION_REQUEST to 192.168.0.100:12000.
2015-05-10T03:48:45.413883 [TRACE] Sending session request message to 192.168.0.100:12000 (next_session_number: 0, local_host_identifier: 60463ef586abf360f9b8294fb7b6a5991cbcf5d2875a78c54bad6570549706d7)
2015-05-10T03:48:45.446885 [DEBUG] Received SESSION_REQUEST from 192.168.0.100:12000 (default: accept).
2015-05-10T03:48:45.464886 [DEBUG] Cipher suites capabilities: ecdhe_rsa_aes256_gcm_sha384 ecdhe_rsa_aes128_gcm_sha256
2015-05-10T03:48:45.485887 [DEBUG] Elliptic curve capabilities: sect571k1 secp384r1
2015-05-10T03:48:45.503888 [TRACE] Received a SESSION_REQUEST from 192.168.0.100:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384_sect571k1. No current session exist: preparing one and sending it.
2015-05-10T03:48:45.542890 [TRACE] Sending session message to 192.168.0.100:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:45.577892 [DEBUG] Received SESSION from 192.168.0.100:12000 (default: accept).
2015-05-10T03:48:45.593893 [DEBUG] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:45.609894 [DEBUG] Elliptic curve: sect571k1
2015-05-10T03:48:45.630895 [TRACE] Session established with 192.168.0.100:12000. Sending acknowledgement session message back.
2015-05-10T03:48:45.691899 [TRACE] Sending session message to 192.168.0.100:12000 (session number: 0, cipher suite: ecdhe_rsa_aes256_gcm_sha384, elliptic curve: sect571k1).
2015-05-10T03:48:45.722900 [IMPORTANT] Session established with 192.168.0.100:12000.
2015-05-10T03:48:45.739901 [INFORMATION] Cipher suite: ecdhe_rsa_aes256_gcm_sha384
2015-05-10T03:48:45.755902 [INFORMATION] Elliptic curve: sect571k1
2015-05-10T03:48:45.769903 [DEBUG] Sending routes request to 192.168.0.100:12000.
2015-05-10T03:48:45.785904 [INFORMATION] Added system route: Network interface #14 - 192.168.0.100/32 - metric 0
2015-05-10T03:48:45.803905 [TRACE] Received a SESSION from 192.168.0.100:12000 with session number 0 and cipher suite ecdhe_rsa_aes256_gcm_sha384. A session currently exists and has the same number and cipher suite. Ignoring.
2015-05-10T03:48:45.836907 [DEBUG] Received routes request from 192.168.0.100:12000. Replying with version 0:
2015-05-10T03:48:45.856908 [DEBUG] Sending routes and DNS servers to 192.168.0.100:12000: version 0 (), (25.0.0.5).
2015-05-10T03:49:04.794991 [DEBUG] Received routes request from AAA.BB.CC.53:12000. Replying with version 0:
2015-05-10T03:49:04.858995 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.53:12000: version 0 (), (25.0.0.5).
2015-05-10T03:49:29.098381 [DEBUG] Received routes request from AAA.BB.CC.35:12000. Replying with version 0:
2015-05-10T03:49:29.118383 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.35:12000: version 0 (), (25.0.0.5).
2015-05-10T03:49:29.372397 [DEBUG] Received routes request from AAA.BB.CC.54:12000. Replying with version 0:
2015-05-10T03:49:29.382398 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.54:12000: version 0 (), (25.0.0.5).
2015-05-10T03:49:37.265849 [DEBUG] Received routes request from AAA.BB.CC.55:12000. Replying with version 0:
2015-05-10T03:49:37.281849 [DEBUG] Sending routes and DNS servers to AAA.BB.CC.55:12000: version 0 (), (25.0.0.5).
2015-05-10T03:50:45.347743 [DEBUG] Received routes request from 192.168.0.100:12000. Replying with version 0:
2015-05-10T03:50:45.357743 [DEBUG] Sending routes and DNS servers to 192.168.0.100:12000: version 0 (), (25.0.0.5).
2015-05-10T03:51:13.178334 [DEBUG] Sending routes request to all hosts.
2015-05-10T03:52:04.652279 [WARNING] Signal caught (2): exiting...
2015-05-10T03:52:04.798287 [DEBUG] Closing core...
2015-05-10T03:52:05.099304 [INFORMATION] Tap adapter's thread is now stopped.
2015-05-10T03:52:06.191367 [INFORMATION] Closing FSCP server...
2015-05-10T03:52:06.632392 [INFORMATION] FSCP server closed.
2015-05-10T03:52:06.640392 [DEBUG] Core closed.
2015-05-10T03:52:08.147478 [INFORMATION] Removed system route: Network interface #14 - AAA.BB.CC.55/32 => 192.168.0.1 - metric 0
2015-05-10T03:52:08.172480 [INFORMATION] Removed system route: Network interface #14 - AAA.BB.CC.54/32 => 192.168.0.1 - metric 0
2015-05-10T03:52:08.194481 [INFORMATION] Removed system route: Network interface #14 - AAA.BB.CC.53/32 => 192.168.0.1 - metric 0
2015-05-10T03:52:08.221483 [INFORMATION] Removed system route: Network interface #14 - AAA.BB.CC.35/32 => 192.168.0.1 - metric 0
2015-05-10T03:52:08.488498 [INFORMATION] Removed system route: Network interface #14 - 192.168.0.100/32 - metric 0
ereOn commented 9 years ago

@bchavez Thanks for the thorough bug report.

I'll give a serious look at this today and keep you posted.

My hunch is that this has something to do with the new memory allocation scheme. Memory chunks are pre-allocated in a pool but if the pool is exhausted, then FreeLAN falls back to dynamic allocation. Dynamic blocks are added to the pool but never removed, resulting in a high usage of memory that never goes down.

ereOn commented 9 years ago

@bchavez Could please detail the setup you used to reproduce this ?

I started 6 instances on the same Linux box (one central - 5 clients) and ping flooded the whole mesh but couldn't notice a memory increase.

I also tried with an OS X instance and Windows instances both connected to the same mesh I described, but couldn't notice a memory increase either.

I'm convinced there is something to improve there, but without any way of reproducing, it's going to be hard to do.

ereOn commented 9 years ago

@bchavez I still couldn't reproduce the issue, but made an attempt to fix it anyway.

Could you please try the memory_usage branch ?

bchavez commented 9 years ago

Hi @ereOn,

Thanks so much for your help. Indeed, very strange. I'll try everything to help.

The setup topology is a star topology, specifically no nodes are allowed to talk to each other, only through a centralized hub node.

The overall topology is similar to:

         [5 win nodes]
               | (internet)
(firewall with port forwarding: 12000)
           [Linux Hub]
            /       \ (local LAN)
   [win node]        [win node]

(please excuse my weak attempt at ASCII art)

My configuration files:

Linux Hub:

[client]
public_endpoint=0.0.0.0
[fscp]
accept_contacts=no
[tap_adapter]
ipv4_address_prefix_length=25.0.0.5/8
[switch]
relay_mode_enabled=yes
[router]
local_dns_server=25.0.0.5
[security]
passphrase=hello
passphrase_salt=world
#Notes: Specifically removing ipv6, all freelan nodes have IPs that begin with 25.*.*.*
#The rest of the settings are default values left commented and not overridden

Windows node:

[client]
public_endpoint=0.0.0.0
[fscp]
contact=linux.hub.com:12000
accept_contact_requests=no
accept_contacts=no
[tap_adapter]
ipv4_address_prefix_length=25.0.0.201/8
[security]
passphrase=hello
passphrase_salt=world
# Below was used when testing with certificates:
#signature_certificate_file="C:\Program Files\FreeLAN\config\node01.crt"
#signature_private_key_file="C:\Program Files\FreeLAN\config\node01.key"
#authority_certificate_file="C:\Program Files\FreeLAN\config\ca.crt"

Other windows nodes are various increments to the IP and certificate path.

Also, I do have a few other windows nodes that I didn't update yet. So, maybe a possibility that the Freelan 1.1 nodes are interfering (DoS-ing) the Linux Hob through incompatible protocol verbiage? I'm not really sure, just trying to get you every detail I have.

Sure thing, I'll try recompiling the memory_branch and let you know how it goes.

Edit: The Linux Hub is Ubuntu 14.04.2 The Windows nodes are a various mix of Windows Server and Windows 7.

ereOn commented 9 years ago

@bchavez Don't bother right now: this commit introduces a bug that I'm trying to fix.

bchavez commented 9 years ago

@ereOn Sure thing, whenever you're ready I'll be happy to help. Totally understandable, it's quite frustrating not being able to reproduce the issue. :(

ereOn commented 9 years ago

@bchavez Okay, I hardcoded some limits in the memory allocation.

Could you test it ?

bchavez commented 9 years ago

Yes sir, will test right now. BTW, I'm on IRC channel if needed.

ghost commented 8 years ago

Was this issue resolved?

ereOn commented 8 years ago

@sbeckwithiii It might have been. There is actually no confirmation that the issue was real nor that the "fix" changed anything.

richman1000000 commented 7 years ago

Hi! I haven't seen memory blows of freelan processes since i've discovered problem of selft-contact. Can you try to block local ip from contact list? "ipv4_address_prefix_length=10.18.0.10/23" - here you assign local ip "never_contact=10.18.0.10" - here you deny contacting local ip. also can you check #83

richman1000000 commented 7 years ago

hi! I got this memory leak! I had freelan crashing, so reading this topic I've increased my swapfile by 50 GBs, but I do not remember what I've changed prior to problem appearing. I've got this leak several times before deciding to report it. I'm not changing my configurations for easier analysis.

How can I collect and send you debug data? I'm not a developer, but admin. If you will provide me step by step instruction, I will send it to you. image

s-vincent commented 7 years ago

Does it happens with latest code from master repository or with some official version / Debian package ?

bchavez commented 7 years ago

@s-vincent , for me the last time I checked, this happened with the Freelan 2.0 official release. The problem existed in both Windows and Linux Debian release apt package. IIRC I also tested some versions built by @ereOn that had memory modifications presumably built from /master at the time and the problem still existed.

Since the time of these tests, I've downgraded to Freelan 1.1 for now and everything has been somewhat stable. IIRC, the network throughput in Freelan 1.0 about an order of magnitude faster than 1.1; however, there was no way for me to get back to Freelan 1.0 Linux wise because the binaries in the apt-get release 1.0 site were wiped and there's no GitHub source tag for 1.0.

s-vincent commented 7 years ago

@bchavez Any chance for you to test the latest git revision (master branch) and see if memory leak still happens ?

leggewie commented 7 years ago

clearing 2.1 milestone until this can be successfully reproduced

bchavez commented 7 years ago

Hi @s-vincent , sorry for the late reply. I'll be happy to test this if there are some windows binaries available. IIRC, it was painful last time I built freelan from source to produce its associated binaries. It's been a while and I don't remember all the intricacies.

s-vincent commented 7 years ago

I have not binaries right now. Please see one of my comment of issue #93, I added step-by-step instructions for Windows some days ago. Only addition is to install cmake package.

richman1000000 commented 7 years ago

as I mentioned in comment https://github.com/freelan-developers/freelan/issues/93#issuecomment-321739161 I think memory leak always caused by same windows machine.

To configure freelan I use sometimes WinSCP and was surprised by some troubles I've encountered long ago. There is an invisible symbol at the end of the line or something like that. Presence of this symbol at the end of the line ipv4_address_prefix_length= causes error but I do not see any instant effects of this symbol on lines signature_certificate_file= or signature_private_key_file= Recently I've used command "unix2dos freelan.cfg" on the windows machine causing problem and do not have leakage since.

Can you check effect of that symbol in ssl library?

leggewie commented 7 years ago

What @richman1000000 is describing sounds like an incompatibility in CRLF to denote end-of-line for the config file in Windows. @ereOn and @s-vincent Can freelan or one of the libraries be thrown off the rails with an improper line-ending? @bchavez Do you write your config files in Windows or Ubuntu? Can you check the format of the line-ending (file will do that on Unix).

https://stackoverflow.com/questions/1552749/difference-between-cr-lf-lf-and-cr-line-break-types

s-vincent commented 7 years ago

@leggewie, for Windows the ipv4/6_address_prefix_length parameter goes to a netsh.exe command. On Linux it goes to netlink socket.

richman1000000 commented 7 years ago

hello! Do you think freelan is affected by this parameter? /sys/kernel/mm/transparent_hugepage/enabled

s-vincent commented 6 years ago

@bchavez @richman1000000 Do you still experiment memory leak in latest 2.1 / master branch ?

richman1000000 commented 6 years ago

current version I use is freelan 2.1-1 I did not had memory leak for half of year, but I always specify on all node like this

ipv4_address_prefix_length=VPNself_ip/prefix never_contact=VPNself_ip

bchavez commented 6 years ago

TBH, I haven't had time to compile and test 2.1. :( I might revisit this next time I plan an upgrade. For now, I haven't moved off 2.0 stable (albeit with degraded performance compared to 1.0). I'm currently in a situation where I "can't/don't touch it" right now.

leggewie commented 6 years ago

I will close this issue as fixed, then. Please ping us to reopen if you experience this again. Thank you for bringing this to our attention.

bchavez commented 6 years ago

Unfortunately, I tested the latest branch. The memory leak still exists.

I'd appreciate if this issue was reopened.

Here is a video screen capture of the issue as it happened: screen_2268

A little more about the topology where this happens. I have a setup similar to this:

powerpnt_2269

The memory leak is happening on Linux A. Sometimes high CPU loads on Windows F.

Basically,

One way I've been able to reproduce the issue is:

Kinda of feel there's a threading/concurrency/network state issue going on someplace. Soon as Linux A comes online; it's hammered with connection requests from other nodes, but the other nodes on the network still "remember" previous network state/sessions before Linux A went down and are trying to push data down a non-existent session on Linux A. When Linux A comes back with a fresh slate it grinds to a halt.

The network is no longer recoverable. All nodes need to be shut down and brought back up one-by-one, starting with Linux A.

Please let me know if there's anything I can do to help provide more information. I'll try @richman1000000 's suggestion by adding never_connect=self_ip and see if that helps.

bchavez commented 6 years ago

Also, I'm thinking the Windows TAP9 adapter isn't properly cleared before starting up a new freelan.exe. Soon as I started Windows C's FreeLAN service, Linux A received:

2018-10-20T22:19:52.227225 [TRACE] Received a data message from WindowsC:12021 but its sequence number is outdated (received: 3869, expecting: 4091). Ignoring.
2018-10-20T22:19:52.227474 [TRACE] Received a data message from WindowsC:12021 but its sequence number is outdated (received: 3870, expecting: 4091). Ignoring.
2018-10-20T22:19:52.227843 [TRACE] Received a data message from WindowsC:12021 but its sequence number is outdated (received: 3876, expecting: 4091). Ignoring.

Soon as Linux A received this message, a memory leak occurred.

bchavez commented 6 years ago

I tried @richman1000000 's suggestion of using never_contact=self_ip and no luck.

I also tried perf CPU profiling tools right at the moment when high CPU usage was experienced w/ memory leak. Nothing really sticks out to me... here's a dump of freelan-debug executable (compiled with -g debug compiler switch) executed with: perf record -g ./freelan-debug -f -d -c freelan.cfg

Samples: 299K of event 'cpu-clock', Event count (approx.): 74841000000
+   4.59%  freelan-debug  [kernel.kallsyms]          [k] _raw_spin_unlock_irqrestore
+   2.72%  freelan-debug  libc-2.19.so               [.] _int_malloc
+   2.34%  freelan-debug  libc-2.19.so               [.] _int_free
+   2.27%  freelan-debug  libc-2.19.so               [.] malloc
+   1.92%  freelan-debug  [kernel.kallsyms]          [k] __do_softirq
+   1.60%  freelan-debug  [kernel.kallsyms]          [k] e1000_xmit_frame
+   1.26%  freelan-debug  [kernel.kallsyms]          [k] finish_task_switch
+   1.22%  freelan-debug  freelan-debug              [.] boost::detail::atomic_increment(int*)
+   1.19%  freelan-debug  libcrypto.so.1.0.0         [.] 0x00000000000e90e5
+   1.07%  freelan-debug  freelan-debug              [.] boost::detail::atomic_exchange_and_add(int*, int)
+   1.00%  freelan-debug  freelan-debug              [.] boost::detail::shared_count::shared_count(boost::detail::shared_count const&)
+   0.92%  freelan-debug  freelan-debug              [.] boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::vo
+   0.89%  freelan-debug  libc-2.19.so               [.] free
+   0.84%  freelan-debug  freelan-debug              [.] boost::detail::function::functor_manager<boost::function<void (boost::system::error_code)> >::manager(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail
+   0.81%  freelan-debug  [kernel.kallsyms]          [k] sys_epoll_ctl
+   0.76%  freelan-debug  freelan-debug              [.] boost::detail::shared_count::~shared_count()
+   0.72%  freelan-debug  freelan-debug              [.] boost::detail::function::functor_manager<boost::function<void (boost::system::error_code)> >::manage(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail:
+   0.67%  freelan-debug  freelan-debug              [.] boost::function1<void, boost::system::error_code const&>::clear()
+   0.63%  freelan-debug  freelan-debug              [.] boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_po
+   0.62%  freelan-debug  freelan-debug              [.] boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_po
+   0.57%  freelan-debug  freelan-debug              [.] boost::function1<void, boost::system::error_code>::clear()
+   0.53%  freelan-debug  freelan-debug              [.] boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::vo
+   0.50%  freelan-debug  freelan-debug              [.] boost::function1<void, boost::system::error_code>::assign_to_own(boost::function1<void, boost::system::error_code> const&)
Samples: 299K of event 'cpu-clock', Event count (approx.): 74841000000
-   4.59%  freelan-debug  [kernel.kallsyms]          [k] _raw_spin_unlock_irqrestore                                                                                                                                                                                          ?
   - _raw_spin_unlock_irqrestore                                                                                                                                                                                                                                              ¦
      - 89.23% mptspi_qcmd                                                                                                                                                                                                                                                    ¦
           scsi_dispatch_cmd                                                                                                                                                                                                                                                  ¦
           scsi_request_fn                                                                                                                                                                                                                                                    ¦
         + __blk_run_queue                                                                                                                                                                                                                                                    ¦
      + 2.93% try_to_wake_up                                                                                                                                                                                                                                                  ¦
      + 1.97% __wake_up_sync_key                                                                                                                                                                                                                                              ¦
      + 1.75% pagevec_lru_move_fn                                                                                                                                                                                                                                             ¦
      + 1.49% ep_poll                                                                                                                                                                                                                                                         ¦
      + 0.76% ep_scan_ready_list.isra.9                                                                                                                                                                                                                                       ¦
      + 0.71% __skb_recv_datagram                                                                                                                                                                                                                                             ¦
-   2.72%  freelan-debug  libc-2.19.so               [.] _int_malloc                                                                                                                                                                                                          ¦
     _int_malloc                                                                                                                                                                                                                                                              ¦
-   2.34%  freelan-debug  libc-2.19.so               [.] _int_free                                                                                                                                                                                                            ¦
   - _int_free                                                                                                                                                                                                                                                                ¦
      + 43.64% boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type¦
      + 26.27% boost::detail::function::functor_manager<boost::function<void (boost::system::error_code)> >::manager(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail::function::functor_manager_operation_type, boos¦
      + 6.93% boost::detail::function::functor_manager<fscp::server::async_send_to(fscp::SharedBuffer const&, unsigned long, boost::asio::ip::basic_endpoint<boost::asio::ip::udp> const&, boost::function<void (boost::system::error_code const&)>)::{lambda()#1}>::manager(b¦
      + 5.27% boost::detail::function::functor_manager<boost::asio::detail::wrapped_handler<boost::asio::io_service&, boost::function<void (boost::system::error_code const&)>, boost::asio::detail::is_continuation_delegated> >::manager(boost::detail::function::function_b¦
      + 4.24% std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_deallocate(unsigned char*, unsigned long)                                                                                                                                                  ¦
      + 2.38% boost::asio::asio_handler_deallocate(void*, unsigned long, ...)                                                                                                                                                                                                 ¦
      + 1.76% boost::detail::sp_counted_base::destroy()                                                                                                                                                                                                                       ¦
      + 1.65% boost::detail::function::functor_manager<fscp::SharedBufferHandler<void (*)(std::map<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::varian¦
      + 1.45% std::_Rb_tree<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,¦
      + 1.34% fscp::data_message::get_cleartext(void*, unsigned long, cryptoplus::cipher::cipher_algorithm, void const*, unsigned long, void const*, unsigned long) const                                                                                                     ¦
      + 1.34% boost::asio::asio_handler_allocate(unsigned long, ...)                                                                                                                                                                                                          ¦
      + 1.03% std::_List_base<fscp::SharedBuffer, std::allocator<fscp::SharedBuffer> >::_M_put_node(std::_List_node<fscp::SharedBuffer>*)                                                                                                                                     ¦
      + 0.93% boost::detail::sp_counted_impl_pd<fscp::SharedBuffer*, fscp::SharedBuffer::SharedBuffer<fscp::server::do_async_receive_from()::{lambda(fscp::SharedBuffer const&)#2}>(fscp::SharedBuffer const&, fscp::server::do_async_receive_from()::{lambda(fscp::SharedBuff¦
      + 0.72% fscp::data_message::raw_write(void*, unsigned long, unsigned int, cryptoplus::cipher::cipher_algorithm, void const*, unsigned long, void const*, unsigned long, void const*, unsigned long, fscp::message_type)                                                 ¦
      + 0.52% std::_Rb_tree<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,¦
-   2.27%  freelan-debug  libc-2.19.so               [.] malloc                                                                                                                                                                                                               ¦
   - malloc                                                                                                                                                                                                                                                                   ¦
      + 17.20% boost::detail::function::functor_manager<boost::asio::detail::wrapped_handler<boost::asio::io_service&, boost::function<void (boost::system::error_code const&)>, boost::asio::detail::is_continuation_delegated> >::manager(boost::detail::function::function_¦
      + 16.53% boost::detail::function::functor_manager<boost::function<void (boost::system::error_code)> >::manager(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail::function::functor_manager_operation_type, boos¦
      + 16.19% boost::asio::asio_handler_allocate(unsigned long, ...)                                                                                                                                                                                                         ¦
      + 11.80% boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type¦
      + 8.60% boost::detail::function::functor_manager<fscp::server::async_send_to(fscp::SharedBuffer const&, unsigned long, boost::asio::ip::basic_endpoint<boost::asio::ip::udp> const&, boost::function<void (boost::system::error_code const&)>)::{lambda()#1}>::manager(b¦
      + 6.07% std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_allocate(unsigned long)                                                                                                                                                                    ¦
      + 3.54% std::_List_base<fscp::SharedBuffer, std::allocator<fscp::SharedBuffer> >::_M_get_node()                                                                                                                                                                         ¦
      + 3.37% bool boost::detail::function::basic_vtable1<void, boost::system::error_code>::assign_to<boost::_bi::bind_t<void, boost::_mfi::mf2<void, freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port¦
      + 3.20% bool boost::detail::function::basic_vtable1<void, std::map<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::v¦
        2.70% 0x203d3c206c620063                                                                                                                                                                                                                                              ¦
      + 2.70% boost::shared_array<unsigned char>::shared_array<unsigned char>(unsigned char*)                                                                                                                                                                                 ¦
      + 2.36% fscp::server::do_handle_data(fscp::identity_store const&, boost::asio::ip::basic_endpoint<boost::asio::ip::udp> const&, fscp::data_message const&)                                                                                                              ¦
      + 2.02% std::_Rb_tree<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,¦
      + 1.18% std::_Rb_tree<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,¦
      + 0.51% fscp::server::do_async_receive_from()                                                                                                                                                                                                                           ¦
+   1.92%  freelan-debug  [kernel.kallsyms]          [k] __do_softirq                                                                                                                                                                                                         ¦
+   1.60%  freelan-debug  [kernel.kallsyms]          [k] e1000_xmit_frame                                                                                                                                                                                                     ¦
+   1.26%  freelan-debug  [kernel.kallsyms]          [k] finish_task_switch                                                                                                                                                                                                   ¦
-   1.22%  freelan-debug  freelan-debug              [.] boost::detail::atomic_increment(int*)                                                                                                                                                                                ¦
   - boost::detail::atomic_increment(int*)                                                                                                                                                                                                                                    ¦
      - 98.38% boost::detail::sp_counted_base::add_ref_copy()                                                                                                                                                                                                                 ¦
         - boost::detail::shared_count::shared_count(boost::detail::shared_count const&)                                                                                                                                                                                      ¦
            - 37.13% boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, b¦
               - 96.76% boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::deta¦
                  - 93.08% boost::_bi::storage1<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::var¦
                       boost::_bi::storage2<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant¦
                       boost::_bi::storage3<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant¦
                     + boost::_bi::list3<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::v¦
                  + 2.02% boost::_bi::storage1<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::vari¦
                  + 1.79% boost::_bi::storage2<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::vari¦
                  + 1.79% boost::_bi::list3<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant¦
                  + 1.32% boost::_bi::storage3<boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::vari¦
               + 1.88% boost::_bi::value<boost::shared_ptr<freelan::(anonymous namespace)::results_gatherer<boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detai¦
               + 1.35% freelan::switch_::async_write(boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost¦
            + 22.46% boost::shared_ptr<fscp::SharedBuffer>::shared_ptr(boost::shared_ptr<fscp::SharedBuffer> const&)                                                                                                                                                          ¦
            + 13.64% boost::shared_array<unsigned char>::shared_array(boost::shared_array<unsigned char> const&)                                                                                                                                                              ¦
            + 13.61% boost::shared_ptr<asiotap::posix_tap_adapter>::shared_ptr(boost::shared_ptr<asiotap::posix_tap_adapter> const&)                                                                                                                                          ¦
            + 5.50% boost::shared_ptr<x509_st>::shared_ptr(boost::shared_ptr<x509_st> const&)                                                                                                                                                                                 ¦
            + 4.39% boost::shared_ptr<evp_pkey_st>::shared_ptr(boost::shared_ptr<evp_pkey_st> const&)                                                                                                                                                                         ¦
            + 2.96% boost::shared_ptr<boost::asio::ip::basic_endpoint<boost::asio::ip::udp> >::shared_ptr(boost::shared_ptr<boost::asio::ip::basic_endpoint<boost::asio::ip::udp> > const&)                                                                                   ¦
      + 1.62% boost::detail::shared_count::shared_count(boost::detail::shared_count const&)                                                                                                                                           

However, I do see a significant amount of traffic hit the VPN network at the moment high CPU usage and memory leak occur:

2274

richman1000000 commented 6 years ago

Hello! also on linux you can try echo never > /sys/kernel/mm/transparent_hugepage/enabled

bchavez commented 6 years ago

Quick update, I've been making some headway tracking down this memory leak.

I have a setup that can finally reproduce the issue with full IDE debugging support on Windows. Basically, I swapped Linux A with Windows A.

Memory Heap Allocation Analysis

Below is a heap allocation snapshot from 0 seconds to 74 seconds with four heap snapshots taken along the way. You'll notice a massive heap allocation between Snapshot 3 and Snapshot 4. In a few seconds, we go from a nominal memory usage to +200MB total bytes (+166MB from Snapshot 3 to Snapshot 4).

screen_2285

Next, let's take a closer look at where the bulk of the +166MB of allocations of unsigned char[] came from between Snapshot 3 and Snapshot 4. Below is a sample instance of such an allocation and its associated call stack:

devenv_2281

The vast majority of allocations instances are coming from the call stack shown above. Prime suspects are fscp:do_send_data -> fscp:do_send_data_to_session using the SharedBuffer in some kind of attempt to send data: https://github.com/freelan-developers/freelan/blob/787f0a45b19e33cbc53d718faa6b90b7f25a8c5f/libs/fscp/src/server.cpp#L2071-L2077

https://github.com/freelan-developers/freelan/blob/787f0a45b19e33cbc53d718faa6b90b7f25a8c5f/libs/fscp/src/server.cpp#L2101-L2157

Initial Theory

As I mentioned earlier, the memory leak (or high CPU usage) usually manifests itself when FreeLAN is configured in a star topology with about 10 nodes connected through 1 central node.

chrome_2287

My gut feeling right now, after the central node restarts sessions with the other nodes on the overlay some kind of ambient state from the edge nodes interferes with the central node's ability to fscp:do_send_data_to_session. The sends are indefinitely queued inside the central node, but never cleared, and thus, resulting in a memory leak.

Key Areas for Further Investigation

If anyone has more thoughts on the subject, please let me know. :sunglasses:

Thanks, Brian

:dash: :walking: "Bubbles of gas in my brain... Send me off balance, it's not enough"

s-vincent commented 6 years ago

Hi @bchavez, thanks you very much for deep tests and investigation.

After quick look at the code, I also agree it may be related to SharedBuffer. I will try to work on that this week-end.

bchavez commented 6 years ago

Hello,

I did more testing and was able to find a reliable way to reproduce the memory leak with just 2 nodes.

However, in researching the issue, the instructions (and code) to reproduce the issue the might potentially expose a Denial of Service vulnerability in FreeLAN. Please let me know the best way to disclose the issue. I can do so publicly (by creating a new GitHub issue) or privately.

If privately:

Here is a demo of the memory leak and Denial of Service on demand and in action:

screen_2306

/cc @ereOn @s-vincent

Thanks, Brian

:beach_umbrella: :trumpet: Beach Boys - Good Vibrations (Nick Warren bootleg)

s-vincent commented 5 years ago

@bchavez I finally have some time to look at this issue. I already reproduce memory leaks with your scenario. Now time to investigate in source code.

s-vincent commented 5 years ago

Here is what I see:

My thought about that:

I decrease a bit the size of packet buffer and there is way less memory consumption.

We have two main solutions to handle this:

@ereOn is it normal to have such a high buffer size (65536) ? UDP protocol allows it but we are still limited by MTU of the interface to send it in one piece or else use IP fragmentation. But IP fragmentation is not allowed by router on the internet AFAIK (and performance is not good).

@bchavez Can you test with a SharedBuffer size of 4096 (instead of 65536) in libs/fscp/src/server.cpp ?

I would personnaly go for second proposition. At least we can let people test it in real use-cases and see if they have problems with packet size.

What do you think ?

bchavez commented 5 years ago

Hi @s-vincent ,

I'm happy you were able to reproduce the issue. Also, thanks very much for looking into the issue. You bring up some great points.

I did some testing with 4096 SharedBuffer size, and the situation seems to be greatly improved. However, IMHO, we still have a problem with monotonically increasing memory usage. In the image below, I ran three flood events.

devenv_2463

Initially, when the first flood of traffic occurs, we instantly get an increase in memory usage. The 4096 buffer memory ramp up is much less that before; which is great. However, if we sustain this level of traffic against the host, the host will continue consuming memory albeit at a slower rate. It should be noted, that we do have a reliable way to consume all memory on the target host. Additionally, the consumed memory is never reclaimed by the operating system after the floods stop.

I think we need an upper-bound limit or have some mechanism that frees these SharedBuffers after a period of time.

Also, as a quick side note, during a flood event, the processing of packets on the FreeLAN generally stops anyway. So, the consequences of blocking processing in solution No.1 may not matter much?

cmd_2464

I will continue to do more testing this weekend.

Thanks again, Brian

:fallen_leaf: :leaves: Distance - Falling (ft Alys Be)

bchavez commented 5 years ago

PS. If anyone is interested in help test the 4069 SharedBuffer changes, I have build artifacts available here:

Windows

Ubuntu

To build .deb package, make sure this commit is present in your checkout: https://github.com/bchavez/freelan/commit/81601fad7c2cf441c921b52f3941943fd7f24c06. The commit fixes the broken debian packaging system in FreeLAN.

In detail:

Building Freelan on Ubuntu:

  1. Git clone https://github.com/bchavez/freelan/
  2. Git checkout debian/unstable branch.
    • git checkout origin/debian/unstable
  3. Ensure edits (https://github.com/bchavez/freelan/commit/81601fad7c2cf441c921b52f3941943fd7f24c06) are present.
  4. Execute dpkg-buildpackage -b -rfakeroot -us -uc (or dh binary) in freelan root checkout folder. Wait a long time, freelan will build. Then out comes a cute .deb file! hey! :)

(tested on Ubuntu 14.04 and 18.04)

richman1000000 commented 5 years ago

2018-10-20T20:44:15.576079 [TRACE] Received a data message from WindowsA:12021 but no session exists. Ignoring

https://github.com/freelan-developers/freelan/issues/158#issuecomment-325555490 Previously I was writing that Freelan must not Ignore incorrect session connection, and send answer (like session ID:invalid or session ID:terminate) to client - the client will know that session is not valid anymore and terminate session data in buffer.

s-vincent commented 5 years ago

I think we have a circular references issue for the recycling mechanism.

The SharedBuffer holds a shared_ptr which have a deleter based on lambda expressions. During a time lambda expression capture a reference of the SharedBuffer. Even with limiting buffer list and shrinking when necessary memory never be reclaimed.

This link is interresting on the shared_ptr/lambda issue: https://floating.io/2017/07/lambda-shared_ptr-memory-leak/.

In 82830f2900110bf855014bf2ca3176db2c50c539 I disabled the recycling mechanism.

@bchavez Can you please test this one ?

@ereOn This stuff seems difficult to solve it with current SharedBuffer architecture. Also the effective recycle action happened in a (possibly) another strand so asynchronically which adds complexity. I will try to think about a solution but if you have something in your mind, let's share it :).

bchavez commented 5 years ago

Thanks @s-vincent , I'll test the new changes with disabled recycling.


Re: 4096 SharedBuffer

Just wanted to add an extra note on testing the 4096 SharedBuffer changes. Under a real-world deployment as described in the comment here: https://github.com/freelan-developers/freelan/issues/74#issuecomment-431638703, I bootstrapped the FreeLAN network bringing up each node one by one starting with Linux A

After the network was stable, I rebooted Linux A. Linux A experienced very high CPU usage after reboot. I could not recover network communication between any nodes. I still think there's something fundamentally incorrect under the hood as @richman1000000 alluded to. So I'm glad @s-vincent was able to identify something more fundamental that could be causing issues.


@richman1000000 , I think you have a good idea but we should also keep in mind that any host that experiences issues like this should be able to handle invalid sessions without the help or aid of other hosts resetting their buffers. With networking code, we can't assume all nodes are trustworthy. A malicious attacker can choose to ignore any "reset" messages and still cause harm to the target host.


I'll report back with results on disabled recycling shortly.

bchavez commented 5 years ago

Hi @s-vincent ,

I tested the disabled recycling mechanism and we have very good results. Excellent work Sebastien. We now have a sawtooth pattern in memory usage when flood attacks occur. Memory is now reclaimed by the operating system:

taskmgr_2473

The freelan.exe recovers to a respectable 13 MB. Peak memory usage during a flood attack reached up to 237 MB (unbounded) and memory consumption only lasted for the duration of the flood attack.


Test Environment and Topology

powerpnt_2476


Important Observations and Notes

Overall, this is a very good step in the right direction. I will continue to do more testing.

Thanks, Brian

:car: :blue_car: "Let the good times roll..."

bchavez commented 5 years ago

Okay, I did more testing with disabled recycling mechanism, but this time, with a real-world setup with 13 VPN nodes:

47263107-ae186600-d4af-11e8-9e55-19c3edc04d31

If you restart Linux A, the freelan process consumes 50%-90% CPU and will experience unbounded high memory usage. Sometimes its only high CPU usage, other times, it's high CPU usage & memory leak:

mintty_2478

mintty_2479

So, we still have a memory leak somewhere. The VPN network cannot recover after Linux A restarts. The freelan process on Linux A gets wildly out of control and Linux A becomes unresponsive to terminal input.

Notes:

I'll try to run a profiler to see if I can pick up where these exhaustive CPU cycles are being spent.

-Brian

s-vincent commented 5 years ago

Hi @bchavez,

Thanks for testing!

In f499f9a3eae3318ff38ffcb69978c74322026440 I disable recycling for socket receive buffers. This is the last place where we use recycle mechanism.

bchavez commented 5 years ago

Thanks @s-vincent,

Results are roughly the same as last recycling tests.

Real-World, 13 Node Topology Test

In the real-world test topology with 13 nodes, when Linux A reboots, we still have either: 1) High CPU usage on Linux A or 2) High CPU usage & memory leak on Linux A.

50029778-c33ddc80-ffa8-11e8-9052-dc98c9252916

Case 1) High-CPU usage

After Linux A reboot... when we have high CPU usage on Linux A, typically there is a burst of traffic on the VPN network seen on almost all nodes. However, there's usually one node on the 13 node VPN network that will sustain high "echo"/chatter traffic with Linux A. The selection of the node is non-deterministic and appears to be randomly chosen. The image below is a sample of the this VPN "echo" traffic pattern between both VPN hosts (the random node and Linux A):

meter_2484

The log for the random node engaged with high echo traffic with Linux A (192.168.0.5) is shown below:

2018-12-15T12:28:45.695044 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 143037, expecting: 143039). Ignoring.
2018-12-15T12:28:46.114068 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 144912, expecting: 144913). Ignoring.
2018-12-15T12:28:46.162071 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 149536, expecting: 149538). Ignoring.
2018-12-15T12:28:46.163071 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 149630, expecting: 149632). Ignoring.
2018-12-15T12:28:46.229075 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 156292, expecting: 156298). Ignoring.
2018-12-15T12:28:46.239075 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 157383, expecting: 157384). Ignoring.
2018-12-15T12:28:46.274077 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 161073, expecting: 161074). Ignoring.
2018-12-15T12:28:46.289078 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 162609, expecting: 162611). Ignoring.
2018-12-15T12:28:46.481089 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 180586, expecting: 180592). Ignoring.
2018-12-15T12:28:46.687101 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 199147, expecting: 199151). Ignoring.
2018-12-15T12:28:46.755105 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 205321, expecting: 205332). Ignoring.
2018-12-15T12:28:46.773106 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 206956, expecting: 206962). Ignoring.
2018-12-15T12:28:46.809108 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 209948, expecting: 209952). Ignoring.
2018-12-15T12:28:46.935115 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 220466, expecting: 220471). Ignoring.
2018-12-15T12:28:50.471317 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-15T12:28:50.471317 [DEBUG] A session already exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Not contacting again.
2018-12-15T12:29:17.517864 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 540960, expecting: 540961). Ignoring.
2018-12-15T12:29:20.472033 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-15T12:29:20.472033 [DEBUG] A session already exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Not contacting again.
2018-12-15T12:29:37.425003 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 746037, expecting: 746038). Ignoring.
2018-12-15T12:29:50.472749 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-15T12:29:50.472749 [DEBUG] A session already exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Not contacting again.
2018-12-15T12:30:02.699449 [TRACE] Received a data message from 192.168.0.5:12021 but its sequence number is outdated (received: 1009699, expecting: 1009700). Ignoring.
2018-12-15T12:30:20.473465 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-15T12:30:23.629646 [DEBUG] A session already exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Not contacting again.

Not much more to discern in this test topology

Case 2) is harder to test because high memory usage usually slows down Linux A terminal beyond usability. Also, I have some fear of losing control over Linux A host.



Isolated 4 Node Topology Test

I also have some interesting results in the following 4 node test topology. In the following test scenario, I inject several flood events on Node A (Windows) which starves the other nodes (Node 1, 2, 3) from their network packets from being processed on Node A (Windows).

50024379-0c853080-ff97-11e8-8688-55d2c1fbc764

After several flood events (and when flooding has stopped), I immediately restart the freelan.exe process on Node A (Windows) in the image above. Occasionally, I'm able to reproduce high-CPU usage on Node A (Windows). Below is a CPU process profile during persistent 40-50% CPU usage.

To clarify:

The results are reminiscent of the real-world 13-node test, case 1) when Linux A undergoes high-CPU usage only.

devenv_2486

(CPU-Profiling, click image below for larger screen and zoom) devenv_2489

Occasionally, in this 4-node topology setup, I can reproduce high-CPU usage & memory leak. Again, eerily similar to the real-world 13-node case 2 that Linux A undergoes when the terminal becomes unresponsive.

In the following image, almost 2GB of memory is consumed. This occurred with NO FLOODING whatsoever. Only a simple process restart on Node A (Windows). This memory ramp up happened on its own. The only thing I did was restart the freelan.exe process on Node A (Windows) with the debugger.

devenv_2482

I paused the debugger at the very peak of the last ramp to see what was going on under the hood. My debugger stopped in this section of code deep inside boost::bind(). The call stack was several layers deep in boost:bind() -> boost:list() -> boost:storage(). The callstack had to be unwound before I finally ended up here:

jpegview_2495

https://github.com/freelan-developers/freelan/blob/f499f9a3eae3318ff38ffcb69978c74322026440/libs/fscp/src/server.cpp#L864-L897

Notice, the this->m_write_queue { size = 20556 }. Is this of any particular interest? Perhaps, is it possible for this write_queue to get backed up and grow unbounded in the event that some IO operation that never completes?

I'm going to try to limit the write_queue size and see if this helps control the unbounded memory growth.

-Brian

s-vincent commented 5 years ago

I found a strange case when we try to relay packet from one host to itself. At that time CPU usage was a bit high.

I fix that in 57423405f69da850d10a6fbd4b4cade9b422e886.

bchavez commented 5 years ago

Hey @s-vincent , 5742340 has a build failure?

https://travis-ci.org/freelan-developers/freelan/jobs/468685610#L1232

I also have the same build failure on Ubuntu 14.04 build box.

root@test:~/freelan22# g++ --version
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
build/release/libs/freelan/src/switch.cpp: In member function ‘void freelan::switch_::async_write(freelan::port_index_type, boost::asio::const_buffer, freelan::switch_::multi_write_handler_type)’:
build/release/libs/freelan/src/switch.cpp:142:13: error: ambiguous overload for ‘operator!=’ (operand types are ‘freelan::port_index_type {aka boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type>}’ and ‘const boost::variant<freelan::null_port_index_type, freelan::tap_adapter_port_index_type, freelan::endpoint_port_index_type>’)
    if(index != target)
             ^

Compiles okay without problems on Ubuntu 18.04.

g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
s-vincent commented 5 years ago

It should be good now with latest revision. Build 445 passed on TravisCI.

bchavez commented 5 years ago

Hi @s-vincent,

We have VERY GOOD :tada: test results with the latest changes at /master 780d8ef49c1c8c9c3d8ad286b1cf6cc245c2601e. I think 5742340 fixed many issues, including the high-CPU usage and ambient memory leak after process reboots.

I tested both 13-node real-world and 4-node topology setups.

:heavy_check_mark: The "echo"/chatter traffic on VPN after process reboots is gone. :heavy_check_mark: The high-CPU usage after process reboots appears to be gone. :heavy_check_mark: The high-CPU usage & memory leak after process reboots appear to be gone. :heavy_check_mark: The VPN network seems to fully recover after a flood attack. :heavy_check_mark: The VPN network seems to fully recover after process reboots. :heavy_check_mark: Monotonically increasing memory usage is no longer observed. (We have sawtooth pattern now, see image below). :warning: During a flood attack, VPN network traffic between all nodes still stops. :warning: During a flood attack, memory usage continues to increase without limit or bound. If the flood can be sustained against a target, available memory can be depleted. (see image below)

During flood attack unbounded memory sawtooth pattern is observed

devenv_2499

I think the last two warnings can be resolved by some extra threading changes or buffer architecture changes, but overall, we're in a very good spot. This is possibly the most stable source in FreeLAN history.

Great work Sebastien! I will continue to monitor and test the stability of the latest changes in the coming weeks.

For others who wish to test: Windows Binaries: https://ci.appveyor.com/project/bchavez/freelan/build/job/qa7belfqbo90hfd8/artifacts

Ubuntu (build from source)

https://github.com/bchavez/freelan/tree/debian/unstable

Brian

:walking: :sunglasses: "So don't delay... act now supplies are running out..."