freelan-developers / freelan

The main freelan repository.
http://www.freelan.org

Possible Memory Leak in Freelan 2.0 #74

Closed bchavez closed 5 years ago

bchavez commented 9 years ago

Hi,

I downloaded Freelan 2.0 from the releases tab here on Github.

I've installed Freelan 2.0 on Windows (via x64 installer on release tab) and memory seems stable at about 95MB.

I've compiled, built, and installed Freelan 2.0 (via source on the release tab) on Ubuntu 14.04.2 LTS; everything compiles fine and runs OK.

I have five Windows nodes and one Linux node set up in a star topology. The Linux node acts as a switch/relay for the five Windows nodes.

After about 5 minutes, the VIRT memory on Linux seems to explode to several GB. Here's an htop screenshot of the memory usage:

[htop screenshot showing memory usage on the Linux node]

After about 10 minutes, the system becomes unresponsive. :(

Any idea on how to begin debugging this kind of problem?

The network is using certificate based authentication on all nodes.

Thanks, Brian

I guess I will try using passphrase auth and see if that helps .... hmm.
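One common way to begin investigating this kind of growth on a Linux node is heap profiling, for instance with Valgrind's massif tool. A rough sketch (the `-d -t 1` flags are simply the ones used later in this thread):

```sh
# Run freelan under massif to record where heap allocations accumulate
# (this slows the process down considerably):
valgrind --tool=massif ./freelan -d -t 1

# After stopping freelan, print the recorded snapshots:
ms_print massif.out.<pid>
```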

richman1000000 commented 5 years ago

> @richman1000000 , I think you have a good idea, but we should also keep in mind that any host that experiences issues like this should be able to handle invalid sessions without the aid of other hosts resetting their buffers. With networking code, we can't assume all nodes are trustworthy. A malicious attacker can choose to ignore any "reset" messages and still cause harm to the target host.
>
> I'll report back with results on disabled recycling shortly.

You can try to implement a TLS secret the way OpenVPN does it, and quite successfully: https://community.openvpn.net/openvpn/wiki/Hardening

The primary benefit is that an unauthenticated client cannot cause the same CPU/crypto load against a server as the junk traffic can be dropped much sooner. This can aid in mitigating denial-of-service attempts.
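For reference, the OpenVPN hardening linked above boils down to a pre-shared HMAC key (`tls-auth`) that lets a server drop unsigned control packets before doing any expensive crypto. Roughly (these are standard OpenVPN directives, not something freelan supports today):

```
# Generate a static HMAC key once and copy it to every peer:
#   openvpn --genkey --secret ta.key

# Server configuration:
tls-auth ta.key 0

# Client configuration:
tls-auth ta.key 1
```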

PS. In my experience, I had a client under a flood, with a DoS as the result. It is a nightmare.

s-vincent commented 5 years ago

@bchavez @richman1000000 Again, thanks for testing in a real-world deployment, we appreciate it!

I made a tentative change to limit PRESENTATION message processing to 512 per 10 seconds in revision 9c217b42d25fd35ebc66a93aba601c6837734835. The value of 512 is completely arbitrary; we may change it if it is too high (or too low).
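For readers following along, a minimal sketch of the kind of fixed-window counter such a limit implies (illustrative C++ only, not the actual code in that revision):

```cpp
#include <chrono>
#include <cstddef>

// Illustrative fixed-window rate limiter: allow at most `max_count` events
// per `window`, e.g. 512 PRESENTATION messages per 10 seconds.
class presentation_rate_limiter {
public:
    presentation_rate_limiter(std::size_t max_count, std::chrono::seconds window)
        : m_max_count(max_count),
          m_window(window),
          m_window_start(std::chrono::steady_clock::now()),
          m_count(0) {}

    // Returns true if the message may be processed, false if it should be dropped.
    bool allow() {
        const auto now = std::chrono::steady_clock::now();
        if (now - m_window_start >= m_window) {
            m_window_start = now;  // start a fresh window
            m_count = 0;
        }
        return ++m_count <= m_max_count;
    }

private:
    std::size_t m_max_count;
    std::chrono::seconds m_window;
    std::chrono::steady_clock::time_point m_window_start;
    std::size_t m_count;
};
```

Usage would be along the lines of `presentation_rate_limiter limiter(512, std::chrono::seconds(10));` with `limiter.allow()` checked for each incoming PRESENTATION message.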

bchavez commented 5 years ago

Hi @s-vincent ,

Sorry for the delay. Below are the flood-attack memory usage graphs for the 4-node test topology. Both images are measurements of the freelan.exe process on Node A (Windows). I ran two tests: 1) with a thread count of one, and 2) with a thread count of eight, set via command-line arguments.

4-node network: freelan.exe -d -t 1 / Thread Count 1

[Memory usage graph, thread count 1]

4-node network: freelan.exe -d -t 8 / Thread Count 8

[Memory usage graph, thread count 8]

Test notes and observations

In both cases,

Conclusion: the PRESENTATION filtering doesn't appear to have any effect. Maybe increasing the limiter values could have a more significant effect.

richman1000000 commented 5 years ago

> limit PRESENTATION message processing to 512 per 10 seconds

Can you place this option in freelan.cfg, so we can change it "on the fly"?

> flood attack

In my opinion, since FreeLAN is a UDP-based VPN, there is no built-in way to protect against an incoming UDP flood. Malformed UDP packets, or UDP packets signed with an attacker's certificate, will consume FreeLAN's time and memory anyway until the attacker's IP is blocked at another level.

One mechanism that I think could be used is reporting bad traffic. Say someone starts to flood a FreeLAN server, and FreeLAN logs "incorrect certificate" 100 times in 2 seconds. Say fail2ban is set up to monitor these warnings, and it reports that threat IP to the firewall. The firewall then blocks all traffic from this threat IP.
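Something along these lines could be wired into fail2ban (an untested sketch; the log message format, log path, and numbers are assumptions):

```
# /etc/fail2ban/filter.d/freelan.conf  (hypothetical filter)
[Definition]
failregex = incorrect certificate .* from <HOST>

# /etc/fail2ban/jail.d/freelan.conf  (hypothetical jail)
[freelan]
enabled  = true
filter   = freelan
logpath  = /var/log/freelan.log
protocol = udp
maxretry = 100
findtime = 2
bantime  = 3600
```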

No application can defend itself from an incoming UDP or ICMP flood. Only full IDS/IPS solutions are able to do this.

PS. I think DDoS prevention is a topic for a different issue. Here we should focus on stability.

richman1000000 commented 5 years ago

@bchavez If the packets in your test use a correct certificate/key, this is not a DDoS attack or flood. This is a "lots of traffic" scenario.

s-vincent commented 5 years ago

@richman1000000 @bchavez In the latest revision, 590ccd8f226b882501c93809bd5b6aa7585ec47b, I added per-host PRESENTATION message limits, a configurable limit in freelan.cfg, and an error message when the threshold is reached.
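Roughly, "per host" means keeping a separate window counter per remote endpoint instead of one global counter. A sketch of the idea (illustrative only; the real data structures and the freelan.cfg option name may differ):

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

// Illustrative per-host fixed-window counter, keyed by the remote endpoint
// (e.g. "192.168.0.11:12021"): each host gets its own window and count.
class per_host_limiter {
public:
    per_host_limiter(std::size_t max_count, std::chrono::seconds window)
        : m_max_count(max_count), m_window(window) {}

    // Returns true if a message from `endpoint` may be processed.
    bool allow(const std::string& endpoint) {
        const auto now = std::chrono::steady_clock::now();
        entry& e = m_entries[endpoint];  // creates a fresh entry on first sight
        if (e.count == 0 || now - e.window_start >= m_window) {
            e.window_start = now;  // start (or restart) this host's window
            e.count = 0;
        }
        return ++e.count <= m_max_count;
    }

private:
    struct entry {
        std::chrono::steady_clock::time_point window_start{};
        std::size_t count = 0;
    };

    std::size_t m_max_count;
    std::chrono::seconds m_window;
    std::map<std::string, entry> m_entries;
};
```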

bchavez commented 5 years ago

Hi @s-vincent

Test results from the 590ccd8 changes in the 4-node test topology are outlined below. I didn't see any real difference compared with our last test results.

However, I took a deeper look to gain a better understanding, and it appears that HELLO is probably the main reason our progress gets washed away every time we try to make improvements.

Here's a memory profile of different protocol message types being tested (1 thread count):

[Memory profile of the different protocol message flood types, thread count 1]

From the memory profile above, you can see that HELLO-only message floods are currently the biggest contributor to the memory ramp-ups during a flood attack.

The PRESENTATION-only limiter does an EXCELLENT job of deflecting the PRESENTATION floods. As you can see in the image above, a PRESENTATION-only flood causes virtually no memory usage with the default limiter value. So our work so far has brought significant improvements, but only for PRESENTATION flood events.

Again, as with the last test results, and in these isolated HELLO / PRESENTATION tests,

s-vincent commented 5 years ago
> Perhaps we also need to have a limiter on `HELLO`? IIRC, I think `HELLO` and `PRESENTATION` are the only unauthenticated messages that are in the **FSCP** protocol?

@bchavez Sounds right. Revision 64ab23ec1541017010d6b075a330c29de3912f35 addresses that.

bchavez commented 5 years ago

Hi @s-vincent ,

The results for the 4-node topology tests are very good.

[Memory usage graph, 4-node topology tests]

I no longer have a way to consume all available memory resources on a target node. Excellent work! :tada: :)

The only thing I can cause with floods is a temporary suspension of the processing of VPN traffic from other nodes for the duration of the flood.

[Command prompt screenshot]

As I mentioned previously, I suspect mitigating this kind of thread-bound denial of service on VPN packets would probably come down to some threading changes, or maybe some QoS that prioritizes existing authenticated session traffic over unauthenticated traffic.
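To illustrate the QoS idea only (a sketch, not freelan's actual architecture): drain traffic from peers with an established session before spending any of the receive budget on unauthenticated packets.

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

struct packet {
    std::vector<unsigned char> data;
};

// Illustrative two-queue dispatcher: traffic from peers with an established
// session is always served before unauthenticated (HELLO/PRESENTATION)
// traffic, so a flood of unauthenticated packets cannot starve real peers.
class prioritized_receiver {
public:
    void enqueue(packet p, bool has_session) {
        (has_session ? m_session_queue : m_unauth_queue).push_back(std::move(p));
    }

    // Process up to `budget` packets per poll, session traffic first.
    template <typename Handler>
    void poll(std::size_t budget, Handler handle) {
        while (budget > 0 && !m_session_queue.empty()) {
            handle(m_session_queue.front());
            m_session_queue.pop_front();
            --budget;
        }
        while (budget > 0 && !m_unauth_queue.empty()) {
            handle(m_unauth_queue.front());
            m_unauth_queue.pop_front();
            --budget;
        }
    }

private:
    std::deque<packet> m_session_queue;
    std::deque<packet> m_unauth_queue;
};
```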

But overall, I think our memory leak has been resolved. Next, I will deploy to the real-world 13-node setup and continue testing. The situation is looking very promising!

bchavez commented 5 years ago

It seems we can get into a situation where, if a node is restarted a few times via service freelan stop / service freelan start, the node can be blacklisted indefinitely.

In the real-world 13-node topology, on the Linux A relay, the log messages read:

2018-12-18T19:38:05.183898 [WARNING] Received too many HELLO messages from 192.168.0.11:12021, limit is 10 messages per 10 seconds
2018-12-18T19:38:35.169356 [WARNING] Received too many HELLO messages from 192.168.0.11:12021, limit is 10 messages per 10 seconds
2018-12-18T19:39:05.154515 [WARNING] Received too many HELLO messages from 192.168.0.11:12021, limit is 10 messages per 10 seconds
2018-12-18T19:39:35.139681 [WARNING] Received too many HELLO messages from 192.168.0.11:12021, limit is 10 messages per 10 seconds

Notice that the log messages are 30 seconds apart. The logs say the limit is 10 messages per 10 seconds. It would seem Linux A's expiration timer for 192.168.0.11 should have reset after 30 seconds passed, because HELLO messages are normally sent from 192.168.0.11 only every 30 seconds.

The log messages on 192.168.0.11:

2018-12-18T19:52:11.765321 [DEBUG] Sending HELLO to 192.168.0.5:12021
2018-12-18T19:52:14.765546 [DEBUG] Received no HELLO_RESPONSE from 192.168.0.5:12021 at 192.168.0.5:12021: No HELLO response received (timeout: 00:00:03.000079)
2018-12-18T19:52:41.765386 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-18T19:52:41.765534 [DEBUG] No session exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Contacting...
2018-12-18T19:52:41.765548 [DEBUG] Sending HELLO to 192.168.0.5:12021
2018-12-18T19:52:44.765801 [DEBUG] Received no HELLO_RESPONSE from 192.168.0.5:12021 at 192.168.0.5:12021: No HELLO response received (timeout: 00:00:03.000086)
2018-12-18T19:53:11.765638 [DEBUG] Resolving 192.168.0.5:12021 for potential contact...
2018-12-18T19:53:11.765771 [DEBUG] No session exists with 192.168.0.5:12021 (at 192.168.0.5:12021). Contacting...
2018-12-18T19:53:11.765784 [DEBUG] Sending HELLO to 192.168.0.5:12021
2018-12-18T19:53:14.766000 [DEBUG] Received no HELLO_RESPONSE from 192.168.0.5:12021 at 192.168.0.5:12021: No HELLO response received (timeout: 00:00:03.000078)

Are we sure the expiration time is calculated correctly? There is no flooding in this real-world test.

Additionally, I performed the following:

And same results:

2018-12-18T20:05:08.262746 [WARNING] Received too many HELLO messages from 192.168.0.11:12021, limit is 10 messages per 10 seconds
2018-12-18T20:17:45.655033 [DEBUG] Sending HELLO to 192.168.0.5:12021
2018-12-18T20:17:48.655585 [DEBUG] Received no HELLO_RESPONSE from 192.168.0.5:12021 at 192.168.0.5:12021: No HELLO response received (timeout: 00:00:03.000123)

Also, I waited 38 minutes and still got the same results for 192.168.0.11.

s-vincent commented 5 years ago

Oh, my bad, I forgot to launch the HELLO message reset timer :/.

Fixed in 42ffdf03995228849985f278a85eb6059c7abb7b.
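For context, the usual shape of such a reset timer with Boost.Asio (which freelan is built on) looks roughly like the sketch below: if the timer is never armed, the per-host counters only ever grow and a peer stays over the limit forever. Names here are illustrative, not the actual fix.

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <functional>

// Illustrative pattern: a recurring timer that clears the per-host HELLO
// counters at the end of every window. If this timer is never started, the
// counters keep accumulating and a restarted peer remains blacklisted.
void schedule_hello_counter_reset(boost::asio::steady_timer& timer,
                                  std::function<void()> clear_counters) {
    timer.expires_after(std::chrono::seconds(10));
    timer.async_wait([&timer, clear_counters](const boost::system::error_code& ec) {
        if (ec) {
            return;  // timer was cancelled or an error occurred
        }
        clear_counters();  // wipe all per-host HELLO counts for the new window
        schedule_hello_counter_reset(timer, clear_counters);  // re-arm
    });
}
```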

s-vincent commented 5 years ago

@bchavez

> But overall, I think our memory leak has been resolved.

I'll close the ticket. Feel free to reopen it if needed.

bchavez commented 5 years ago

So far so good :+1:. The networks have been very stable. :)