adrienverge / openfortivpn

Client for PPP+TLS VPN tunnel services
GNU General Public License v3.0

Memory leak #540

Open. dovi5988 opened this issue 4 years ago

dovi5988 commented 4 years ago

Hi,

I have one host with multiple sessions out. It seems there is some sort of memory leak. dmesg is showing:

[5112108.572250] [ 2381]  1002  2381     5956       37      13        0             0 perl
[5112108.577103] Out of memory: Kill process 24130 (openfortivpn) score 295 or sacrifice child
[5112108.581677] Killed process 24130 (openfortivpn), UID 0, total-vm:1245576kB, anon-rss:884460kB, file-rss:0kB, shmem-rss:0kB

How would I go about troubleshooting this? Are there any logs I can collect and post here to debug further?

DimitriPapadopoulos commented 4 years ago

I don't know much about the OOM killer of the Linux kernel. As far as I understand, your machine is out of memory and the kernel selects a process to kill in an attempt to recover. Which exact part of the log suggests a memory leak in openfortivpn? Perhaps total-vm:1245576kB, which seems to imply that openfortivpn is consuming > 1 GB of RAM? Also, an oom_score of 295 is pretty bad as far as I can understand.

Which exact Linux distribution is this? Which exact version of the kernel?

DimitriPapadopoulos commented 4 years ago

Also how much RAM do you have on your system? How many openfortivpn processes do you run in parallel and how much memory do they consume?

DimitriPapadopoulos commented 4 years ago

See for example:

dovi5988 commented 4 years ago

Sorry for not updating sooner. The box is a VPS running on DigitalOcean. I have two public IPs on the Fortinet device and a connection to each one. The VPS has 3 GB of RAM. top shows each connection using 13% of RAM:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
25264 root      20   0  737748 383988   2828 S   0.0 13.2  10:14.85 openfortivpn
25301 root      20   0  728668 383196   2828 S   0.0 13.1  12:59.74 openfortivpn

The box shows the actual total as 2846:

[root@noc2 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           2846        1515         250         152        1080        1022
Swap:             0           0           0

So 2846 MB * 0.13 is about 370 MB. Does it make sense that each one is using 370 MB?

I am going to watch both processes and see their usage over the next week.

DimitriPapadopoulos commented 4 years ago

I don't see openfortivpn using that much memory on my Ubuntu 18.04 workstation:

top - 00:38:45 up  4:27,  1 user,  load average: 0.20, 0.31, 0.27
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.8 us,  0.8 sy,  0.0 ni, 96.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16353120 total,  9862408 free,  2351852 used,  4138860 buff/cache
KiB Swap:  1003516 total,  1003516 free,        0 used. 13506796 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24784 root      20   0  402096   6536   5208 S   0.7  0.0   0:06.68 openfortivpn

The resident size RES is especially worrying in your case (383988) compared to mine (6536).

Which Linux distribution is this? Have you built and installed openfortivpn from source? Or used a package bundled with your distribution?

What about the parameters passed to the openfortivpn command and/or the contents of the openfortivpn configuration file?

DimitriPapadopoulos commented 4 years ago

I really don't see the RES value changing over time here, whatever I do: downloading/uploading a large file, remote graphical session, etc.

top - 10:20:41 up  2:30,  1 user,  load average: 1.11, 0.42, 0.20
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.2 us, 14.0 sy,  0.0 ni, 68.4 id,  0.6 wa,  0.0 hi,  1.8 si,  0.0 st
KiB Mem : 16353128 total, 11464812 free,  2454216 used,  2434100 buff/cache
KiB Swap:  1003516 total,  1003516 free,        0 used. 13371552 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5367 root      20   0  402096   6500   5172 S  87.0  0.0   1:10.73 openfortivpn

On the other hand, I wouldn't expect memory leaks during "usual" operation; they would have already been reported. Is there anything special in your setup? Have you tried running a single instance of openfortivpn?

dovi5988 commented 4 years ago

Hi,

I did not see anything further in dmesg. I did create systemd scripts to start openfortivpn. Currently when I check the status for one of them I see:

Feb 13 14:29:52 noc2 openfortivpn[25644]: INFO: Remote gateway has allocated a VPN.
Feb 13 14:29:53 noc2 openfortivpn[25644]: ERROR: forkpty: Cannot allocate memory
Feb 13 14:29:53 noc2 openfortivpn[25644]: INFO: Closed connection to gateway.
Feb 13 14:29:53 noc2 openfortivpn[25644]: INFO: Logged out.
Feb 13 14:30:53 noc2 openfortivpn[25644]: INFO: Connected to gateway.
Feb 13 14:30:53 noc2 openfortivpn[25644]: INFO: Authenticated.
Feb 13 14:30:53 noc2 openfortivpn[25644]: INFO: Remote gateway has allocated a VPN.
Feb 13 14:30:53 noc2 openfortivpn[25644]: ERROR: forkpty: Cannot allocate memory
Feb 13 14:30:53 noc2 openfortivpn[25644]: INFO: Closed connection to gateway.
Feb 13 14:30:53 noc2 openfortivpn[25644]: INFO: Logged out.

If I then kill the pid for openfortivpn it will start again and come up. The box has more than enough free memory.

[root@noc2 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           2846        1939         693         152         213         646
Swap:             0           0           0
[root@noc2 ~]#

dovi5988 commented 4 years ago

I also think I found another possible issue. My ugly hack to get around this issue was to write a bash script like this:

HE_TUNN=$(/usr/sbin/ip route | grep XX.XX.XX.XX | wc -l)
if (( $HE_TUNN < 1 )); then
    /usr/bin/date
    /usr/bin/echo "starting HE VPN"
    /usr/bin/echo "HE TUN IS $HE_TUNN"
    systemctl restart ssl_vpn_he
    sleep 10
fi

The problem is that even when the link was down, the route was still in the system, so my script was worthless. If the status is logged out, why would I still see a route?

DimitriPapadopoulos commented 4 years ago

This doesn't look good:

Feb 13 14:30:53 noc2 openfortivpn[25644]: ERROR: forkpty: Cannot allocate memory

I really don't know. There's a memory leak somewhere, probably not in openfortivpn. You may get some inspiration from pages such as:

For example you may:

Not sure I understand your last message about your scripts and the routes, but my guess is that openfortivpn cannot clean up if it is killed abruptly by the system. It's best to let systemd restart openfortivpn for you instead of relying on your script, which is indeed a hack :-)
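For reference, a minimal unit in that spirit could look like this (the unit contents below are a placeholder sketch, not taken from this thread; the restart policy is just a suggestion):

[Unit]
Description=openfortivpn tunnel (example)
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/openfortivpn XX.XX.XX.XX:12443 -c /etc/openfortivpn/config
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target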

DimitriPapadopoulos commented 4 years ago

Which exact Linux distribution are you running?

dovi5988 commented 4 years ago

[root@noc2 ~]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[root@noc2 ~]#

DimitriPapadopoulos commented 4 years ago

The Linux distribution is recent, nothing suspect here.

Now openfortivpn does use more memory on your server than on my own machine. Perhaps this is related to the double connection to the two public IP addresses of the FortiGate device. Apart from the memory problem, do you see any other issues with openfortivpn (slowness, lost packets)? For testing purposes, could you limit yourself to a single connection to the FortiGate device and see if it helps in any way? Then could you show the output of route before starting openfortivpn, after starting the first instance of openfortivpn and after starting the second instance?

Finally, for debugging purposes it would help if you could put yourself in the simplest possible configuration:

You may obfuscate logs and such, but if you do so please obfuscate in a clear and consistent way. For example, make sure we can see the two different (though obfuscated) public IP addresses of the FortiGate device, and keep these addresses consistent across posts.

Finally I suggest you also investigate possible memory leaks and issues elsewhere:

dovi5988 commented 4 years ago

I am sorry for the delayed response, it's been busy here. While working on an unrelated issue I found in the logs:

Feb 23 20:17:47 noc2 openfortivpn: ERROR: socket: Too many open files
Feb 23 20:17:47 noc2 openfortivpn: INFO: Closed connection to gateway.
Feb 23 20:17:47 noc2 openfortivpn: ERROR: socket: Too many open files
Feb 23 20:17:47 noc2 openfortivpn: INFO: Could not log out.

[root@noc2 objects]# cat /proc/19653/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             11198                11198                processes
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       11198                11198                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

When I look for a count of open fds I get:

[root@noc2 29258]# lsof -p 32263 | grep CLOSE_WAIT | wc -l
1021
[root@noc2 29258]#

More precisely, I see this over and over:

openforti 32263 root 1019u IPv4 330196028 0t0 TCP noc2.example.org:37380->69.110.110.93:11542 (CLOSE_WAIT)
openforti 32263 root 1020u IPv4 330199239 0t0 TCP noc2.example.org:37428->69.110.110.93:11542 (CLOSE_WAIT)
openforti 32263 root 1021u IPv4 330202288 0t0 TCP noc2.example.org:37480->69.110.110.93:11542 (CLOSE_WAIT)
openforti 32263 root 1022u IPv4 330204070 0t0 TCP noc2.example.org:37534->69.110.110.93:11542 (CLOSE_WAIT)
openforti 32263 root 1023u IPv4 330206039 0t0 TCP noc2.example.org:37590->69.110.110.93:11542 (CLOSE_WAIT)

I can write a script that looks for openfortivpn PIDs with more than, say, 900 fds so that systemd will re-launch them, but if there is a bug I would prefer to fix it.
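For reference, a minimal C sketch of such a check (a hypothetical standalone tool, not part of openfortivpn) that counts the entries in /proc/<pid>/fd, which is what lsof enumerates:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* fdcount: print the number of open file descriptors of a process */
    int main(int argc, char **argv)
    {
        char path[64];
        struct dirent *entry;
        DIR *dir;
        unsigned count = 0;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return EXIT_FAILURE;
        }
        snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);
        dir = opendir(path);
        if (dir == NULL) {
            perror("opendir");
            return EXIT_FAILURE;
        }
        while ((entry = readdir(dir)) != NULL)
            if (entry->d_name[0] != '.')    /* skip "." and ".." */
                count++;
        closedir(dir);
        printf("%u\n", count);
        return EXIT_SUCCESS;
    }

Filtering for CLOSE_WAIT sockets specifically would still need lsof or /proc/net/tcp.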

DimitriPapadopoulos commented 4 years ago

Interesting. So the root cause would be a socket leak in openfortivpn. I'm not certain the socket leak can explain the memory leak you also experience. Are these different issues or the same issue?

Just wondering why this had never been noticed before.

DimitriPapadopoulos commented 4 years ago

Does this happen with a single instance of openfortivpn running? Or with two instances of openfortivpn only?

dovi5988 commented 4 years ago

It's currently happening when we have multiple instances. What's strange is that it hits all of them over time: today it will hit one and tomorrow it will hit another. From what I recall, we only had one running when we started having these issues, which is why we now have three of them.

I don't know if the memory leak errors occur because there are no fds left (so it sees that it's "out of memory"), or the memory leak is causing the sockets to leak, or they are one and the same. We can run with only one session up if you would like. I can then write something that checks every minute how many files are open for the PID.

DimitriPapadopoulos commented 4 years ago

Another thing you could do is add some debugging code where sockets are opened/closed:

    close(tunnel->ssl_socket)
    handle = socket(AF_INET, SOCK_STREAM, 0)
    close(handle)
    close(tunnel->ssl_socket)

BTW it occurs to me that we don't check the return value of close(), but I don't know whether we should expect the function to fail often on a socket. I can add some code to at least print a warning if close() fails.
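A minimal sketch of such a check (the helper name close_or_warn is hypothetical):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical helper: close a descriptor and log a warning on
     * failure, so failing close() calls become visible in the logs. */
    static void close_or_warn(int fd, const char *what)
    {
        if (close(fd) == -1)
            fprintf(stderr, "WARN: close(%s): %s\n", what, strerror(errno));
    }

It could then be called as close_or_warn(tunnel->ssl_socket, "ssl_socket") at the spots listed above.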

Also it would help to know the options passed to openfortivpn. Are you passing the --persistent option for example?

dovi5988 commented 4 years ago

Dimitri,

This is what I am using: /usr/local/bin/openfortivpn XX.XX.XX.XX:12443 -u dovid -p --trusted-cert --persistent 60
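For reference, the same options could go into a config file so the password is not on the command line (the path below is hypothetical, the placeholder values are not from this thread, and this assumes the long options double as config keys, which most do):

# /etc/openfortivpn/he.conf (hypothetical path)
host = XX.XX.XX.XX
port = 12443
username = dovid
password = <secret>
trusted-cert = <digest>
persistent = 60

invoked as /usr/local/bin/openfortivpn -c /etc/openfortivpn/he.conf.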

DimitriPapadopoulos commented 4 years ago

I would try without --persistent - but then it might be difficult to reproduce the problem if openfortivpn goes down after a few hours. Could you perhaps monitor the number of sockets with/without --persistent?

dovi5988 commented 4 years ago

Dimitri,

No problem. I wrote a script that checks how many connections in CLOSE_WAIT each openfortivpn process has.

I was reading over https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/ and decided to run netstat -anp to show the PID. What I am seeing is:

tcp 1 0 y.y.y.y:42124 x.x.x.x:11542 CLOSE_WAIT 5763/pppd
tcp 1 0 y.y.y.y:46780 x.x.x.x:11542 CLOSE_WAIT 7458/pppd
tcp 1 0 y.y.y.y:43604 x.x.x.x:11542 CLOSE_WAIT 7458/pppd
tcp 1 0 y.y.y.y:60180 x.x.x.x:11542 CLOSE_WAIT 7505/pppd
tcp 1 0 y.y.y.y:41730 x.x.x.x:11542 CLOSE_WAIT 5763/pppd
tcp 1 0 y.y.y.y:59662 x.x.x.x:11542 CLOSE_WAIT 5763/pppd

Is the bug in pppd rather than openfortivpn?

dovi5988 commented 4 years ago

It seems like there is a leak every 6 minutes. Below is the output of my log. I am going to remove the --persistent flag and see how it goes. From what I recall I needed it because without it, the Fortinet device would end the connection after 8 hours and the systemd unit would just sit there (since openfortivpn never actually exited).

[2020-02-26 05:50:02][25594] Count for 25594 is 90 so we will do nothing for now
[2020-02-26 05:51:01][25503] Count for 25503 is 91 so we will do nothing for now
[2020-02-26 05:51:01][25559] Count for 25559 is 90 so we will do nothing for now
[2020-02-26 05:51:01][25594] Count for 25594 is 90 so we will do nothing for now
(...)
[2020-02-26 06:22:01][25503] Count for 25503 is 96 so we will do nothing for now
[2020-02-26 06:22:01][25559] Count for 25559 is 93 so we will do nothing for now
[2020-02-26 06:22:01][25594] Count for 25594 is 94 so we will do nothing for now
(...)
[2020-02-26 07:09:01][25503] Count for 25503 is 100 so we will do nothing for now
[2020-02-26 07:09:01][25559] Count for 25559 is 98 so we will do nothing for now
[2020-02-26 07:09:01][25594] Count for 25594 is 99 so we will do nothing for now
[2020-02-26 07:10:01][25503] There are more then 100 entries for root 25503 0.1 3.1 438516 92108 ? Ssl Feb25 1:20 /usr/local/bin/openfortivpn z.z.z.z 12443 -u -p --trusted-cert --persistent 60. We are going to kill pid >>>25503<<<
(...)
[2020-02-26 07:23:01][25594] There are more then 100 entries for root 25594 0.1 3.1 438516 92100 ? Ssl Feb25 1:17 /usr/local/bin/openfortivpn x.x.x.x 12443 -u -p --trusted-cert --persistent 59. We are going to kill pid >>>25594<<<
(...)
[2020-02-26 07:29:01][25559] There are more then 100 entries for root 25559 0.1 3.1 429432 91296 ? Ss Feb25 1:20 /usr/local/bin/openfortivpn y.y.y.y 12443 -u -p --trusted-cert --persistent 60. We are going to kill pid >>>25559<<<

DimitriPapadopoulos commented 4 years ago

Indeed openfortivpn execs pppd and at first sight the socket leak could be related to pppd.

Also, I was suggesting to remove --persistent as a mere test, not as a permanent solution. No worries here.

Finally, I have noted why you're using --persistent - which I cannot use at my place because we have a token that generates one-time passwords. It would be interesting to have openfortivpn detect when the Fortinet device ends the connection, either explicitly if the Fortinet device actually reports this event, or implicitly by having openfortivpn monitor the connection and exit (after a time-out) if the connection seems dead. Then you could have systemd restart openfortivpn when needed.

DimitriPapadopoulos commented 4 years ago

On second thought, what I don't understand is that if openfortivpn is able to restart a connection when needed with --persistent, it should also be able to exit in similar circumstances without --persistent. Isn't that what you experience? If not we need to investigate that...

mrbaseman commented 4 years ago

To me it looks as if we have several different scenarios happening here:

dovi5988 commented 4 years ago

@DimitriPapadopoulos We use --persistent because of the 8 hour maximum forced by the Fortinet. If we don't use it, when the connection is forcibly closed by the Fortinet the process doesn't actually stop, so systemd does not restart it.

@mrbaseman Traffic for one tunnel is not routed over another, as the routes we get are specific routes behind the Fortinet and not 0.0.0.0.

I will run openfortivpn in a screen session with --persistent, log it with tee, and see what happens after a period of time.

EDIT1: It's leaking without systemd. Looking at the output I see:

WARN: You should not pass the password on the command line. Type it interactively or use a config file instead.
INFO: Connected to gateway.
INFO: Authenticated.
INFO: Remote gateway has allocated a VPN.
INFO: Got addresses: [10.174.175.2], ns [0.0.0.0, 0.0.0.0]
INFO: negotiation complete
INFO: negotiation complete
INFO: Interface ppp1 is UP.
INFO: Setting new routes...
WARN: Route to vpn server exists already.
INFO: Adding VPN nameservers...
INFO: Tunnel is up and running.
INFO: Cancelling threads...
INFO: Setting ppp interface down.
INFO: Restoring routes...
INFO: Removing VPN nameservers...
INFO: pppd: The link was terminated by the modem hanging up.
INFO: Terminated pppd.
INFO: Closed connection to gateway.
INFO: Logged out.
INFO: Connected to gateway.
INFO: Authenticated.
INFO: Remote gateway has allocated a VPN.
INFO: Got addresses: [10.174.175.2], ns [0.0.0.0, 0.0.0.0]
INFO: negotiation complete
INFO: negotiation complete
INFO: Interface ppp1 is UP.
INFO: Setting new routes...
WARN: Route to vpn server exists already.
INFO: Adding VPN nameservers...
INFO: Tunnel is up and running.

There are no timestamps so I can't say for sure, but it seems the leak happens after the logout.

DimitriPapadopoulos commented 4 years ago

Concerning the need for --persistent:

I don't know if it is possible to natively detect the FortiGate time-out (8 hours) on the client side - without actively monitoring the connection using pings, for example. It would be nice to get openfortivpn to exit - I cannot recall whether FortiClient exits in that case. But wait... openfortivpn is able to detect that the connection has closed, since it restarts the loop. Not sure why openfortivpn would restart the loop with --persistent but not exit without --persistent. Are you certain that openfortivpn does not exit and that it is not a systemd configuration issue?

If you have a look at the loop, clearly if the loop restarts with --persistent, then without --persistent openfortivpn should exit:

    do {
        if (run_tunnel(&cfg) != 0)      /* one complete connect/disconnect cycle */
            ret = EXIT_FAILURE;
        else
            ret = EXIT_SUCCESS;
        if ((cfg.persistent > 0) && (get_sig_received() == 0))
            sleep(cfg.persistent);      /* pause before reconnecting */
    } while ((get_sig_received() == 0) && (cfg.persistent != 0));
    /* without --persistent, cfg.persistent is 0 and the loop runs exactly once */

dovi5988 commented 4 years ago

@DimitriPapadopoulos I am checking now. From the logs, without --persistent the process seems to exit with:

DEBUG: Error reading from SSL connection (Protocol violation with EOF).
INFO: Cancelling threads...
DEBUG: disconnecting
INFO: Setting ppp interface down.
INFO: Restoring routes...
DEBUG: Route to vpn server was not added
INFO: Removing VPN nameservers...
DEBUG: Waiting for pppd to exit...
DEBUG: waitpid: pppd exit status code 16
INFO: pppd: The link was terminated by the modem hanging up.
INFO: Terminated pppd.
INFO: Closed connection to gateway.
DEBUG: server_addr: X.X.X.X
DEBUG: server_port: 11443
DEBUG: gateway_addr: X.X.X.X
DEBUG: gateway_port: 11443
DEBUG: Setting cipher list to: HIGH:!aNULL:!kRSA:!PSK:!SRP:!MD5:!RC4
DEBUG: Gateway certificate validation failed.
DEBUG: Gateway certificate digest found in white list.
DEBUG: http_send:
GET /remote/logout HTTP/1.1
Host: X.X.X.X:10443
User-Agent: Mozilla/5.0 SV1
Accept: text/plain
Accept-Encoding: identity
Content-Type: application/x-www-form-urlencoded
Cookie: SVPNCOOKIE=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Content-Length: 0

DEBUG: http_receive:
HTTP/1.1 200 OK
Date: Mon, 02 Mar 2020 15:16:59 GMT
Server: xxxxxxxx-xxxxx
Set-Cookie: SVPNCOOKIE=; path=/; expires=Sun, 11 Mar 1984 12:00:00 GMT; secure; httponly;
Set-Cookie: SVPNNETWORKCOOKIE=; path=/remote/network; expires=Sun, 11 Mar 1984 12:00:00 GMT; secure; httponly
Content-Length: 643
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Content-Security-Policy: frame-ancestors 'self'; object-src 'none'; script-src 'self' https 'unsafe-eval' 'unsafe-inline';
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=31536000

INFO: Logged out.

Is that expected behavior? It seems this happens every 6 minutes.

When using systemd WITHOUT --persistent it seems to exit correctly and restart. (I am not sure why I didn't see this in the past.) I can simply use systemd without --persistent. My concern is threefold:

1) For others that don't have redundant links.
2) systemd can take 1-3 seconds to restart. If one IP is offline we lose connectivity.
3) We have three connections: two fiber and one cellular as backup. When the fiber tunnels restart, all traffic goes through the cellular one, increasing latency.

DimitriPapadopoulos commented 4 years ago

Oh, I'm not saying you shouldn't use --persistent, of course you can. Just wanted to make sure openfortivpn exits gracefully when the FortiGate appliance closes the connection - hence --persistent is not an absolute necessity.

If I understand you correctly, you see openfortivpn exiting after 6 minutes without --persistent, and apparently cycling a new loop every 6 minutes with --persistent. Is that correct? That is of course not expected. I am not as familiar with openfortivpn and FortiGate devices as @mrbaseman, but I couldn't help noticing these dates that look suspicious to me:

Set-Cookie: SVPNCOOKIE=; path=/; expires=Sun, 11 Mar 1984 12:00:00 GMT; secure; httponly;
Set-Cookie: SVPNNETWORKCOOKIE=; path=/remote/network; expires=Sun, 11 Mar 1984 12:00:00 GMT; secure; httponly
<script>function fgt_clear_cookies() {var cookies = document.cookie.split(';');var domain = document.domain.split('.');var new_date='; expires=Thu, 21 Sep 1979 00:00:01 UTC';

Can you run date on your system? And the equivalent command on the FortiGate appliance?

Finally, about the memory/socket leak itself: chances are it is in pppd, and probably not considered a problem there since it happens when exiting. And it's true that in a "normal" use case it wouldn't be much of a problem, except that here we restart pppd every 6 minutes. I don't mean to say the leak should not be fixed. It's just that:

mrbaseman commented 4 years ago

The dates look indeed suspicious. At first I thought it was the invalidation of a cookie previously set, but for that the timestamp should be "epoch 0", i.e. 1970, not 1979 or 1984.

mrbaseman commented 4 years ago

The idea of the --persistent option is to reopen the connection if possible, i.e. no interaction such as OTP is necessary, each time the remote side closes it. If systemd is also watching the process and restarting it, the two mechanisms are competing. From what I understand now:

That could be a problem with the threads that openfortivpn uses to communicate with pppd. Maybe the cleanup at the end of the io_loop does not work properly, but everything gets cleaned up by the OS when the whole process exits.
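A minimal runnable demo of the teardown order such a cleanup has to respect (made-up names, not the actual io_loop code): join the threads first, only then destroy the shared mutex, then drain whatever is still queued. With --persistent, anything skipped at this point accumulates on every reconnect cycle:

    #include <pthread.h>
    #include <stdlib.h>

    struct packet {
        struct packet *next;
    };

    static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
    static struct packet *queue;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&queue_mutex);
        /* ... move packets in or out of the queue ... */
        pthread_mutex_unlock(&queue_mutex);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);

        pthread_join(t1, NULL);         /* join before touching shared state */
        pthread_join(t2, NULL);

        pthread_mutex_destroy(&queue_mutex);  /* safe: no thread holds it now */

        while (queue != NULL) {         /* free packets still queued */
            struct packet *next = queue->next;
            free(queue);
            queue = next;
        }
        return 0;
    }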

dovi5988 commented 4 years ago

Sorry for the sporadic responses.

1) Both the time on the server and on the Fortinet are correct and current.
2) Using systemd without --persistent seems to be working: openfortivpn dies, systemd detects it and starts it up again. I don't like that we keep re-connecting in case of an issue, but it's not the end of the world. I have not figured out how to get systemd to prefix the logs with the PID. Because we have three tunnels, looking at the logs does not help: when I see a restart I don't know which connection it's for. Overall it does seem to die about every 5 minutes.
3) When using openfortivpn with --persistent, it seems to loop and have that memory leak.

I am OK with how I have it but if there is a bug somewhere (and it seems like there is) I want to try to help solve the issue.

m33m33 commented 4 years ago

Forking this repo to investigate this issue. I suspect something with thread management in io.c.

m33m33 commented 4 years ago

I have found only false positives so far, and a possible problem related to this issue #540.

My test cases are:

A. 2 users, 1 gateway
B. 2 users, 2 gateways (1-1)
C. 1 user, 2 gateways (1-2)

My current build flags are:

make CFLAGS="$CFLAGS -fpie -fPIE -fsanitize=thread -fstack-protector-all -pthread -Wall -O2 -ggdb3 -fno-omit-frame-pointer" \
     CXXFLAGS="$CXXFLAGS -fpie -fPIE -fsanitize=thread -fstack-protector-all -pthread -Wall -O2 -ggdb3 -fno-omit-frame-pointer"

1. Added return value verification for pthread_ calls in io.c.

You get the expected error messages when one thread decides to quit (it is not cancellable anymore by the main thread, but can be properly joined). At least it shows now in debug mode; I suggest keeping these checks for now.

You can try it by simulating a network issue with route add -host ip_of_your_vpn_srv reject and watching the openfortivpn -v -v ... logs:

(...) tunnel was up, simulated network issue
No response to 4 echo-requests
Serial link appears to be disconnected.
Connect time 2.5 minutes.
Sent 1197 bytes, received 3275 bytes.
DEBUG:  pppd ---> gateway (25 bytes)
DEBUG:  pppd ---> gateway (25 bytes)
Connection terminated.
Modem hangup
ERROR:  read: Input/output error
INFO:   Cancelling threads...
DEBUG:  Error canceling if_config_thread: No such process
DEBUG:  Error canceling pty_read_thread: No such process
INFO:   Cleanup, joining threads...

2. Added -fsanitize=leak compiler flag:

The LeakSanitizer is somewhat overwhelmed by the packets being pushed and popped from different threads, with alternating malloc and free. It warns about a heap-use-after-free, but it looks normal to me.

(...) starting openfortivpn, connecting...
Connect: ppp1 <--> /dev/pts/11
INFO:   Got addresses: [X.X.X.X], ns [X.X.X.X]
=================================================================
==5640==ERROR: AddressSanitizer: heap-use-after-free on address 0x60400001c218 at pc 0x55a9c30f64df bp 0x7f12c8795c00 sp 0x7f12c8795bf0
READ of size 8 at 0x60400001c218 thread T3
    #0 0x55a9c30f64de in ssl_read src/io.c:488
    #1 0x7f12cde126da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
    #2 0x7f12cdb3b88e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x12188e)

0x60400001c218 is located 8 bytes inside of 34-byte region [0x60400001c210,0x60400001c232)
freed by thread T2 here:
    #0 0x7f12cea637a8 in __interceptor_free (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xde7a8)
    #1 0x55a9c30f39fb in pppd_write src/io.c:344
    #2 0x7f12cde126da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)

previously allocated by thread T3 here:
    #0 0x7f12cea63b40 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xdeb40)
    #1 0x55a9c30f528b in ssl_read src/io.c:453
    #2 0x7f12cde126da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)

Thread T3 created by T0 here:
    #0 0x7f12ce9bcd2f in __interceptor_pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x37d2f)
    #1 0x55a9c30f6f21 in io_loop src/io.c:660
    #2 0x55a9c3104276 in run_tunnel src/tunnel.c:1087
    #3 0x55a9c30e992e in main src/main.c:566
    #4 0x7f12cda3bb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)

Thread T2 created by T0 here:
    #0 0x7f12ce9bcd2f in __interceptor_pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x37d2f)
    #1 0x55a9c30f6e89 in io_loop src/io.c:654
    #2 0x55a9c3104276 in run_tunnel src/tunnel.c:1087
    #3 0x55a9c30e992e in main src/main.c:566
    #4 0x7f12cda3bb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)

SUMMARY: AddressSanitizer: heap-use-after-free src/io.c:488 in ssl_read
Shadow bytes around the buggy address:
  0x0c087fffb7f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c087fffb800: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fa
  0x0c087fffb810: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fd
  0x0c087fffb820: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fa
  0x0c087fffb830: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fa
=>0x0c087fffb840: fa fa fd[fd]fd fd fd fa fa fa fa fa fa fa fa fa
  0x0c087fffb850: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c087fffb860: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c087fffb870: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c087fffb880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c087fffb890: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==5640==ABORTING
Hangup (SIGHUP)
Modem hangup
Connection terminated.
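The report reads like a handoff race between the two I/O threads: a packet allocated in ssl_read is enqueued for pppd_write, which frees it, while ssl_read still dereferences the pointer. A simplified runnable reconstruction of that pattern (all names are made up; this is not the actual io.c code):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct packet {
        struct packet *next;
        size_t len;
    };

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static struct packet *queue;

    static void *writer(void *arg)      /* stands in for pppd_write */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&mtx);
            struct packet *q = queue;
            if (q)
                queue = q->next;
            pthread_mutex_unlock(&mtx);
            if (q) {
                /* ... write q out to pppd ... */
                free(q);                /* consumer releases the packet */
                return NULL;
            }
        }
    }

    int main(void)                      /* stands in for ssl_read */
    {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);

        struct packet *p = malloc(sizeof(*p));
        p->len = 34;
        pthread_mutex_lock(&mtx);
        p->next = queue;
        queue = p;                      /* ownership passes to the writer */
        pthread_mutex_unlock(&mtx);

        printf("%zu\n", p->len);        /* BUG: p may already be freed */

        pthread_join(t, NULL);
        return 0;
    }

The fix pattern is to stop touching the packet once it has been handed over (or to copy whatever is still needed before enqueueing it).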

3. Added -fsanitize=thread compiler flag:

The ThreadSanitizer reports data races and the destruction of locked mutexes. Looks normal.

(...) tunnel is up
INFO:   Got addresses: [x.x.x.x], ns [x.x.x.x]
INFO:   negotiation complete
==================
WARNING: ThreadSanitizer: data race (pid=3389)
  Write of size 8 at 0x7b0c00007988 by thread T2:
    #0 free <null> (libtsan.so.0+0x2b136)
    #1 pppd_write src/io.c:344 (openfortivpn+0xb5c1)
    #2 <null> <null> (libtsan.so.0+0x296ad)

  Previous read of size 8 at 0x7b0c00007988 by thread T3:
    #0 ssl_read src/io.c:488 (openfortivpn+0xcd68)
    #1 <null> <null> (libtsan.so.0+0x296ad)

  Thread T2 (tid=3411, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.0+0x2bcee)
    #1 io_loop src/io.c:654 (openfortivpn+0xd1a1)
    #2 run_tunnel src/tunnel.c:1087 (openfortivpn+0x151de)
    #3 main src/main.c:566 (openfortivpn+0x5453)

  Thread T3 (tid=3412, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.0+0x2bcee)
    #1 io_loop src/io.c:660 (openfortivpn+0xd213)
    #2 run_tunnel src/tunnel.c:1087 (openfortivpn+0x151de)
    #3 main src/main.c:566 (openfortivpn+0x5453)

SUMMARY: ThreadSanitizer: data race (/usr/lib/x86_64-linux-gnu/libtsan.so.0+0x2b136) in free
(...) tunnel was up, simulating network issue
No response to 4 echo-requests
Serial link appears to be disconnected.
Connect time 3.0 minutes.
Sent 1039 bytes, received 3046 bytes.
Connection terminated.
Modem hangup
ERROR:  read: Input/output error
INFO:   Cancelling threads...
INFO:   Cleanup, joining threads...
==================
WARNING: ThreadSanitizer: destroy of a locked mutex (pid=3389)
    #0 pthread_mutex_destroy <null> (libtsan.so.0+0x2c6f9)
    #1 io_loop src/io.c:747 (openfortivpn+0xd64a)
    #2 run_tunnel src/tunnel.c:1087 (openfortivpn+0x151de)
    #3 main src/main.c:566 (openfortivpn+0x5453)

  and:
    #0 pthread_mutex_destroy <null> (libtsan.so.0+0x2c6f9)
    #1 io_loop src/io.c:747 (openfortivpn+0xd64a)
    #2 run_tunnel src/tunnel.c:1087 (openfortivpn+0x151de)
    #3 main src/main.c:566 (openfortivpn+0x5453)

  Location is stack of main thread.

  Location is global '<null>' at 0x000000000000 ([stack]+0x000000024490)

  Mutex M79 (0x7fffb73b1490) created at:
    #0 pthread_mutex_init <null> (libtsan.so.0+0x2c5ad)
    #1 io_loop src/io.c:606 (openfortivpn+0xcfae)
    #2 run_tunnel src/tunnel.c:1087 (openfortivpn+0x151de)
    #3 main src/main.c:566 (openfortivpn+0x5453)

SUMMARY: ThreadSanitizer: destroy of a locked mutex (/usr/lib/x86_64-linux-gnu/libtsan.so.0+0x2c6f9) in __interceptor_pthread_mutex_destroy
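POSIX leaves destroying a locked mutex undefined, so the teardown must guarantee that every lock has been released and that no thread can still take the mutex (a sketch with a hypothetical mutex name):

    pthread_mutex_lock(&queue_mutex);
    /* ... last critical section ... */
    pthread_mutex_unlock(&queue_mutex);   /* release every lock taken */
    /* and only after all threads using the mutex have been joined: */
    pthread_mutex_destroy(&queue_mutex);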

4. Running openfortivpn with valgrind

Valgrind reports nothing usable (at least to me; I am not used to it):

==19833== Memcheck, a memory error detector
==19833== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==19833== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==19833== Command: ./openfortivpn --persistent=2 --username=user ip:port
==19833== Parent PID: 540
==19833== 
--19833-- 
--19833-- Valgrind options:
--19833--    --log-file=log-vl3.log
--19833--    --leak-check=full
--19833--    --track-fds=yes
--19833--    --track-origins=yes
--19833--    -v
(...)
--19833-- Discarding syms at 0x7bfc090-0x7c09807 in /lib/x86_64-linux-gnu/libnsl-2.27.so due to munmap()
==19833== 
==19833== FILE DESCRIPTORS: 5 open at exit.
==19833== Open AF_INET socket 4: ip:port <-> ip:port
==19833==    at 0x5AD8EC7: socket (syscall-template.S:78)
==19833==    by 0x115B84: tcp_connect (tunnel.c:484)
==19833==    by 0x115B84: ssl_connect (tunnel.c:785)
==19833==    by 0x1170D0: run_tunnel (tunnel.c:1103)
==19833==    by 0x10CA93: main (main.c:566)
==19833== 
==19833== Open file descriptor 3: /.../log-vl3.log
==19833==    <inherited from parent>
==19833== 
==19833== Open file descriptor 2: /dev/pts/1
==19833==    <inherited from parent>
==19833== 
==19833== Open file descriptor 1: /dev/pts/1
==19833==    <inherited from parent>
==19833== 
==19833== Open file descriptor 0: /dev/pts/1
==19833==    <inherited from parent>
==19833== 
==19833== 
==19833== HEAP SUMMARY:
==19833==     in use at exit: 77,423 bytes in 1,046 blocks
==19833==   total heap usage: 118,672 allocs, 117,626 frees, 74,145,547 bytes allocated
==19833== 
==19833== Searching for pointers to 1,046 not-freed blocks
==19833== Checked 144,128 bytes
==19833== 
==19833== 2,790 bytes in 558 blocks are definitely lost in loss record 180 of 185
==19833==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==19833==    by 0x5A539B9: strdup (strdup.c:42)
==19833==    by 0x113A6B: ipv4_set_split_routes (ipv4.c:881)
==19833==    by 0x113A6B: ipv4_set_tunnel_routes (ipv4.c:979)
==19833==    by 0x115652: on_ppp_if_up (tunnel.c:113)
==19833==    by 0x110AA9: if_config (io.c:560)
==19833==    by 0x579E6DA: start_thread (pthread_create.c:463)
==19833==    by 0x5AD788E: clone (clone.S:95)
==19833== 
==19833== 5,828 (32 direct, 5,796 indirect) bytes in 1 blocks are definitely lost in loss record 182 of 185
==19833==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==19833==    by 0x5243E58: CRYPTO_zalloc (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==19833==    by 0x52A4F2E: OPENSSL_sk_new_reserve (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==19833==    by 0x52B7438: X509_STORE_new (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==19833==    by 0x4E73CF6: SSL_CTX_new (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1)
==19833==    by 0x115F78: ssl_connect (tunnel.c:797)
==19833==    by 0x1170D0: run_tunnel (tunnel.c:1103)
==19833==    by 0x10CA93: main (main.c:566)
==19833== 
==19833== 68,805 (6,224 direct, 62,581 indirect) bytes in 1 blocks are definitely lost in loss record 185 of 185
==19833==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==19833==    by 0x5243E58: CRYPTO_zalloc (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==19833==    by 0x4E76661: SSL_new (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1)
==19833==    by 0x116572: ssl_connect (tunnel.c:922)
==19833==    by 0x1170D0: run_tunnel (tunnel.c:1103)
==19833==    by 0x10CA93: main (main.c:566)
==19833== 
==19833== LEAK SUMMARY:
==19833==    definitely lost: 9,046 bytes in 560 blocks
==19833==    indirectly lost: 68,377 bytes in 486 blocks
==19833==      possibly lost: 0 bytes in 0 blocks
==19833==    still reachable: 0 bytes in 0 blocks
==19833==         suppressed: 0 bytes in 0 blocks
==19833== 
==19833== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)
==19833== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

5. Possible direction for next tests

With two tunnels up that push routes for the same network, both routes end up in the routing table:

Destination     Gateway             Genmask         Flags   MSS Window  irtt Iface
172.1.1.0        192.168.2.1        255.255.255.0   UG        0 0  0  ppp1
172.1.1.0        192.168.1.1        255.255.255.0   UG        0 0  0  ppp0

This should be considered fatal, since you may route packets to the wrong VPN peer.

Both instances running and tunnels up.
Kill 1st instance: exits well.
Kill 2nd instance: it sometimes gets stuck at "INFO: Closed connection to gateway." (not all the time):

(...) sighup to 2nd
INFO:   Adding VPN nameservers...
INFO:   Tunnel is up and running.
^CINFO:   Cancelling threads...
INFO:   Cleanup, joining threads...
INFO:   Setting ppp interface down.
INFO:   Restoring routes...
INFO:   Removing VPN nameservers...
Hangup (SIGHUP)
Modem hangup
Connect time 0.7 minutes.
Sent 17597 bytes, received 46750 bytes.
Connection terminated.
INFO:   pppd: The link was terminated by the modem hanging up.
INFO:   Terminated pppd.
INFO:   Closed connection to gateway.
mrbaseman commented 4 years ago

Hi @dovi5988

thanks for this detailed analysis. I'm not so deep into the topic of threads and mutexes, but I can briefly comment on your last points:

The results with valgrind: some of the findings seem to be inside OpenSSL. Maybe we use the library in a wrong manner and more measures are needed to use it cleanly in conjunction with pthreads, but maybe it's just a leak in the library. We should have a look at the memory we get back from strdup.
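On the strdup point: the loss record above (558 small blocks allocated via strdup in ipv4_set_split_routes) matches the classic pattern of a strdup'ed pointer being overwritten or abandoned on each reconnect cycle without freeing the previous copy. A hypothetical sketch of the fix pattern (the field and error names are made up):

    /* release the previous copy before storing a new one */
    free(route->if_name);               /* free(NULL) is a safe no-op */
    route->if_name = strdup(new_if_name);
    if (route->if_name == NULL)
        return ERR_NO_MEM;              /* hypothetical error code */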

About the connections to different VPNs: well, if you connect to several VPNs that use the same networks, you simply have a collision of the network ranges. You can either remove the routes which are already there, refuse to add new ones for the same network range, or just try to add them and see what happens. It's an unclear situation after all. Apart from that, there is a split-routes mode and a tunnel mode in which all traffic is supposed to be routed through the VPN. If such a tunnel is opened first and you open another VPN connection, the second one travels (with its own SSL encryption layer) through the first one. If you close the first tunnel, the second one can't reach its gateway on the previously established path anymore. The same applies when both tunnels run in split-routes mode but share the same network range: the path to one of the tunnels may fall into a route for a subnet pushed by the other one. As a rule of thumb, you run into trouble if the connections are not closed in the reverse order from that in which they were opened.

m33m33 commented 4 years ago

Hi @dovi5988 ...

@mrbaseman the test cases were conducted with split tunneling; no tunnel-in-tunnel issues here.

I suggest that openfortivpn should refuse to override routes put in place by its own previous instances (just a warning message for now).
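A hypothetical sketch of such a check, scanning /proc/net/route before adding a route (the destination is compared in the raw hex byte order that /proc/net/route uses, e.g. 10.0.0.0 appears as 0000000A on a little-endian machine):

    #include <stdint.h>
    #include <stdio.h>

    /* Return 1 if a route with this raw destination is already present. */
    static int route_exists(uint32_t dest)
    {
        char line[256];
        FILE *f = fopen("/proc/net/route", "r");
        int found = 0;

        if (f == NULL)
            return 0;
        fgets(line, sizeof(line), f);           /* skip the header line */
        while (!found && fgets(line, sizeof(line), f) != NULL) {
            char iface[16];
            unsigned long d, g;
            if (sscanf(line, "%15s %lx %lx", iface, &d, &g) == 3
                && (uint32_t)d == dest)
                found = 1;                      /* duplicate: warn, don't add */
        }
        fclose(f);
        return found;
    }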

About the valgrind reports, I agree, and they may well be false positives anyway.

My understanding right now is that @dovi5988 is experiencing issues because of the multiple instances/tunnels, which may occur in a very specific scenario.
