aws-samples / aws-iot-securetunneling-localproxy

AWS IoT Secure Tunneling local proxy reference C++ implementation
https://docs.aws.amazon.com/iot/latest/developerguide/what-is-secure-tunneling.html
Apache License 2.0

Localproxy in destination mode occupies TCP port 5001 #127

Closed: omri-s-electreon closed this issue 1 year ago

omri-s-electreon commented 1 year ago

Hi all,

We are developing an IoT platform and facing an issue with the localproxy (v3.0.2) in destination mode (on our devices).

Our devices are (embedded) Linux based and composed of several Linux components: one main CPU/board (running Linux) and multiple secondary CPUs/boards (also running Linux), all connected via LAN (an Ethernet switch connecting all of the components). The main Linux component runs a C++ user-space application (using the AWS IoT Device SDK for C++ v2) which connects to AWS IoT Core via MQTT to publish/subscribe messages. This application (a TCP server) also listens for connections from the secondary components (TCP clients) on port 5001. The application is subscribed to IoT Core, so when it receives a message about a secure tunnel (opened or keys regenerated), it launches the localproxy in destination mode, which successfully connects to the tunnel. The application starts localproxy with the following command (setting the access token as an environment variable prior to execution):

localproxy -r eu-west-1 -d SSH=127.0.0.1:22

We are using this feature to allow remote and secure SSH to our deployed devices and it seems to work well (stable & secure).
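For reference, here is roughly how our application launches localproxy once it receives a tunnel notification (a simplified, illustrative sketch rather than our actual code; error handling omitted):

    // Simplified sketch of how the main application starts localproxy.
    // Names are illustrative; the real code also handles errors and logging.
    #include <cstdlib>
    #include <string>

    void start_localproxy(const std::string& access_token) {
        // Pass the access token via the environment rather than on the
        // command line (AWSIOT_TUNNEL_ACCESS_TOKEN is the variable
        // documented for localproxy).
        ::setenv("AWSIOT_TUNNEL_ACCESS_TOKEN", access_token.c_str(), 1 /*overwrite*/);

        // std::system() runs the command in a child shell of this process
        // and, with the trailing '&', returns without waiting for it.
        std::system("localproxy -r eu-west-1 -d SSH=127.0.0.1:22 &");
    }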

We started noticing that sometimes the secondary components are unable to connect to the main one (via TCP socket). When this happens, the main application is unable to bind to the socket (listening on port 5001), and it only happens when localproxy is running on the main device alongside the main application.

We ran the following command in different scenarios:

  1. When localproxy is NOT running, the main component (192.168.1.20) is listening on port 5001 for connections and a secondary component (192.168.1.121) connected successfully to port 5001 and is now communicating with the main one from port 41764.

    # netstat -a -n -p -l | grep 5001
    tcp        0      0 0.0.0.0:5001            0.0.0.0:*               LISTEN      19239/main-app 
    tcp      384      0 192.168.1.20:5001       192.168.1.121:41764     ESTABLISHED 19239/main-app   <----- All good!
  2. After disconnecting the secondary component (pulling the network cable), starting localproxy (while the main app is running), and then reconnecting the secondary component, the main component (192.168.1.20) is still listening on port 5001 for connections and the secondary component (192.168.1.121) is listed as connected, BUT the connection is now being handled by localproxy (how come?).

    # netstat -a -n -p -l | grep 5001
    tcp        0      0 0.0.0.0:5001            0.0.0.0:*               LISTEN      19239/main-app 
    tcp   763072      0 192.168.1.20:5001       192.168.1.121:41764     ESTABLISHED 28853/localproxy    <----- How did this change suddenly?
  3. When we restart our main application while localproxy is running, we get the following output: now localproxy is the one listening on port 5001! In this case the main application cannot bind to port 5001, as it is occupied by localproxy.

    # netstat -a -n -p -l | grep 5001
    tcp        0      0 0.0.0.0:5001            0.0.0.0:*               LISTEN      28853/localproxy    <----- Why is localproxy listening on 5001?
    tcp   763072      0 192.168.1.20:5001       192.168.1.121:41764     ESTABLISHED  28853/localproxy   <----- Why is the client communicating with localproxy?

Why is localproxy hogging port 5001? Is there a way to start it so that it uses a different port? We would appreciate an explanation of this behaviour.

Thanks,

Omri

omri-s-electreon commented 1 year ago

Any comment(s)?

RogerZhongAWS commented 1 year ago

Hello and thanks for reporting this issue. Unfortunately, we do not know much more than you do at the moment. Localproxy in destination mode is simply supposed to function as yet another TCP client, so it is strange to see it overstepping its boundaries, so to speak. Some ad hoc testing I have done (opening a random port, listening on it via netcat, and establishing a connection while localproxy is running and handling an active SSH session) has so far not reproduced the issue.

Do you have trace logs that may help us dive further into the issue? Specifically I want to look out for things like

omri-s-electreon commented 1 year ago

Hi. Although some time has passed, we do have a short update, as we found the cause of this behaviour. Our (Linux, user-space) application starts the localproxy "daemon" as a child process (using std::system(...)) and detaches from it. Since localproxy is a child process of the application, it inherits all file descriptors (files, sockets, devices, etc.) that the parent has open at the point of forking. Because the parent detaches itself, it can stop and start without affecting the localproxy process, so when the app restarts and tries to re-acquire its file descriptors (i.e. bind its listening socket), there is already one held open by localproxy, inherited from the parent. To solve this, we now start localproxy as an actual background daemon, which does not have the previous parent-child relationship, so file descriptors are not shared.

Not sure this is the correct explanation, but we definitely managed to persuade ourselves of it ;-)
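For completeness, the same problem can probably also be avoided without a separate daemon by marking the listening socket close-on-exec, so a child started via std::system() never inherits it. A minimal sketch of that approach (plain BSD sockets, not our production code, just to illustrate the technique):

    // Alternative fix (sketch): create the listening socket with the
    // close-on-exec flag so it is not inherited by child processes.
    #include <arpa/inet.h>
    #include <cstdint>
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int make_listen_socket(std::uint16_t port) {
        // SOCK_CLOEXEC sets close-on-exec atomically at socket creation.
        int fd = ::socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0);
        if (fd < 0) return -1;

        int yes = 1;
        ::setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(port);

        if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
            ::listen(fd, SOMAXCONN) < 0) {
            ::close(fd);
            return -1;
        }
        return fd;
    }

    // For a descriptor that is already open, the flag can be added after the fact:
    //   fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) | FD_CLOEXEC);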

Thanks,

Omri