CESNET / libnetconf

C NETCONF library

libnetconf in deadlock during handshake #163

Open ntadas opened 8 years ago

ntadas commented 8 years ago

Hi

I have libnetconf configured with SSH and TLS disabled, and I'm using netopeer to connect to the server.

Sometimes during the capability exchange the server enters an endless loop. The loop is inside the function nc_session_read_until. It starts with `if (session->fd_input != -1)` and is able to read:

`<?xml version="1.0" encoding="UTF-8"?>` urn:ietf:params:netconf:base:1.0 urn:ietf:params:netconf:base:1.1 ...some specific model capabilities...

`... fd_input, &(buf[rd]), 1); if (c == -1) { if (errno == EAGAIN) { usleep (NC_READ_SLEEP); continue; }`

So it keeps trying to read data and always continues, without any break condition. Actually I have two issues here:

1. The infinite loop. I think this should be fixed.
2. Why the read fails in the middle of the capabilities exchange. I have no idea about this one.

Best regards

michalvasko commented 8 years ago

Hi, what is your setup again? You have libnetconf (the newest version I assume) with SSH and TLS disabled and are using netopeer-cli to connect to netopeer-server? You would not be able to compile netopeer-server with libnetconf that has both SSH and TLS disabled, so please elaborate, thank you.

Regards, Michal

ntadas commented 8 years ago

Hi, I'm using the latest libnetconf code, downloaded last night. I have my own server, built following the instructions on the libnetconf site (libnetconf compiled without TLS and without SSH), and I'm using a custom database. I have netopeer-cli as a client.

Most of the time I'm able to connect to the server and do get, get-config, edit-config, etc., but sometimes I hit the problem described above.

michalvasko commented 8 years ago

Hi, in that case it is quite difficult for us to help you; we can only guess what the problem might be. It seems to loop indefinitely because it is waiting on a non-blocking socket for more data, which was lost somewhere (or never sent). I don't think I can help you more.
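To illustrate why such a loop cannot tell "more data is coming" from "the rest was never sent", here is a minimal sketch, not libnetconf code. The names `probe_read` and `make_nonblocking_pipe` are hypothetical, and a pipe stands in for the real descriptor. On a non-blocking descriptor whose writer is still open but silent, `read()` keeps returning -1 with `EAGAIN`; it returns 0 (EOF) only once the writer closes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Possible outcomes of a single one-byte read attempt. */
enum read_state { READ_DATA, READ_WOULD_BLOCK, READ_EOF, READ_ERROR };

/* Hypothetical helper: create a pipe whose read end is non-blocking. */
static int make_nonblocking_pipe(int fds[2])
{
    int flags;

    if (pipe(fds) != 0) {
        return -1;
    }
    flags = fcntl(fds[0], F_GETFL);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        return -1;
    }
    return 0;
}

/* Hypothetical helper: try to read one byte and classify the result. */
static enum read_state probe_read(int fd, char *out)
{
    ssize_t c;

    assert(out != NULL);
    do {
        c = read(fd, out, 1);
    } while (c == -1 && errno == EINTR);  /* retry if interrupted by a signal */

    if (c == 1) {
        return READ_DATA;
    }
    if (c == 0) {
        return READ_EOF;                  /* writer closed its end */
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return READ_WOULD_BLOCK;          /* writer alive, but nothing to read */
    }
    return READ_ERROR;
}
```

A loop that simply sleeps and retries on `READ_WOULD_BLOCK` runs for as long as the peer keeps the connection open without sending the rest of the message.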

Regards, Michal

ntadas commented 8 years ago

Hi

But shouldn't this loop have a timeout? What else can I provide so that you can try to help with this issue?

Regards

rkrejci commented 8 years ago

Where does the file descriptor come from? Since libnetconf has neither SSH nor TLS in your case, I guess you have some standalone SSH/TLS server that forwards data through this file descriptor to your NETCONF server (libnetconf), right? You should investigate this connection.

ntadas commented 8 years ago

Yes, I'm using the SSH server from my Linux machine, and I have a small SSH subsystem that only connects the input to the output and vice versa, so that the client and the server can talk. For this test I'm running the server and the client on the same machine, so I'm doing an ssh to localhost and the subsystem is talking to the server via an AF_UNIX socket. I don't think the problem is in the connection, since it is a local connection (but I'll investigate this as well).
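One direction of such a relay could look roughly like the sketch below. It is only an illustration under assumptions: the function name `relay_copy` is hypothetical, the descriptors are assumed to be blocking, and the actual subsystem used here is not shown in this thread. The point is that every byte read must be written out in full, since a short `write()` that is not retried would truncate the NETCONF message.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything from in_fd to out_fd until EOF.
 * Returns the total number of bytes copied, or -1 on error.
 */
static long relay_copy(int in_fd, int out_fd)
{
    char buf[4096];
    long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0) {
            return total;                 /* EOF: the sender closed its end */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;                 /* interrupted, try again */
            }
            return -1;
        }

        /* write() may accept fewer bytes than requested; loop until all are sent. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR) {
                    continue;
                }
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

A full relay would run this in both directions at once (for example in two threads, or multiplexed with `poll()`); if either direction stalls, the other side only ever sees a partial message.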

Regardless of what is causing this (of course I still need to find it), I think the server shouldn't stay in an infinite loop. When this happens the connection thread is blocked forever and no one else is able to connect to the server.

rkrejci commented 8 years ago

I actually agree. The thing is that the timeout in the nc_session_recv_*() functions is intended for waiting for data, and here part of the data has already arrived, but the client (the SSH subsystem in your case) did not send a complete NETCONF message. It is actually a kind of DoS attack that comes from an authenticated client. The problem with any timeout here is the case of slow connections (e.g. satellite), where the delay can be quite big, so the timeout must be longer than the timeouts in the nc_session_recv_*() functions.

So our proposed solution is to add a separate timeout, 30 seconds by default, configurable via the configure script (so it is a constant for the compiled libnetconf). The timeout is reset whenever data is received (so the situation can repeat while receiving a single message and the total delay can be much longer, but in that case the problem is in the connection).
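The idea can be sketched as follows. This is not the libnetconf implementation; `read_until`, `set_nonblocking` and `READ_INACTIVITY_TIMEOUT_MS` are hypothetical names, and using `poll()` in place of the `usleep()` retry is an assumption made for the sketch. Each time the descriptor has no data, the code waits at most `timeout_ms` for more, so the timer effectively restarts after every received byte.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical compile-time default, mirroring the proposed 30 seconds. */
#ifndef READ_INACTIVITY_TIMEOUT_MS
#define READ_INACTIVITY_TIMEOUT_MS 30000
#endif

/* Hypothetical helper: switch a descriptor to non-blocking mode. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Read byte by byte from a non-blocking fd until `endtag` is seen.
 * Returns the number of bytes stored (excluding the terminating NUL),
 * -1 on error or EOF, -2 on inactivity timeout, -3 if buf is too small.
 */
static ssize_t read_until(int fd, const char *endtag, char *buf, size_t size,
                          int timeout_ms)
{
    size_t rd = 0;
    size_t taglen = strlen(endtag);

    assert(buf != NULL && size > 0 && taglen > 0);
    buf[0] = '\0';

    while (rd < size - 1) {
        ssize_t c = read(fd, &buf[rd], 1);
        if (c == 1) {
            rd++;
            buf[rd] = '\0';
            if (rd >= taglen && memcmp(&buf[rd - taglen], endtag, taglen) == 0) {
                return (ssize_t)rd;       /* complete message received */
            }
            continue;
        }
        if (c == 0) {
            return -1;                    /* EOF before the end tag */
        }
        if (errno == EINTR) {
            continue;
        }
        if (errno != EAGAIN && errno != EWOULDBLOCK) {
            return -1;
        }

        /* No data right now: wait for more, but only for timeout_ms.
         * A fresh wait starts after every byte, so the timer is reset
         * whenever data arrives. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int r = poll(&pfd, 1, timeout_ms);
        if (r == 0) {
            return -2;                    /* peer went silent mid-message */
        }
        if (r < 0 && errno != EINTR) {
            return -1;
        }
    }
    return -3;
}
```

A caller would pass `READ_INACTIVITY_TIMEOUT_MS` as `timeout_ms` and, on -2, close the session instead of spinning. With NETCONF 1.0 framing the end tag would be `]]>]]>`.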

What do you think?

ntadas commented 8 years ago

Seems reasonable. I'll continue my investigation to try to find why the message stops before it's finished. When I find something I'll post it here. thanks

ntadas commented 8 years ago

I've found the cause of the missing data: it was a problem in my application. In short, a thread was blocked and we couldn't receive all the data. The infinite loop is not so critical for me now, but nevertheless I think it should be fixed if possible, to avoid other issues. Thanks for your fast support.

Regards