Improved connection errors

fpagliughi commented 4 years ago

A problem I've been hearing about - and have hit myself a few times - is trying to figure out why a secure connection was refused to a remote broker. There are two distinct errors that are both reported by the library as the same thing (basically, "connection refused"):

1. The client and broker are unable to create the secure SSL/TLS connection (bad certificates, etc.).
2. The underlying connection is established, but then the broker doesn't like some parameter in the CONNECT packet and immediately drops the connection.

The second one is common with a number of web services that aren't fully compliant with the protocol (AWS, Azure, etc.). But when people hit it, most assume it's a problem authenticating the secure connection and waste time chasing the wrong problem.

The only way I've been able to distinguish the two is by looking through the logs. But it would be great if there were separate errors back from the library for these things.

keysight-daryl commented 4 years ago

+1 - spend a ton of time diagnosing ambiguous connection issues

icraggs commented 4 years ago

You can get the TLS error messages by setting the ssl_error_cb function pointer in the SSL options structure. If you don't get any error messages from that, then the TLS negotiation has succeeded.

I agree it could be a good idea to differentiate between TCP, TLS (and probably WebSocket) connection failures in the error code information. I think we were thinking that the TLS error callback would cover that.

On services returning error codes in the connack, or not. As the writer of a service, I might decide I'd rather not give out exact information about the error in case I'm aiding a malicious hacking attempt.
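As a rough illustration of the TLS error callback mentioned above, here is a minimal sketch of a handler with the shape I understand `ssl_error_cb` to expect (a message pointer, its length, and a user context pointer). The `tls_error_log` type and both function names are made up for this example; the field names in the wiring comment should be checked against the headers of the library version in use.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for TLS error text; not part of the Paho API. */
typedef struct {
    char   text[512];  /* concatenated error lines, always NUL-terminated */
    size_t used;       /* bytes currently stored in text */
    int    count;      /* number of times the callback fired */
} tls_error_log;

/*
 * Callback with the (str, len, user-context) signature.
 * Assumed wiring, to be verified against your headers:
 *     ssl_opts.ssl_error_cb      = collect_tls_error;
 *     ssl_opts.ssl_error_context = &log;
 */
int collect_tls_error(const char *str, size_t len, void *u)
{
    tls_error_log *log = (tls_error_log *)u;
    size_t room = sizeof(log->text) - 1 - log->used;
    size_t n = len < room ? len : room;   /* truncate rather than overflow */

    memcpy(log->text + log->used, str, n);
    log->used += n;
    log->text[log->used] = '\0';
    log->count++;
    return 1;  /* non-zero so any remaining queued errors are also reported */
}

/* Per the comment above: no callback invocations suggests TLS negotiation succeeded. */
int tls_errors_seen(const tls_error_log *log)
{
    return log->count > 0;
}
```

After a failed connect, an application could check `tls_errors_seen(&log)`: if it returns 0, the failure is more likely at the TCP or MQTT level than in the TLS handshake.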

fpagliughi commented 3 years ago

I received this Issue from a user of one of my MQTT apps:

> The error messages from [the MQTT app] for common failure scenarios are quite vague. A more precise error message or error code would be helpful in diagnosing issues in the field.
>
> - On bad credentials: Unable to connect to MQTT broker: [-1] TCP connect completion failure
> - On invalid url: Unable to connect to MQTT broker: [-1] TCP connect completion failure
> - On Network disconnect: Unable to connect to MQTT broker: General failure
> - On DNS resolution errors: Unable to connect to MQTT broker: General failure

I'm not sure of the best way to proceed (Lots more error return codes? A thread-local type of Paho errno? etc). But I do agree that if we can provide some better details all around, it would be really helpful.

icraggs commented 3 years ago

One thing to check on the bad credentials error is what the behaviour of the broker is. If it just chops the TCP connection, then you're not going to get any more information. A broker MIGHT return an appropriate return code in the connack, but it's not obliged to; it's within its rights to terminate the TCP connection. That applies to other connack return codes too.
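For reference, when a broker does send a CONNACK, MQTT 3.1.1 defines six return codes (0 through 5). The sketch below maps them to text; the function name `connack_v311_text` is made up for this example and is not a library call.

```c
/*
 * Map an MQTT 3.1.1 CONNACK return code to a short description.
 * Hypothetical helper for illustration; not part of the Paho API.
 */
const char *connack_v311_text(int rc)
{
    static const char *const text[] = {
        "connection accepted",            /* 0 */
        "unacceptable protocol version",  /* 1 */
        "identifier rejected",            /* 2 */
        "server unavailable",             /* 3 */
        "bad user name or password",      /* 4 */
        "not authorized"                  /* 5 */
    };
    const int known = (int)(sizeof(text) / sizeof(text[0]));

    if (rc < 0 || rc >= known)
        return "reserved or unknown return code";
    return text[rc];
}
```

None of this helps if the broker simply closes the socket, since in that case there is no CONNACK to decode.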

fpagliughi commented 3 years ago

Ah. Yeah. Can't wait until we can all move to v5! :-)

icraggs commented 3 years ago

That doesn't necessarily change with V5. Saying that the userid and password are wrong, for instance, can be considered an exposure of information that aids hacking attempts.

There is already a message field in the failureData structure which provides some description. If there were a connack return code returned from the broker, then this message field should already be filled out with "CONNACK return code" so I suspect the broker is not sending back the connack.

The message field could be used to include more accurate information, about TLS errors, for instance. The protocol trace does include all the needed info, so it's a matter of making sure it's included.

fishkeeper87 commented 2 years ago

> You can get the TLS error messages by setting the ssl_error_cb function pointer in the SSL options structure. If you don't get any error messages from that, then the TLS negotiation has succeeded.
>
> I agree it could be a good idea to differentiate between TCP, TLS (and probably WebSocket) connection failures in the error code information. I think we were thinking that the TLS error callback would cover that.
>
> On services returning error codes in the connack, or not. As the writer of a service, I might decide I'd rather not give out exact information about the error in case I'm aiding a malicious hacking attempt.

Is there any good example online or in the tests that shows how to use the function callback ssl_error_cb? I'm new to using OpenSSL and haven't really found anything yet, but will keep looking. I can connect to my local mosquitto broker using TLS 1.2 mutual auth using mosquitto_pub with my self-signed certs, but I cannot connect with paho_cs_pub.

paho_cs_pub -h 192.168.3.165 -m "test" -t tcusim/89011703278600892767/voice -p 8883 --insecure --cafile ca-certificates-local.crt --cert 89011703278600892767wIntcopy.crt --key 89011703278600892767.crt --trace protocol

Thanks!

Trace : 3, =========================================================
Trace : 3, Trace Output
Trace : 3, Product name: Eclipse Paho Synchronous MQTT C Client Library
Trace : 3, Version: 1.3.0
Trace : 3, Build level: Tue Jan 4 15:00:22 CST 2022
Trace : 3, OpenSSL version: OpenSSL 1.1.1 11 Sep 2018
Trace : 3, OpenSSL flags: compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-Flav1L/openssl-1.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX2
Trace : 3, OpenSSL build timestamp: built on: Mon Aug 23 17:02:39 2021 UTC
Trace : 3, OpenSSL platform: platform: debian-amd64
Trace : 3, OpenSSL directory: OPENSSLDIR: "/usr/lib/ssl"
Trace : 3, /proc/version: Linux version 5.4.0-91-generic (buildd@lgw01-amd64-024) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021
Trace : 3, =========================================================
Trace : 4, 20220104 154701.222 3 paho-cs-pub -> CONNECT cleansession: 1 (0)
Trace : 5, 20220104 154701.232 waitfor unexpectedly is NULL for client paho-cs-pub, packet_type 2, timeout 29880
Trace : 4, 20220104 154702.241 3 paho-cs-pub -> CONNECT cleansession: 1 (0)
Trace : 5, 20220104 154702.241 waitfor unexpectedly is NULL for client paho-cs-pub, packet_type 2, timeout 28861

fpagliughi commented 11 months ago

Looking at the logs from the C library, it seems like the useful information is being determined and logged, but not passed to the caller. Much of the info I was thinking about would be on the failure before or during the connection attempt itself.

I would love to know if the failure was one of these:

- Address resolution failure (unknown host)
- Socket connect TCP failure (nothing listening on host:port)
  - Maybe separate this by common TCP errors: ECONNREFUSED, ENETUNREACH, ETIMEDOUT
- SSL/TLS error, separate from TCP error. (Caller should add SSL callback for details)
- Timeout waiting for CONNACK (something listening on the port, but maybe it's not an MQTT broker?)
- Server abruptly disconnected after receiving CONNECT packet. (i.e. non-conforming AWS doesn't support something you requested and just decided to hang up on you)

That sort of thing.

I assume this can be done in a non-breaking-API fashion by adding a whole bunch of new return codes. With a C int we still have room for thousands of new codes. :-)
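To make the idea concrete, here is a minimal sketch of what such a set of codes could look like, following the failure categories listed above. Every name and numeric value here is hypothetical: none of these codes exist in the library, and the values were picked arbitrarily for illustration.

```c
#include <errno.h>

/* Hypothetical connect-failure codes; names and values are illustrative only. */
enum connect_failure {
    CONNFAIL_NONE                  = 0,
    CONNFAIL_RESOLVE               = -101, /* address resolution failed */
    CONNFAIL_TCP_REFUSED           = -102, /* ECONNREFUSED */
    CONNFAIL_TCP_UNREACHABLE       = -103, /* ENETUNREACH */
    CONNFAIL_TCP_TIMEOUT           = -104, /* ETIMEDOUT */
    CONNFAIL_TCP_OTHER             = -105, /* any other socket error */
    CONNFAIL_TLS                   = -106, /* details via the SSL callback */
    CONNFAIL_CONNACK_TIMEOUT       = -107, /* nothing answered the CONNECT */
    CONNFAIL_DROPPED_AFTER_CONNECT = -108  /* peer closed after CONNECT */
};

/* Translate the errno from a failed socket connect into one of the codes above. */
enum connect_failure classify_tcp_errno(int err)
{
    switch (err) {
    case 0:            return CONNFAIL_NONE;
    case ECONNREFUSED: return CONNFAIL_TCP_REFUSED;
    case ENETUNREACH:  return CONNFAIL_TCP_UNREACHABLE;
    case ETIMEDOUT:    return CONNFAIL_TCP_TIMEOUT;
    default:           return CONNFAIL_TCP_OTHER;
    }
}

/* Human-readable text for each hypothetical code. */
const char *connect_failure_text(enum connect_failure f)
{
    switch (f) {
    case CONNFAIL_NONE:                  return "no failure";
    case CONNFAIL_RESOLVE:               return "address resolution failure";
    case CONNFAIL_TCP_REFUSED:           return "TCP connection refused";
    case CONNFAIL_TCP_UNREACHABLE:       return "network unreachable";
    case CONNFAIL_TCP_TIMEOUT:           return "TCP connect timed out";
    case CONNFAIL_TCP_OTHER:             return "other TCP connect failure";
    case CONNFAIL_TLS:                   return "SSL/TLS failure";
    case CONNFAIL_CONNACK_TIMEOUT:       return "timeout waiting for CONNACK";
    case CONNFAIL_DROPPED_AFTER_CONNECT: return "server disconnected after CONNECT";
    }
    return "unknown failure";
}
```

Keeping the new values negative and well away from the codes already in use would let existing callers that only test for a non-success result keep working unchanged.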

eclipse-paho / paho.mqtt.c

Improved connection errors #937