eclipse-mosquitto / mosquitto

Eclipse Mosquitto - An open source MQTT broker
https://mosquitto.org

Socket error on client <unknown>, disconnecting. ARMv7 subscriber only #1385

Open nickper opened 5 years ago

nickper commented 5 years ago

Currently I'm working on a project where I need to use MQTT on an ARMv7 and an i686 device. The problem occurs specifically on the ARMv7 device.

Both devices run Debian 7 Wheezy.

When I try to connect the ARMv7 device to the broker, it seems to connect but doesn't receive anything at all. The broker logs the following:

1566312615: New connection from 192.168.1.40 on port 1883.
1566312616: Socket error on client <unknown>, disconnecting.
1566312616: New connection from 192.168.1.40 on port 1883.
1566312616: New client connected from 192.168.1.40 as mosq/qEidBqNY1Kx74MR3Ia (p2, c1, k60).
1566312616: No will message specified.
1566312616: Sending CONNACK to mosq/qEidBqNY1Kx74MR3Ia (0, 0)

For the i686 device, the broker logs the following:

1566313930: New connection from 192.168.1.20 on port 1883.
1566313930: New client connected from 192.168.1.20 as mosq/Y0ecobRZo5SXfWh1J1 (p2, c1, k60).
1566313930: No will message specified.
1566313930: Sending CONNACK to mosq/Y0ecobRZo5SXfWh1J1 (0, 0)
1566313930: Received SUBSCRIBE from mosq/Y0ecobRZo5SXfWh1J1
1566313930: diagnostics (QoS 1)
1566313930: mosq/Y0ecobRZo5SXfWh1J1 1 diagnostics
1566313930: Sending SUBACK to mosq/Y0ecobRZo5SXfWh1J1

After a little research I noticed that the broker does not forward anything to the ARMv7 device. I don't know if this is a bug or something else. Both devices use the same codebase, but are built with separate compilers.

//mosquittomqtt.h
#include <cstdint>
#include <string>
#include <mosquittopp.h>

class Handler;

struct Payload
{
    uint32_t id;
    std::string topic;
    int64_t counter;

    Payload();
    Payload(uint32_t id, std::string topic, int counter);

};

class mosquittoMQTT : public mosqpp::mosquittopp
{
public:
    mosquittoMQTT();
    virtual ~mosquittoMQTT();

    bool Initialise(std::string broker, std::string topic, int qos, Handler* handler);
    void Deinitialise();

    /// Publish data to MQTT topic
    void MQTTPublish(const std::string& topic, const Payload& payload, int qos = 0, bool retain = false);

    /// Handler for incoming data
    virtual void on_message(const struct mosquitto_message* message) override;

private:
    Handler* subscriber = nullptr;
    std::string topic = "default";
    std::string broker = "127.0.0.1";

};
//mosquittomqtt.cpp
#include "mosquittomqtt.h"
#include "handler.h"
#include <iostream>

///constructor
mosquittoMQTT::mosquittoMQTT()
{}

mosquittoMQTT::~mosquittoMQTT()
{}

/// Initialise this class
bool mosquittoMQTT::Initialise(std::string broker, std::string Topic, int qos, Handler* handler)
{
    // initialise mosquitto
    mosqpp::lib_init();
    loop_start();

    this->topic = Topic;
    this->subscriber = handler;
    this->broker = broker;

    int result = connect(this->broker.c_str());
    if (result != MOSQ_ERR_SUCCESS)
    {
        std::cout << "Error connecting to MQTT Broker. Error code " << mosquitto_strerror(result) << std::endl;
        mosqpp::lib_cleanup();
        return false;
    }
    if (this->subscriber != nullptr)
    {
        if (this->subscribe(nullptr, topic.c_str(), qos))
        {
            std::cout << "Error initialising MQTTSubscriber" << std::endl;
            disconnect();
            mosqpp::lib_cleanup();
            return false;
        }
    }
    return true;
}

/// Deinitialise this class
void mosquittoMQTT::Deinitialise()
{
    disconnect();
    loop_stop();
    mosqpp::lib_cleanup();
}

/// Publish data on MQTT
void mosquittoMQTT::MQTTPublish(const std::string& topic, const Payload& payload, int qos, bool retain)
{
    // Publish to MQTT broker
    int result = publish(nullptr, topic.c_str(), sizeof(payload), &payload, qos, true);
    if (result != MOSQ_ERR_SUCCESS)
    {
        std::cout << "Error publishing on MQTT. Return code " << result << std::endl;
        if (result == MOSQ_ERR_NO_CONN)
        {
            std::cout << "Trying to reconnect" << std::endl;
            reconnect_async();
        }
    }
}

/// Handle incoming messages
void mosquittoMQTT::on_message(const struct mosquitto_message* message)
{
    //std::cout << message->payload << std::endl;
    /// make sure message topic and payload are copied!!!
    if (this->subscriber)
    {
        //std::cout << message->payloadlen << std::endl;
        Payload* payload = (struct Payload*)message->payload;
        //std::cout << payload->counter << std::endl;
        this->subscriber->ReceivedIntegerValue(payload->id, payload->counter);
    }
}

Payload::Payload(uint32_t id, std::string topic, int counter)
    : id(id)
    , counter(counter)
    , topic(topic)
{
}

Payload::Payload()
    : id(0)
    , counter(0)
    , topic("default")
{
}

Thanks in advance. Nick
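A side note on the code above, independent of the socket error: `MQTTPublish` sends `sizeof(payload)` raw bytes of a struct that contains a `std::string`, and `on_message` casts the received bytes straight back to `Payload*`. A `std::string` member holds a pointer into the sender's memory rather than the text itself, and struct padding can differ between compilers, so this is unlikely to be portable between the i686 and ARMv7 builds. Below is a minimal sketch of an explicit wire format, assuming only `id` and `counter` need to travel. The helper names `encode_payload` and `decode_payload` and the 12-byte little-endian layout are illustrative choices, not part of mosquitto.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical wire format: 4 bytes id followed by 8 bytes counter, little-endian.
constexpr std::size_t kWireSize = 12;

// Write the low `n` bytes of `value` into `out`, least significant byte first.
inline void put_le(std::uint8_t* out, std::uint64_t value, std::size_t n)
{
    assert(n <= 8);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>(value >> (8 * i));
}

// Read `n` little-endian bytes from `in`.
inline std::uint64_t get_le(const std::uint8_t* in, std::size_t n)
{
    assert(n <= 8);
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < n; ++i)
        value |= static_cast<std::uint64_t>(in[i]) << (8 * i);
    return value;
}

// Serialise id and counter into a fixed-size buffer.
inline std::array<std::uint8_t, kWireSize> encode_payload(std::uint32_t id, std::int64_t counter)
{
    std::array<std::uint8_t, kWireSize> buf{};
    put_le(buf.data(), id, 4);
    put_le(buf.data() + 4, static_cast<std::uint64_t>(counter), 8);
    return buf;
}

// Returns false if the message is not exactly kWireSize bytes long.
inline bool decode_payload(const void* data, int len, std::uint32_t& id, std::int64_t& counter)
{
    if (data == nullptr || len != static_cast<int>(kWireSize))
        return false;
    const auto* bytes = static_cast<const std::uint8_t*>(data);
    id = static_cast<std::uint32_t>(get_le(bytes, 4));
    counter = static_cast<std::int64_t>(get_le(bytes + 4, 8));
    return true;
}
```

With something like this, the publish call would pass `buf.size()` and `buf.data()` instead of `sizeof(payload)` and `&payload`, and `on_message` would call `decode_payload(message->payload, message->payloadlen, id, counter)` instead of casting.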

karlp commented 5 years ago

You never send a subscribe. I'd check the way you are trying to handle the "has the subscriber been initialized yet?"

nickper commented 5 years ago

I do, it is only hidden in an if statement. And I know that this method works, because I get subscription messages when I subscribe with my i686 device.

if (this->subscribe(nullptr, topic.c_str(), qos))

nickper commented 5 years ago

To give some more information: I built mosquitto from source with an ARM toolchain on an i686 VM with the following command:

make WITH_TLS=no WITH_DOCS=no WITH_BUNDLED_DEPS=no

Furthermore, I checked the traffic with Wireshark and encountered something weird. This is a working subscribe connection (screenshot). The problem is that my ARM device doesn't send these subscribe request messages, while this->subscribe(nullptr, topic.c_str(), qos) returns no error.

ralight commented 5 years ago

Can I check, if you're building from source I presume you're on version 1.6.4? Is that correct?

nickper commented 5 years ago

that is correct

karlp commented 5 years ago

Yes, I wouldn't expect a subscribe request in Wireshark, as it's not shown in the broker logs either. Are you sure you actually make the subscribe call? Add an else clause so you get a print regardless?
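The "print regardless" suggestion could look roughly like the sketch below. It is self-contained, so it takes the return code as a parameter instead of calling libmosquitto; `report_subscribe` is a hypothetical helper, and `kSuccess` stands in for `MOSQ_ERR_SUCCESS` (0).

```cpp
#include <cassert>
#include <iostream>
#include <ostream>
#include <string>

// Stand-in for MOSQ_ERR_SUCCESS so the sketch compiles without mosquitto.h.
constexpr int kSuccess = 0;

// Log the outcome of a subscribe() call on both branches, so that a silent
// success can be told apart from the call never being reached at all.
inline bool report_subscribe(int rc, const std::string& topic, std::ostream& out = std::cout)
{
    if (rc != kSuccess)
    {
        out << "subscribe(" << topic << ") failed, rc=" << rc << '\n';
        return false;
    }
    else
    {
        out << "subscribe(" << topic << ") accepted, rc=" << rc << '\n';
        return true;
    }
}
```

In `Initialise` this would wrap the existing call, for example `report_subscribe(this->subscribe(nullptr, topic.c_str(), qos), topic)`. As far as I know, a zero return only means the library accepted the request, not that the broker has seen it.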

nickper commented 5 years ago

As far as I can see, the subscribe function completes successfully. It returns no error, and according to the documentation it should be sufficient to call loop_start() to ensure that it connects as it should.

The client <unknown> error appears at connect(this->broker.c_str());, which may mean that the setup in this function is not going as it should. But the function itself also returns no error.

ralight commented 5 years ago

Does the on_log logging show anything useful on the client?
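For reference, client-side logging in the C++ wrapper is obtained by overriding `on_log(int level, const char* str)`. Below is a small sketch of a helper such an override might use to turn the numeric level into a name. The enum values copy the `MOSQ_LOG_*` constants from `mosquitto.h` so the snippet compiles on its own; `log_level_name` and `format_log` are hypothetical helpers.

```cpp
#include <cassert>
#include <string>

// Values mirror MOSQ_LOG_* in mosquitto.h (copied here as stand-ins).
enum LogLevel
{
    kLogInfo = 0x01,
    kLogNotice = 0x02,
    kLogWarning = 0x04,
    kLogErr = 0x08,
    kLogDebug = 0x10
};

// Map a numeric log level to a readable name.
inline const char* log_level_name(int level)
{
    switch (level)
    {
    case kLogInfo:    return "INFO";
    case kLogNotice:  return "NOTICE";
    case kLogWarning: return "WARNING";
    case kLogErr:     return "ERR";
    case kLogDebug:   return "DEBUG";
    default:          return "OTHER";
    }
}

// Format one line the way an on_log() override might print it.
inline std::string format_log(int level, const char* str)
{
    assert(str != nullptr);
    return std::string("on log: ") + log_level_name(level) + ", " + str;
}
```

Inside the class from the original post, the override could then be as short as `void on_log(int level, const char* str) override { std::cout << format_log(level, str) << std::endl; }`.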

nickper commented 5 years ago

on log: 16, Client mosq/lc7NxCxvBtlsB1bX8y sending CONNECT
on log: 16, Client mosq/lc7NxCxvBtlsB1bX8y sending SUBSCRIBE (Mid: 1, Topic: diagnostics, QoS: 0, Options: 0x00)
Waiting for samples... //is called after the initialize function in the main
on log: 16, Client mosq/lc7NxCxvBtlsB1bX8y sending CONNECT
on log: 16, Client mosq/lc7NxCxvBtlsB1bX8y received CONNACK (0)

It does send a subscribe according to on_log. The i686 devices don't show the second CONNECT/CONNACK.
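The client log above shows SUBSCRIBE being sent before any CONNACK arrives, followed by a second CONNECT. With a clean session (the `c1` in the broker log) the broker keeps no subscriptions across connections and, as far as I know, the client library does not resubscribe on its own. A common libmosquitto pattern is therefore to subscribe from the `on_connect` callback, which runs after each successful CONNACK. The sketch below only simulates that callback flow: `FakeClient` and its methods are hypothetical stand-ins so the snippet runs without libmosquitto. In real code the class would derive from `mosqpp::mosquittopp` and override `on_connect(int rc)`.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a client; it only records what would be sent.
class FakeClient
{
public:
    explicit FakeClient(std::string topic) : topic_(std::move(topic))
    {
        assert(!topic_.empty());
    }

    // Simulates the library delivering a CONNACK with return code `rc`.
    void simulate_connack(int rc) { on_connect(rc); }

    // Packets that would have been sent so far.
    const std::vector<std::string>& sent() const { return sent_; }

private:
    // Mirrors the shape of mosqpp::mosquittopp::on_connect(int rc).
    void on_connect(int rc)
    {
        if (rc != 0)
            return;        // connection refused: nothing to subscribe to
        subscribe(topic_); // runs again after every successful reconnect
    }

    void subscribe(const std::string& topic) { sent_.push_back("SUBSCRIBE " + topic); }

    std::string topic_;
    std::vector<std::string> sent_;
};
```

Two successful CONNACKs produce two SUBSCRIBE entries, so the subscription would survive the kind of reconnect seen in the log.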

nickper commented 5 years ago

I have tried building it in another build environment with another toolchain, and also updated the Linux version on the ARMv7 target. It still gives the same error.

ralight commented 5 years ago

"updated the linux version": do you mean something newer than Wheezy? I haven't yet reproduced this on any architecture, but don't have anything running Wheezy.

The example code you provide is incomplete, is it possible to have a full working example that shows the problem?

nickper commented 5 years ago

Yes, this time I tried building it on a Yocto Ubuntu 18.04 environment, and with updated system libraries on my device. Unfortunately the problem still persists.

I use a custom-built Linux OS which is heavily based on Debian Wheezy. It uses kernel 3.10, which gave some trouble, but I created a workaround for that. I use that workaround on both devices, and it works on both. (I mentioned my workaround in #1403.)

I will provide a working copy later today

nickper commented 5 years ago

I debugged the library and found a problem. In my case, on the ARMv7 chip, the library runs into a race condition where it internally returns MOSQ_ERR_NO_CONN and tries to reconnect. The problem is that this reconnect is, for some reason, not done correctly. By accident I discovered that the problem went away once there were more than 2 print statements between the initializer and the first real socket action in the loop_forever function. Therefore I put a usleep right at the start of loop_forever and the problem disappeared.

//loop.c
int mosquitto_loop_forever(struct mosquitto *mosq, int timeout, int max_packets)
{
    usleep(400); // workaround: wait 400 microseconds before the first socket action
    int run = 1;
    int rc;
    unsigned int reconnects = 0;
    unsigned long reconnect_delay;
...

It is an ugly solution, but for now it helps. I don't know if I am the only one with this problem, or whether the kernel version, Linux environment and/or hardware specs are responsible for this, but I finally got it working.

It may be good to check whether the reconnect function works when the first connection is not established properly.

EDIT: I forgot to mention that before my fix I also got the "Socket error on client <unknown>, disconnecting" notification when I tried to connect with my i686 device. But for some reason it didn't have any impact there.
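A fixed `usleep` hides a race rather than removing it. On the application side, one way to avoid depending on timing is to block until the connection has actually been acknowledged. Below is a minimal sketch using only the standard library. `ConnectGate` is a hypothetical helper, and the assumption is that `notify_connected()` would be called from the client's `on_connect` callback while the main thread waits before subscribing or publishing. This does not fix a race inside the library itself.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

class ConnectGate
{
public:
    // Called from the network thread once CONNACK has been received.
    void notify_connected()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            connected_ = true;
        }
        cv_.notify_all();
    }

    // Blocks until connected or until `timeout` expires.
    // Returns true if the connection was established in time.
    bool wait_connected(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return connected_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool connected_ = false;
};
```

`Initialise` could then call something like `gate.wait_connected(std::chrono::seconds(5))` after `connect()` and report a timeout if it returns false, instead of relying on a delay of a fixed length.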

ralight commented 5 years ago

Good find! Are you able to check with the latest fixes branch? There are some extra locks added where they were missing. It could be related.

nickper commented 5 years ago

I tried the fixes branch, but it doesn't resolve the problem. On the broker side I still receive the message "Socket error on client <unknown>, disconnecting.", which is the indication of the race condition.

ralight commented 5 years ago

I haven't been able to reproduce this, but I think I can tell where the most likely cause for this is. I've just pushed a commit which may fix it.

karlp commented 5 years ago

This is a regression for me on both desktop linux (glibc, x86_64) and openwrt (musl-libc, mips32/ath79)

I use connect_async() followed by loop_start, and I simply never receive my on_connect callback. I'm using libevent2 for my own portion of the application, and if I send a signal that I'm handling via libevent2 (ctrl-c to cleanly exit), I finally see the connect callback firing, immediately before my clean exit handler disconnects and exits.

karlp commented 5 years ago

test case for connect_async available at https://github.com/etactica/mosquitto/commit/f7e04bf963259d131f1ee57b991a6c6c1bce8162

ralight commented 5 years ago

There's an updated fix in the fixes branch that helps the regression and should help this too.