eclipse / paho.mqtt.python


Due to select(), paho-mqtt is unable to connect if more than 1024 file handles are used #819

Open showfuture opened 4 months ago

showfuture commented 4 months ago

Problem description:

Python version: 3.9
Paho-MQTT version: 1.6.1

When I run 1000 threads, each acting as a client that connects to the MQTT broker, only a few hundred clients manage to connect; the rest never connect. The cause appears to be the _socketpair_compat function called in loop_start. Raising the system file handle limit to 65535 does not help. However, if the _socketpair_compat call is commented out, all clients connect successfully.

Question:

Is there any way to solve this problem?

JamesParrott commented 4 months ago

If you really need 1000 threads I would strongly suggest a library with native async support, e.g.:

https://github.com/toreamun/asyncio-paho

PierreF commented 4 months ago

That's a nice issue... the cause is pretty obscure to find if you have never seen this kind of problem. tl;dr: we should no longer use select().

Here is how to reproduce the same issue with even stranger code:

import paho.mqtt.client as mqtt
import time

# Here the magic happens :) Keep 1019 extra file descriptors open.
files = [open("/etc/hosts") for _ in range(1019)]

mqttc = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
mqttc.connect("mqtt.eclipseprojects.io")
mqttc.loop_start()

time.sleep(5)  # Give the network time to complete the handshake
print(mqttc.is_connected())

This will fail: the client will not be connected. To fix this code, just change the number 1019 to 1018 :)

More seriously, the issue is:

>>> mqttc._sockpairR
<socket.socket fd=1024, family=2, type=1, proto=0, laddr=('127.0.0.1', 52282), raddr=('127.0.0.1', 45195)>
>>> select.select([mqttc._sockpairR], [], [], 1)  # This is approximately what loop does
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: filedescriptor out of range in select()

The issue is that select() (only on Linux?) can't work with FDs >= 1024.

WARNING: select() can monitor only file descriptors numbers that are less than FD_SETSIZE (1024) -- https://manpages.debian.org/unstable/manpages-dev/select.2.en.html
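As a quick way to see the limit without opening a thousand files, here is a minimal sketch. It relies on CPython rejecting out-of-range descriptor numbers before it calls select(2), so the number 1024 does not even need to refer to an open file. The helper name `select_accepts` is made up for this illustration, and the result for 1024 assumes a platform where FD_SETSIZE is 1024 (Linux in my case).

```python
import select
import socket


def select_accepts(fd):
    """Return True if select.select() is willing to watch this descriptor."""
    try:
        select.select([fd], [], [], 0)  # timeout 0: poll once, do not block
    except ValueError:
        # Raised for descriptor numbers >= FD_SETSIZE ("filedescriptor out of range")
        return False
    return True


if __name__ == "__main__":
    reader, writer = socket.socketpair()
    print(select_accepts(reader.fileno()))  # a low-numbered, open descriptor
    print(select_accepts(1024))             # expected False where FD_SETSIZE is 1024
    reader.close()
    writer.close()
```

Note that the limit is on the descriptor number, not on how many descriptors a single select() call is given.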

In your program, about 340 connections should work. A socket pair (as its name says) creates 2 FDs. 340 * 3 (the MQTT socket and the two sockets of the socket pair) = 1020. Then add stdin, stdout and stderr -> 1023.
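The same estimate written out as a small calculation. The constant and function names are only illustrative, and the figure is an upper bound: it assumes nothing else in the process (log files, other sockets) holds a file descriptor.

```python
FD_SETSIZE = 1024      # select() can only watch descriptors numbered 0..1023
STD_STREAMS = 3        # stdin, stdout, stderr
FDS_PER_CLIENT = 3     # the MQTT socket plus the two ends of the socket pair


def max_clients(fd_limit=FD_SETSIZE, reserved=STD_STREAMS, per_client=FDS_PER_CLIENT):
    """Estimate how many clients fit before a descriptor number reaches fd_limit."""
    return (fd_limit - reserved) // per_client


if __name__ == "__main__":
    print(max_clients())  # 340
```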

The immediate fix is to not use select(), which means not using loop(), loop_start() or loop_forever(). This mostly means using the external loop, a.k.a. asyncio (either through a third-party library that wraps it, or directly; there are some examples). It should also be possible to use multiple processes to spread the connections and avoid reaching FD number 1024, but I think this is too complex for the need.
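To give an idea of what "external loop" means in practice, here is a rough, untested-against-a-broker sketch of an asyncio bridge. It uses paho's socket callbacks (on_socket_open, on_socket_close, on_socket_register_write, on_socket_unregister_write) together with loop_read(), loop_write() and loop_misc(). The class name `AsyncioBridge` is hypothetical, the one-second loop_misc() interval is an arbitrary choice, and it assumes an event loop that supports add_reader()/add_writer() (the default loop on Linux does).

```python
import asyncio

MQTT_ERR_SUCCESS = 0  # same value as paho.mqtt.client.MQTT_ERR_SUCCESS


class AsyncioBridge:
    """Hypothetical helper: drive a paho-style client from an asyncio loop."""

    def __init__(self, loop, client):
        self.loop = loop
        self.client = client
        self._misc_task = None
        # paho calls these hooks whenever its network socket changes state.
        client.on_socket_open = self._on_socket_open
        client.on_socket_close = self._on_socket_close
        client.on_socket_register_write = self._on_socket_register_write
        client.on_socket_unregister_write = self._on_socket_unregister_write

    def _on_socket_open(self, client, userdata, sock):
        # Let asyncio (not select()) tell us when the socket is readable.
        self.loop.add_reader(sock, client.loop_read)
        self._misc_task = self.loop.create_task(self._misc_loop())

    def _on_socket_close(self, client, userdata, sock):
        self.loop.remove_reader(sock)
        if self._misc_task is not None:
            self._misc_task.cancel()

    def _on_socket_register_write(self, client, userdata, sock):
        # paho has data queued: ask to be called when the socket is writable.
        self.loop.add_writer(sock, client.loop_write)

    def _on_socket_unregister_write(self, client, userdata, sock):
        self.loop.remove_writer(sock)

    async def _misc_loop(self):
        # Keepalive and timeout handling; stops once the client reports an error.
        while self.client.loop_misc() == MQTT_ERR_SUCCESS:
            await asyncio.sleep(1)
```

With a real client, the idea would be to build the client as usual, wrap it with `AsyncioBridge(asyncio.get_running_loop(), client)` before calling `client.connect(...)`, and never call loop_start(), so the socket pair and the select() call are not involved.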

The right fix is to change paho so that it stops using select() and uses a modern solution (probably Python's selectors module).
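For reference, this is roughly what waiting on a socket looks like with the standard-library selectors module. It is not paho's code, only a sketch of the technique, and `wait_readable` is a name invented here. On Linux, `selectors.DefaultSelector` is backed by epoll and so is not bound by FD_SETSIZE; on platforms where it falls back to a select()-based selector the limit would still apply.

```python
import selectors
import socket


def wait_readable(sock, timeout=1.0):
    """Return True if sock becomes readable within timeout seconds."""
    with selectors.DefaultSelector() as sel:
        sel.register(sock, selectors.EVENT_READ)
        # select() returns a list of (key, events) pairs; empty means timeout.
        return bool(sel.select(timeout))


if __name__ == "__main__":
    reader, writer = socket.socketpair()
    writer.send(b"\0")            # wake-up byte, like a socket pair used for signalling
    print(wait_readable(reader))  # True
    reader.close()
    writer.close()
```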

showfuture commented 4 months ago

> If you really need 1000 threads I would strongly suggest a library with native async support, e.g.:
>
> https://github.com/toreamun/asyncio-paho

Thanks, I will try!

showfuture commented 4 months ago

> This mostly means use the external loop a.k.a an asyncio (either with a third-party that wrap it, or directly - there is some example

Can you provide me with some examples or other packages that can solve this problem?

PierreF commented 4 months ago

> > This mostly means use the external loop a.k.a an asyncio (either with a third-party that wrap it, or directly - there is some example
>
> Can you provide me with some examples or other packages that can solve this problem?

I'm not using paho-mqtt with asyncio, so I don't really know of one. I've seen https://github.com/sbtinstruments/aiomqtt mentioned in another issue. You can also look at:

showfuture commented 4 months ago

thanks!