eclipse-paho / paho.mqtt.python

Suggestion: pedantic string encoding management #144

Open cladmi opened 7 years ago

cladmi commented 7 years ago

In the same way that Python 3 removed automatic conversion from string to bytes, I would like a way to prevent publish/will_set from auto-converting payloads. My problem started with paho encoding Python 2 bytes to UTF-8 even when it should not (I saw the PR to fix it), and I then tried to check whether encoding was handled correctly in my application.

Now in my client I just overrode 'publish' in a subclass to assert that the payload is not a unicode string (and to convert bytes to bytearray as a workaround for the bug). I had tried to take care of encoding from the beginning, but this showed me many places where auto-conversion had allowed bad string handling in my code.
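A minimal sketch of that kind of override, assuming Python 3 (where text is `str`). `RecordingClient` is a hypothetical stand-in used only so the example runs without a broker; `StrictPayloadMixin` and `StrictClient` are illustrative names, not part of paho. The same mixin could presumably be combined with a real client class whose `publish` takes `(topic, payload, qos, retain)`.

```python
class RecordingClient:
    """Hypothetical stand-in for an MQTT client; it only records what it is given."""

    def __init__(self):
        self.sent = []

    def publish(self, topic, payload=None, qos=0, retain=False):
        self.sent.append((topic, payload))


class StrictPayloadMixin:
    """Refuse text payloads so that encoding is always explicit at the call site."""

    def publish(self, topic, payload=None, qos=0, retain=False):
        if isinstance(payload, str):
            raise TypeError(
                "payload is text; encode it explicitly, e.g. payload.encode('utf-8')"
            )
        return super().publish(topic, payload, qos, retain)


class StrictClient(StrictPayloadMixin, RecordingClient):
    pass


client = StrictClient()
client.publish("sensors/temp", "21.5".encode("utf-8"))  # accepted: already bytes
try:
    client.publish("sensors/temp", "21.5")  # rejected: text would be auto-encoded
except TypeError as exc:
    print("rejected:", exc)
```

Raising here is the strictest option; it makes every implicit conversion visible immediately, at the cost of failing at runtime on code paths that are rarely exercised.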

Also, in practice, paho is able to automatically encode to UTF-8 when publishing but cannot, of course, decode automatically when receiving, so the magic is not symmetric.
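To illustrate the asymmetry, here is a small sketch with both directions made explicit. `to_wire` and `from_wire` are hypothetical helper names, not paho functions; the point is only that the receiver has to know the encoding, because the bytes themselves do not carry it.

```python
def to_wire(text, encoding="utf-8"):
    """Encode text to bytes explicitly before publishing."""
    return text.encode(encoding)


def from_wire(raw, encoding="utf-8"):
    """Decode a received payload; the caller must know which encoding was used."""
    return bytes(raw).decode(encoding)


# Round trip works when both sides agree on UTF-8.
assert from_wire(to_wire("température")) == "température"

# A sender that used Latin-1 produces bytes that are not valid UTF-8,
# so a receiver that guesses UTF-8 fails instead of silently mis-decoding.
latin1_payload = to_wire("température", encoding="latin-1")
try:
    from_wire(latin1_payload)
except UnicodeDecodeError:
    print("payload is not UTF-8; encoding must be agreed out of band")
```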

Ideas on how to implement it:

I would even make the auto-conversion raise a warning when it is not respected. Crashing would be problematic, as it can happen dynamically in a really well-hidden case. But this is a maintainer's choice, with other problems in mind.
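A sketch of the warn-instead-of-crash idea, using the standard `warnings` module. `coerce_payload` is a hypothetical helper, not an existing paho function, and the choice of `UnicodeWarning` as the category is an assumption made for illustration.

```python
import warnings


def coerce_payload(payload):
    """Return a payload ready to send, warning when text is encoded implicitly."""
    if isinstance(payload, str):
        warnings.warn(
            "text payload implicitly encoded as UTF-8; pass bytes instead",
            UnicodeWarning,
            stacklevel=2,  # point the warning at the caller, not at this helper
        )
        return payload.encode("utf-8")
    return payload


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    data = coerce_payload("21.5")

print(data, len(caught))
```

Because it is an ordinary warning, an application that wants strict behaviour could turn it into an exception with `warnings.simplefilter("error", UnicodeWarning)`, while others can leave it as a log line.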

jamesmyatt commented 7 years ago

I'm inclined to agree that paho should not take it upon itself to convert payloads, if possible. When it is absolutely necessary, it should be configurable as far as possible and should raise a warning when it is not configured.

Note that it is part of the MQTT spec for all of the following to be UTF-8 strings: Protocol Name, ClientId, Will topic, User name, Topic name, Topic filter. But I don't see any requirement for any other field to be UTF-8. Furthermore, if any of these fields contain ill-formed UTF-8, then the server or client MUST close the network connection.
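As a rough sketch of what checking those fields could look like: MQTT 3.1.1 (section 1.5.3) limits such strings to 65535 bytes and forbids the null character and surrogate code points. `check_mqtt_utf8` is a hypothetical helper, not part of paho; it relies on Python's strict UTF-8 decoder, which already rejects ill-formed sequences and encoded surrogates.

```python
MAX_MQTT_STRING_BYTES = 65535  # the length prefix is a two-byte integer


def check_mqtt_utf8(raw):
    """Decode a protocol string field, raising ValueError if it is not acceptable."""
    if len(raw) > MAX_MQTT_STRING_BYTES:
        raise ValueError("string longer than 65535 bytes")
    # Strict decoding raises UnicodeDecodeError (a ValueError subclass)
    # on ill-formed UTF-8, including encoded surrogates.
    text = bytes(raw).decode("utf-8")
    if "\x00" in text:
        raise ValueError("null character U+0000 is not allowed")
    return text


print(check_mqtt_utf8(b"sensors/kitchen/temp"))
for bad in (b"\xff\xfe", b"a\x00b"):
    try:
        check_mqtt_utf8(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

Payloads are deliberately not passed through this check, since the spec places no UTF-8 requirement on them.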

Also: https://github.com/mqtt/mqtt.github.io/wiki/clarify_utf8_strings