Open mnaberez opened 5 years ago
Description eventlistener header tokens:
len | An integer indicating the number of bytes in the event payload, aka the PAYLOAD_LENGTH
From above, it sounds like the meaning of the len
header unintentionally changed from bytes to characters. This is something we should probably investigate. There are a lot of existing eventlisteners.
Thanks for making this. I put my issue over there because it looked related to doing byte->string conversion in an unexpected place to me but you're right, I think, that the recipient of the converted data is different.
According to the docs,
In the below list, we indicate that some event types have a “body” which is a a token set. A token set consists of a set of charaters with space-separated tokens. Each token represents a key-value pair. The key and value are separated by a colon.
It's not clear to me why you would be seeing arbitrary (not utf-8 decodable) binary data in the event listener. Which event types are being listening for where the problem is seen?
Also according to the docs:
len | An integer indicating the number of bytes in the event payload, aka the PAYLOAD_LENGTH
And:
Directly following the linefeed character in the header is the event payload. It consists of PAYLOAD_LENGTH bytes representing a serialization of the event data. See Event Types for the specific event data serialization definitions.
And:
Once it has processed the header, the event listener implementation should read PAYLOAD_LENGTH bytes from its stdin, perform an arbitrary action based on the values in the header and the data parsed out of the serialization.
When Supervisor3 was running under Python2 the payload length did indeed represent the number of bytes to read. In Supervisor4 the payload length now represents characters. My event listener was listening for bytes under Supervisor3 and when I upgraded to Supervisor4 my event listener stopped working because the stream now came as characters and not bytes.
Sorry I misspoke in the last sentence. The stream might have always come as characters and not bytes (not sure tbh) but in Supervisor3 under Python2 that didn't matter because Python didn't really distinguish between bytes and characters very effectively and the payload length represented bytes to read. In Supervisor4 under Python3 the payload length definitely changed to represent characters. So the docs are wrong in the sense that it says bytes when it is actually characters.
I'd actually argue that it'd be better if the supervisor event listener protocol changes to NOT do utf8 conversions and just gives event listeners raw data so that I can decide what the encoding is in my event listener rather than get messages that literally say "indecipherable".
And since you asked but I have yet to answer this is happening on log events. Sorry I didn't say that sooner and for replying to this thread so many times.
Well, the bytes vs. chars thing in the documentation could be addressed just by changing the documentation to say "characters" rather than "bytes". I'm not sure it was really intended to mean bytes, specifically, since in Python 2.x the distinction was somewhat loose. Note that most of the output of events (input to event listeners) is actually logically text rather than binary.
For programs interacting with Supervisor, it would seem that their stdout
and stderr
should logically be text streams - are you saying they are not? TBF, the output in the "indecipherable" case is Undecodable:
followed by the binary data in repr
format, which certainly allows it to be recovered.
I'll leave it to Mike to decide how this should be handled, from a policy point of view. I'm not sure what's best at this point, but I would want to see how many people are running into problems with arbitrary binary data in log streams (to understand the use cases better) before deciding what the best fix is.
If the documentation is changed you should also change the upgrade guide for 3->4 to indicate a breaking change. As I'm sure you know, sys.stdin and sys.stdout in Python3 are byte streams and not text streams. I had to change my code from this:
stdin = sys.stdin.buffer
to this:
stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8", newline='\n')
in order to make .read(PAYLOAD_LENGTH)
return the correct data since without that change .read
would read bytes and not characters and when my log lines contained UTF-8 data the number of bytes did not match the number of characters.
I guess what I'm saying is that this policy is backwards incompatible and should either change to be backwards compatible or the docs should be updated to say "hey we made this breaking change in the migration from 3 to 4." But what I'd strongly prefer is that the documentation and functionality from version 3 stay the same. I say that because (1) it's what the docs say, (2) it's probably how a lotof event listeners were implemented, and (3) it's a better interface to let the receiver of the event listener decide what encoding their log lines are in rather than deciding for them.
I appreciate you taking the time to understand the issue that I'm facing.
As I'm sure you know, sys.stdin and sys.stdout in Python3 are byte streams and not text streams.
Which version of Python 3 are you running?
$ python3
Python 3.7.3+ (heads/3.7:0ff08b061b, May 11 2019, 09:27:05)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin
<_io.TextIOWrapper name='<stdin>' mode='r' encoding='UTF-8'>
>>> sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
>>> sys.stderr
<_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
$ python3.6
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin
<_io.TextIOWrapper name='<stdin>' mode='r' encoding='UTF-8'>
>>> sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
>>> sys.stderr
<_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
$ python3.5
Python 3.5.7+ (heads/3.5:2bb327816d, May 11 2019, 08:47:36)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin
<_io.TextIOWrapper name='<stdin>' mode='r' encoding='UTF-8'>
>>> sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
>>> sys.stderr
<_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
$ python3.4
Python 3.4.10 (default, May 11 2019, 08:42:05)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin
<_io.TextIOWrapper name='<stdin>' mode='r' encoding='UTF-8'>
>>> sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
>>> sys.stderr
<_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
Ah, you're right, sys.stdin
is a character stream. But in my example I use sys.stdin.buffer
which is a byte stream of type io.BufferedReader
. The reason I use the byte stream wrapped in the character stream is because, by default, TextIOWrapper
implements universal newlines which means that when my log lines have "\r\n" in them (which happens) the "\r\n" gets turned into "\n" and then my calls to .read(PAYLOAD_SIZE)
wait for an extra character that will never come.
means that when my log lines have "\r\n" in them (which happens)
How come, if you're not on Windows? This is a different problem to what you described earlier (which related to arbitrary encodings, not utf-8
).
How come, if you're not on Windows?
My programs run on a Unix platform but sometimes they log garbage that was given to them which can include carriage returns. You know, like a webapp that says in its logs "unauthorized access from foo\r\nbar because it's an invalid username". Sometimes programs log things over which the author has no control. Or maybe I have a third party program running under Supervisor that just does its own thing and I can't change that.
This is a different problem to what you described earlier (which related to arbitrary encodings, not utf-8).
The universal newline thing is a separate problem that is not a problem with Supervisor which is why I didn't mention it before. But since you suggested using sys.stdin
the universal newline problem is why I can't use sys.stdin
.
To restate the actual problem, the problem here is that Supervisor gives characters instead of bytes when it tells you the length of an event's payload. The reason that this is a problem is because the documentation says that it should return bytes. You said, well sys.stdin
is a character stream so just use that -- see? no breaking change? And I said but sys.stdin
has this universal newline problem so one must use sys.stdin.buffer
which is a byte stream. Because of the universal newlines problem, I and likely anyone else who has implemented an event listener in Python3 will be working with a byte stream unless they've decided that they'll never see a carriage return in their log files.
You said, well sys.stdin is a character stream so just use that -- see?
Where did I say that? I was just looking for more information to diagnose the problem, and merely pointed out that sys.stdin
etc. are character streams in response to your comment that they are byte streams. Before you mentioned it, the issue of newlines was for me completely out of the picture for this issue.
The reason that this is a problem is because the documentation says that it should return bytes.
It used to be bytes under Python 2.x, but as we know, Python 2.x plays a bit fast and loose with bytes and text and fudges things here and there. It's not yet clear to me that it needs to be or should be bytes under Python 3.x because, by default, the process streams are character streams under Python 3.x. (But, that's not my call - I'm not an official Supervisor maintainer, merely an interested party).
As I said earlier, the documentation discrepancy could be fixed by updating the documentation, but it doesn't seem like that would solve your problem.
I and likely anyone else who has implemented an event listener in Python3 will be working with a byte stream unless they've decided that they'll never see a carriage return in their log files.
Maybe so. But that would likely be a much smaller number of situations than a more general problem of randomly-encoded binary data appearing in logs.
By the way, I noticed that you are using the buffer
attribute of the sys.stdin
text stream. According to the documentation for it,
This [the
buffer
attribute] is not part of the TextIOBase API and may not exist in some implementations.
Just thought I'd mention it.
Geez I did not read the full documentation on .buffer
apparently. Holy cow Python, between universal newlines and giving character streams for stdin/stdout how do they expect anyone to work with streams in any coherent fashion and make any decisions on their own about encodings or, for that matter, work with binary data? I'm now mostly rescinding my call to change Supervisor to use byte streams. Just as long as the documentation is fixed and the upgrade guide noted that this is a potentially breaking change I am happy.
My problem is actually moot at this point. I have a solution that works, though I don't think it's pretty and it definitely caught me off guard and I don't think I'm doing anything particularly fancy or special. I figured, based on the docs, that other people would likely have this same problem and that Supervisor shouldn't be making encoding decisions. It looks, however, like Python is making a lot of these decisions and Supervisor is trying to living inside of them. So, yeah, probably not as much to do here as I thought.
Ugh. Python. Thanks for your patience with me.
Geez I did not read the full documentation on .buffer apparently.
According to this documentation on sys.stdin
, it actually suggests using buffer
, so you're not wrong to use that - it's just a caveat about portability on other platforms (which is probably less of an issue in this case).
Just as long as the documentation is fixed and the upgrade guide noted that this is a potentially breaking change I am happy.
Seems reasonable to me.
I have a solution that works, though I don't think it's pretty
Care to share it, anyway? Or is it just the
stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8", newline='\n')
you mentioned earlier?
By the way, I had a play with a couple of short scripts and a test file with CRLFs, all saved in ~/tmp:
A file with CRLF
newlines and some Unicode (like Øresund, Malmö and Eyjafjallajökull)
just for testing with.
(should be saved as crlf.txt
with CRLF line endings)
#!/usr/bin/env python3
import os
import sys
fn = os.path.expanduser('~/tmp/crlf.txt')
with open(fn, 'rb') as f:
data = f.read()
byte_len = len(data)
data = data.decode('utf-8')
print('len:%d byte_len:%d\n%s\nfoobarbaz' % (len(data), byte_len, data))
(should be saved as writer.py
and execute-enabled)
#!/usr/bin/env python3
import os
import sys
header = sys.stdin.readline()
print('Header: %s' % header.rstrip())
fields = dict(s.split(':') for s in header.split())
paylen = int(fields['len'])
print('Reading %d chars:' % paylen)
payload = sys.stdin.read(paylen)
print(repr(payload))
(should be saved as reader.py
and execute-enabled)
Then, if I run:
~/tmp $ ./writer.py | ./reader.py
Header: len:112 byte_len:115
Reading 112 chars:
'A file with CRLF\r\nnewlines and some Unicode (like Øresund, Malmö and Eyjafjallajökull)\r\njust for testing with.\r\n'
Notice that I'm using sys.stdin
, and using the len
value provided in the header to read the number of chars specified. This seems to be working as expected (no loss of '\r' characters, and the whole payload is read), so can you provide a similar setup which shows up your problem? My python3
is Python 3.7.3.
@plockaby @vsajip Thank you both for your investigations into this issue and great discussion points made here.
Event listeners running under Supervisor may be written in Python 2 or 3, but they may also be written in other languages. There are open source Supervisor event listeners written in PHP, Ruby, Perl, JavaScript, Golang, and possibly others too. Not all of the existing event listeners use the PROCESS_LOG
events, but ideally a user would be able to easily consume PROCESS_LOG
events in the language of their choice.
A given language may or may not have good Unicode support, and maintaining compatibility with existing event listeners should be a goal. I would like your thoughts on the idea of putting both the bytes-length and the characters-length in the header.
The event header is versioned:
ver:3.0 server:supervisor serial:21 pool:listener poolserial:10 eventname:PROCESS_LOG_STDOUT len:54
This is the version of the events API, not the version of Supervisor. In Supervisor 4, ver:3.0
is still emitted.
Existing event listeners, assuming they are well-written, should be able to accept extra key-value pairs in the event header. I am thinking about the following:
ver:3.0
to ver:4.0
.len:
to be the bytes-length.encoding:utf-8
to indicate character encoding. Supervisor makes various UTF-8 assumptions since 4.0, so we would probably just hardcode this header for now.charlen:
for characters-length in the specified encoding.The rationale for the above suggestion is:
ver:
header and expect len:
to be bytes would continue to work. Existing event listeners would work under Supervisor 3.x or 4.x the same way.ver:4.0
(or the new keys) and would have a choice of reading in bytes by len:
or in characters by charlen:
.Please let me know what you think. If we agree this solution (or similar) would work, I would merge a pull request for it. I think it could be released in a 4.0.x point release, with the changelog saying it is a bug fix for compatibility with existing event listeners.
This seems like a reasonable approach, but if existing event listeners running under Python 3 use sys.stdin
rather than sys.stdin.buffer
, ISTM this approach won't work in general if the len
field has the byte length (this will typically be larger than the char len if there's Unicode present, and so could lead to waiting for bytes that never arrive or reading additional bytes). IMO another way of looking at it is - sys.stdin
goes with len
, whereas sys.stdin.buffer
goes with a byte length.
I'd wait for @plockaby to show us an example config/scripts with a failing scenario (as my simple example works fine with Unicode in the data as well as a mix of \n
and \r\n
, using the len:
value and sys.stdin
). Also, what do the third-party event listeners you (@mnaberez) maintain do? Do they use sys.stdin
or sys.stdin.buffer
? Do we know of other published third-party event listeners out there? The mailing lists don't seem active - is there a Google group or other forum where we can ask for feedback?
I'd contend that Supervisor isn't meant for processing arbitrary binary output from child processes via event listeners - stuff written to stdout
and stderr
by a Supervisor child should be logically text, not binary. I agree that one can't in general expect perhaps-third-party child programs to do the right thing, but current behaviour is to deliver 'Undecodable: ' + a repr of the undecodable binary, which allows full recovery of the binary data by the event listener at the cost of a bit of extra work. If we wanted a more generic solution, we could output 'Undecodable: ' + a base-64 encoded value rather than a Python repr.
If you accept the should-be-text-only logic I've propounded, then len:
would be fine as is - a char length to be used in conjunction with sys.stdin
. I would wait for more information about potentially impacted Supervisor installations before making a change in this area. Perhaps the title of this issue should be changed to Eventlistener "len" header mismatch in the presence of binary data
or similar so that others who run into an issue can spot this and comment on the issue here.
I'll have an example soon. I'll be out of town the next few days, unfortunately. Sorry for the delay.
There's no rush.
This seems like a reasonable approach, but if existing event listeners running under Python 3 use sys.stdin rather than sys.stdin.buffer, ISTM this approach won't work in general if the len field has the byte length (this will typically be larger than the char len if there's Unicode present, and so could lead to waiting for bytes that never arrive or reading additional bytes). IMO another way of looking at it is - sys.stdin goes with len, whereas sys.stdin.buffer goes with a byte length.
Supervisor 4.x was only released to PyPI as stable a few months ago. The number of event listeners using this particular feature under Python 3 is likely very small.
I expect most Python users will not use stdin
or stdin.buffer
directly but instead
will use childutils
. Once we reach a decision on this, we can make childutils
do the right thing under Python 2 or 3.
Most Python programs don't work on 2 and 3 anyway without porting (in my experience). If a
Python programmer is not using childutils
, selecting stdin
vs. stdin.buffer
and which
header to use is a porting issue that we can document.
Also, what do the third-party event listeners you (@mnaberez) maintain do? Do they use sys.stdin or sys.stdin.buffer?
I do maintain a couple of event listeners that use this feature but they are not public and not written
in Python. They forward stdout
messages to a logging service.
Do we know of other published third-party event listeners out there?
We have a list of third-party plugins. Not all are event listeners and not all of the event listeners will use the logs event. However, if an event listener subscribes to all events (events=
in the config file) then this issue could cause any listener to break. The listener can't get out of sync with the payload size. The protocol requires the listener to acknowledge every event.
The mailing lists don't seem active - is there a Google group or other forum where we can ask for feedback?
Unfortunately, the mailing list is in limbo at the moment (issue 1241). It disappeared without notice recently. I've sent emails to its owners but they have not responded yet. There is not another forum yet.
I'd contend that Supervisor isn't meant for processing arbitrary binary output from child processes via event listeners - stuff written to stdout and stderr by a Supervisor child should be logically text, not binary.
A subprocess should be able to emit any binary garbage and this should not cause supervisord
, supervisorctl
, or event listeners to crash. Beyond that, I agree on this point.
If you accept the should-be-text-only logic I've propounded, then len: would be fine as is - a char length to be used in conjunction with sys.stdin. I would wait for more information about potentially impacted Supervisor installations before making a change in this area. Perhaps the title of this issue should be changed to Eventlistener "len" header mismatch in the presence of binary data
The len:
header was documented as meaning bytes-length for over a decade. Having it
mean one thing under Supervisor 3.x and a different thing under 4.x is confusing. This could
be addressed with documentation but we can avoid the confusion altogether by restoring len:
to its original meaning of bytes-length and adding a new header for characters-length.
Supervisor 4.x needs to coexist with Supervisor 3.x for several more years. Supervisor 3.x
is packaged by LTS distributions (e.g. Ubuntu 18.04) and many users install it this way.
I think we should try to keep event listeners as compatible as possible. Having len:
mean
one thing on 3.x and a different thing on 4.x will cause incompatibilities that we can avoid.
This issue affects more than junk binary data from stdout. Most systems will have UTF-8 as the
encoding, which is multibyte. Consider that 1 multibyte UTF-8 character is written to stdout.
Existing event listeners that expect len:
to the bytes-length will be broken. They will not
read enough bytes. I think this may be what happened to @plockaby.
I am still thinking we should probably put len:
back to bytes-length and add charlen:
or similar for all these reasons, but I am certainly open to further discussion.
I don't disagree with anything you said - backward compatibility is important but can be hard to achieve elegantly. I was just putting forward the other viewpoint, but either way, some people will trip up.
Encouraging use of childutils
could be done by means of code examples in the documentation, perhaps explicitly covering the bytes and text scenarios.
A subprocess should be able to emit any binary garbage and this should not cause
supervisord
,supervisorctl
, or event listeners to crash.
Definitely, but of course badly written third-party event listeners might. Currently, AFAIK, with arbitrary binary data, the Supervisor event listener code won't crash, just return 'Undecodable: ' + the repr of the binary. This is sub-optimal for non-Python event listeners, so we could and perhaps should switch to base64 encoding the binary data.
I don't disagree with anything you said - backward compatibility is important but can be hard to achieve elegantly. I was just putting forward the other viewpoint, but either way, some people will trip up.
Yeah, I think that unfortunately no matter what we do, this issue is going to trip up some people. I mainly have the opinion we should put len:
back to its original meaning of bytes-length and add another header for characters-length because I think it will have the smallest impact (that's my hope, anyway).
Encouraging use of childutils could be done by means of code examples in the documentation, perhaps explicitly covering the bytes and text scenarios.
I agree we should do this and have explicit examples in Python. We can't help event listeners in other languages but it will at least draw sufficient attention to the issue.
Definitely, but of course badly written third-party event listeners might.
Yeah, I don't think there's much we can do there. Badly-written event listeners can crash for a lot of other reasons, too. I think the usual result is that supervisord
will rebuffer the event, restart the event listener, and the event listener will crash again. This will repeat until someone notices.
Currently, AFAIK, with arbitrary binary data, the Supervisor event listener code won't crash, just return 'Undecodable: ' + the repr of the binary. This is sub-optimal for non-Python event listeners, so we could and perhaps should switch to base64 encoding the binary data.
I agree with you that stdout
should logically be text. I only have the opinion on len:
going back to bytes because of compatibility concerns. However, since we can't control what a given subprocess will output, we have to deal with the possibility of binary junk.
I think that you're right that we should use base64 (or something not Python-specific) for the binary data in the undecodable message. We should probably prefix this message with supervisord:
or similar so that it's clear the message is from supervisord
, since it's just mixed in the stream with the stuff from stdout.
@plockaby I'd be interested in your thoughts on the proposal as well, since you discovered this and have experience working with event listeners.
@vsajip I am probably not going to be able to work on this for a while. If we all reach agreement and you are able to work on a patch, I'd appreciate that and would be happy to merge it and release it as a point release.
Sorry that I've been AWOL on this conversation. I'm literally putting together my example right now.
Sorry that I've been AWOL on this conversation. I'm literally putting together my example right now.
No problem, it hasn't been very long, and thanks!
I've made this gist: https://gist.github.com/plockaby/10b75dfa72dc952c5bdb7000328c9b72
It has the code for my event listener and two files that I used to test it. I found that using these work:
stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8", newline="\n")
stdin = sys.stdin
But that using either of these does not work:
stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
stdin = sys.stdin.buffer
I can't find any documentation on whether sys.stdin
implements universal newlines by default. PEP278 says that it should but my testing says that it does not. The documentation on stdin is ambiguous in this area, I think, and my google-fu could find no definitive answers and only found people asking whether it does or not.
One thing is certain and that is that my assertion earlier that sys.stdin is a byte stream was woefully wrong and based on a poor understanding of what was happening under code written to use Python2. I haven't tried implementing an event listener for Supervisor running on Python3 using languages other than Python so I'm not sure how this would play anywhere else.
My preference, I suppose, is to see "len" stand for bytes to match the documentation and for "charlen" to indicate the number of characters. Either way someone's event listener will broke with the update to Python3 if they're doing anything outside of ASCII and that's just how it is.
I've made this gist
Thanks. Any chance you could add the Supervisor conf file that you used for testing?
Also, just to confirm my understanding from just reading the Gist - sys.stdin
works as-is with len:
(number of chars, rather than bytes)? So you wouldn't have even seen/raised this issue if you'd just used sys.stdin
in the first place? Which is what one is supposed to do - treat sys.stdin
as a character stream.
I realise that doesn't mean everything works fine for non-Python event listeners, where the nature of stdin
might be different.
I've added a supervisord.conf file to my gist.
You are correct that using sys.stdin
seems to work in my example. I'm cautious about relying on it since the functionality of universal newlines with respect to sys.stdin
, as far as I can determine, is not well documented.
I don't recall it working when I was trying to solve my production problem but that could have been because I was not fully grokking that sys.stdin
already was a text stream and may have been doing additional things to it that broke it more. I can't say with any certainty what my problem initially was, only what my test code shows. My test code that I've shared with you shows that it works. But again, I'm not comfortable relying on it given the lack of documentation around one feature that I can say with certainty did break for me.
I've added a supervisord.conf file to my gist.
Thanks - appreciated.
I hear what you say, but my take-away from it is that we should wait to see if any more errors in this area are reported, in order to gather better failure data, before making any changes.
I hear what you say, but my take-away from it is that we should wait to see if any more errors in this area are reported, in order to gather better failure data, before making any changes.
I don't think we should wait. The 4.x eventlistener protocol is inconsistent with all prior 3.x versions (which are still in widespread use) and inconsistent with the documentation. This is a bug. I don't think we need more failure reports as we seem to already understand the technical reasons behind the problem.
Just to weigh in, I've just hit this problem in my project.
I would prefer keeping len:
to mean bytes, and ideally it would be great if supervisord didn't change the content of the payload if it isn't valid UTF-8, but this is work-aroundable.
I'm running Supervisor 4.2.0 and Python 3.8.0 and in the process of writing an event listener came across this issue. As I understand this issue, using childutils
from the same version of Supervisor and running under the same python as the supervisor process, there shouldn't be an error?
@plockaby wrote comments of issue #1235:
cc: @vsajip