WebwareForPython / w4py3

Webware for Python 3
MIT License
20 stars 6 forks source link

w4py3 can't properly process binary post data #14

Closed zarquon5 closed 1 year ago

zarquon5 commented 1 year ago

To whom it may concern,

If you post a request to w4py3 with a binary content type of application/octet-stream or, say, application/x-7z-compressed, w4py3 will always try to decode the binary payload, usually resulting in a traceback like this:

Traceback (most recent call last):
  File "c:\ProgramData\conda\envs\myenv\lib\site-packages\webware\HTTPRequest.py", line 41, in __init__
    self._fields = FieldStorage.FieldStorage(
  File "c:\ProgramData\conda\envs\myenv\lib\site-packages\webware\WebUtils\FieldStorage.py", line 212, in __init__
    self.read_single()
  File "c:\ProgramData\conda\envs\myenv\lib\site-packages\webware\WebUtils\FieldStorage.py", line 409, in read_single
    self.read_binary()
  File "c:\ProgramData\conda\envs\myenv\lib\site-packages\webware\WebUtils\FieldStorage.py", line 434, in read_binary
    self.file.write(data.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 2: invalid start byte

This is because the FieldStorage code in w4py3, lifted from the deprecated cgi.py module, does an internal test on "whether or not the storage should be a binary file, viz:

            if self._binary_file:
                self.file.write(data)
            else:  # fix for issue 27777
                self.file.write(data.decode())     # <----- the problem line

However, unexpectedly, this self._binary_file flag is not set based on the Content type, but rather if and only if there is a Content-disposition header, and then only if a filename attribute in the options dictionary for that header (even though, as far as I can tell, that filename is otherwise ignored, viz:

        cdisp, pdict = '', {}
        if 'content-disposition' in self.headers:
            cdisp, pdict = parse_header(self.headers['content-disposition'])
            fslog.write(f'cdisp and pdict found in headers: {cdisp}, {pdict}\n')
        self.disposition = cdisp
        self.disposition_options = pdict
        self.name = None
        if 'name' in pdict:
            self.name = pdict['name']
        self.filename = None
        if 'filename' in pdict:
            self.filename = pdict['filename']
        self._binary_file = self.filename is not None

(I do not understand the reasoning behind this logic, as this header is normally associated with responses, rather than requests. But I digress...)

And, unfortunately, the HTTPRequest class does not pass headers into the FieldStorage constructor and the FieldStorage constructor does not parse this header out of the passed environment to create its own internal headers dictionary, viz:

        if headers is None:
            headers = {}
            if method == 'POST':
                headers['content-type'] = 'application/x-www-form-urlencoded'
            if 'CONTENT_TYPE' in environ:
                headers['content-type'] = environ['CONTENT_TYPE']
            if 'QUERY_STRING' in environ:
                self.qs_on_post = environ['QUERY_STRING']
            if 'CONTENT_LENGTH' in environ:
                headers['content-length'] = environ['CONTENT_LENGTH']

        else:
            if not (isinstance(headers, (Mapping, Message))):
                raise TypeError(
                    'headers must be mapping or an instance of'
                    ' email.message.Message')
        self.headers = headers

Because of the above logic, the _binary_file flag is therefore never set to true, and the upshot of this is that you can't post binary data to a w4py3 servlet like you could in w4py2, because read_binary() now always try to run data.decode() on it, which results in the above traceback for non-unicode POST payloads.

I'm not sure what the right fix here is, but my gut tells me that "binary" test should be based on the content-type, rather than the presence of the filename attribute in the content-disposition header; am I missing something?

One other thing: the exception handler for the FieldStorage constructor still assumes that the cgi package is importable:

            # Protect the loading of fields with an exception handler,
            # because bad headers sometimes can break the field storage
            # (see also https://bugs.python.org/issue27777).
            try:
                self._fields = FieldStorage.FieldStorage(
                    self._input, environ=self._environ,
                    keep_blank_values=True, strict_parsing=False)
            except Exception:
                self._fields = cgi.FieldStorage(keep_blank_values=True)
                traceback.print_exc(file=sys.stderr)

But, given that original cgi module is already deprecated and scheduled to go away in python 3.13, is there a plan to replace all of this the FieldStorage logic with the recommended (alas, non-core) "multipart" package?

Cito commented 1 year ago

Hi @zarquon5 and thanks for reporting this. FieldStorage is indeed a sore spot of Webware (and the standard lib).

Webware always had its own modified FieldStorage class, because it handles parameters a bit differently from the standard (see the docstring of WebUtils.FieldStorage). It had been done by subclassing from the original class (which frequently broke due to changes in its signature in newer Python versions) , but recently the class was completely incorporated into Webware. This is already in anticipation of the removal of the cgi module from the standard lib and also makes it possible to fix Python issue 71964 and other issues which will not be fixed any more in the standard lib.

This also allowed me to fix your problem in the main branch which I was able to reproduce (well more like band-aid than fix, but it works).

I do not understand the reasoning behind this logic, as this header is normally associated with responses, rather than requests.

As far as I understand the reason is that multipart requests use this header as well.

you can't post binary data to a w4py3 servlet like you could in w4py2, because read_binary() now always try to run data.decode() on it, which results in the above traceback for non-unicode POST payloads.

The main problem here is that the cgi module was written long ago for Python 2 which used byte strings everywhere. When it was ported to Python 3, they tried to distinguish between native strings and byte strings, but as you noticed the Python 3 this has some flaws and does not work properly in some cases, for instance the above mentioned issue 71964, or the case you detected. It was not a problem for w4py2 because it used Python 3. But now that w4py3 uses Python 3, these problems pop up.

The data.decode() was introduced to fix issue 71964, therefore we cannot simply remove it.

I'm not sure what the right fix here is, but my gut tells me that "binary" test should be based on the content-type, rather than the presence of the filename attribute in the content-disposition header; am I missing something?

In principle I would agree, but deriving it from the content-type would be rather difficult. For instance, the application/ mime types are partly binary (like application/octet-stream or application/zip), and partly text based (like application/json or application/xhtml+xml), and as far as I see Python does not offer a way to discern between these types.

So as a fix I now simply try to first decode the file and if that fails, set _binary_file to True. Yes, it's a bit ugly, because theoretically you could have binary content that is decodable as text, and then you would get it back as a native string instead of byte string. But at least it does not break.

I have also added unit tests that cover this case (In addition to the unit tests from the standard lib and for the existing fix). So I think the current FieldStorage is pretty robust now (at least more robust than the one in the standard lib).

One other thing: the exception handler for the FieldStorage constructor still assumes that the cgi package is importable

That's currently not a problem since it's still available in all supported Python versions. But I have changed it now, thanks for pointing this out.

is there a plan to replace all of this the FieldStorage logic with the recommended (alas, non-core) "multipart" package?

Yes, that's the long-time goal. But it may be difficult to do that without breaking existing code. The current 3.0 release aims at being as compatible as possible since most people use Webware to run old code. Maybe there will be a minor or major release which introduces breaking changes in the future, that would be then the opportunity to get rid of FieldStorage.

I would highly appreciate if you could check out the main branch and give feedback or ideas for improvement.

One improvement I could think of is to unconditionally set _binary_file to True for all well-known binary mime types starting with application/... (like the ones you mentioned) and all mime types starting with everything else except text/.... The existing fix would then cover the remaining cases. What do you think?

zarquon5 commented 1 year ago

Here's what ChatGPT had to say on the subject, when asked about the best method for reliably detecting binary content in payloads, and a comprehensive list of binary content-types:

"There are several Content-Types that are commonly used to indicate binary data in an HTTP POST request. Here's a list of some common ones:

  1. application/octet-stream: This Content-Type is used to indicate arbitrary binary data. It does not specify any particular format for the data.
  2. application/octet-stream; charset=binary: This Content-Type is similar to the plain "application/octet-stream" type, but with the addition of the "charset=binary" parameter. This explicitly specifies that the data is binary.
  3. image/*: This Content-Type is used to indicate image data. Image data is typically encoded in a binary format, such as JPEG or PNG.
  4. audio/*: This Content-Type is used to indicate audio data. Audio data is typically encoded in a binary format, such as MP3 or WAV.
  5. video/*: This Content-Type is used to indicate video data. Video data is typically encoded in a binary format, such as MPEG or AVI.
  6. application/pdf: This Content-Type is used to indicate PDF documents, which are typically encoded in a binary format.
  7. application/zip: This Content-Type is used to indicate ZIP archives, which are also typically encoded in a binary format.
  8. application/x-gzip: This Content-Type is used to indicate data that has been compressed using the gzip algorithm. The compressed data is binary.
  9. application/x-7z-compressed: This Content-Type is used to indicate data that has been compressed using the 7-Zip compression algorithm, which is commonly used for archiving and file compression.

It's important to note that these Content-Types do not guarantee that the payload of an HTTP POST request is binary data, as the actual content of the request body can vary even within the same MIME type. However, if the Content-Type header is set to one of these values, it's more likely that the payload is binary data. It's [also] important to note that these headers are not always present in an HTTP POST request, and even when they are, they may not provide a definitive answer to whether the payload is string or binary data. Therefore, it's usually best to examine the actual content of the request body to determine the type of data being sent."

So I like your hybrid approach - assume binary if one of these common content types, and also be prepared to switch to binary if we guess incorrectly and the decode doesn't work. Since I think all data can be binary, I don't think you lose anything by defaulting to it, and I think this should also minimize the need for the external exception handler and fallback to the cgi module.

Cito commented 1 year ago

Ok, I'll implement something along these lines and cut a new release later next week.

Cito commented 1 year ago

This has been released in version 3.0.8 now.