cdent / tank

Tiddlers Are N[eo][tw] Knowledge
BSD 3-Clause "New" or "Revised" License
12 stars 7 forks source link

Error uploading resource with non utf-8 bytes #137

Closed edrex closed 9 years ago

edrex commented 9 years ago

Example using tank helper script from https://gist.github.com/edrex/8aaea3f453fe46237c1f

tank PUT /bags/mgsd/tiddlers/app Content-Type:'text/html' < a_file_with_non_utf8_bytes.html

HTTP/1.1 400 Bad Request
Connection: close
Content-Encoding: gzip
Content-Length: 1245
Content-Type: text/html; charset=utf-8
Date: Sat, 11 Apr 2015 17:35:35 GMT
Server: Apache/2.2.22 (Debian)
Vary: Accept-Encoding

Drag and drop upload through browser:

screen shot 2015-04-11 at 10 37 14 am

HTTP/1.1 500 Internal Server Error
Date: Sat, 11 Apr 2015 17:36:50 GMT
Server: Apache/2.2.22 (Debian)
Access-Control-Allow-Origin: https://tank.peermore.com
Access-Control-Expose-Headers: ETag, Content-Type, X-Tank-Key
Access-Control-Allow-Credentials: true
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 563
Connection: close
Content-Type: text/plain; charset=UTF-8

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/httpexceptor/__init__.py", line 58, in __call__
    return self.application(environ, start_response)
  File "/home/cdent/tiddlywebs/tank.peermore.com/tiddlywebplugins/tank/httperror.py", line 25, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlyweb/web/wsgi.py", line 190, in __call__
    output = self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlyweb/web/serve.py", line 158, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlyweb/web/serve.py", line 117, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlyweb/web/query.py", line 44, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlyweb/web/wsgi.py", line 128, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlyweb/web/extractor.py", line 34, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlyweb/web/wsgi.py", line 38, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlyweb/web/negotiate.py", line 30, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlywebplugins/cors.py", line 87, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlywebplugins/csrf.py", line 93, in __call__
    return app()
  File "/usr/local/lib/python2.7/dist-packages/tiddlywebplugins/csrf.py", line 75, in app
    output = self.application(environ, fake_start_response)
  File "/usr/local/lib/python2.7/dist-packages/selector.py", line 137, in __call__
    return app(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/tiddlywebplugins/utils.py", line 87, in _require_role
    return handler(environ, start_response, *args, **kwds)
  File "/home/cdent/tiddlywebs/tank.peermore.com/tiddlywebplugins/tank/closet.py", line 47, in closet
    target_name)
  File "/home/cdent/tiddlywebs/tank.peermore.com/tiddlywebplugins/tank/closet.py", line 93, in _regular_tiddler
    tiddler.text = content.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 252830: invalid start byte

I'm not sure if there's a bug here.

Unrelated: 400s always come back as HTML. Should they respect Accept: text/plain?

cdent commented 9 years ago

Should non utf-8 resources be accepted?

Generally speaking TiddlyWeb takes a pretty hard line about encoding. It expects text-based content to be utf-8 encoded. There are a few different reasons for this:

Should the 400 provide a more informative message?

Yes, was there any message in the body of the 400 response?

Unrelated: 400s always come back as HTML. Should they respect Accept: text/plain?

There's been discussion in the past about doing content negotiation for error responses and the general consensus was it wasn't worth it. In default TiddlyWeb the responses are text/plain, which seems to align okay with what you would expect when using the system mostly as an API and where the status codes are the most meaningful part of the response. Tank overrides this behavior to templatize 400 responses into HTML because that's the level at which the errors matter: when using the API.

It would be possible (probably without too much difficulty) to get it to attend to the accept header but are the semantics there appropriate: Accept is used when things go well to say "I want the resource I've requested to come back as X". Does it also apply for error messages?

In any case, the fact that you're getting an untrapped 500 on drag and drop is a bug, that bit of "closet" code is not being careful when making a call to decode. The expected response there would have been the 400 that you're seeing elsewhere. I'll look into fixing that and in the process see whether the 400 is getting the message it deserves.

We can explore accepting other encodings if you like, but I'd rather people put their files in a reasonable encoding in the first place. What do you think?

/cc @FND

edrex commented 9 years ago

I agree with all this (UTF-8 activism, semantic ambiguity of interpreting Accept for error responses).

My main issue was that the 400 response didn't say why the request was bad.

Old response text:

    <div class="message"> </div>
    <div class="extra">
        You left something out of that request, perhaps go back and
        try again.
     </div>

After 2bd6ca9:

    <div class="message">
      unable to decode tiddler, utf-8 expected: 'utf8' codec can't decode byte 0xab in position 252830: invalid start byte
    </div>
    ...

And the DND upload is correctly throwing a 400 now.

For me this issue seems resolved :tropical_drink:


WRT content negotiation for error messages, the original HTTP/1.1 RFC is fairly clear:

Any response containing an entity-body MAY be subject to negotiation, including error responses.

while the newer HTTP semantics RFC is more coy:

When responses convey payload information, whether indicating a
   success or an error, the origin server often has different ways of
   representing that information.

My feeling is that the server should return HTML errors only if the request has Accept: text/html, since browsers always send this in the initial request. The main reason is that otherwise extracting a detailed error message is difficult (eg for display in a client-side UI as in the DND component above).

edrex commented 9 years ago

I guess to more directly address the semantics of Accept, I think sometimes it means "this is how I would like the resource" (especially with GET requests) and other times it means "these are my preferred message formats for response messages" (for PUT, POST, DELETE, etc).

FND commented 9 years ago

My feeling is that the server should return HTML errors only if the request has Accept: text/html

That seems like a sensible argument. To me, Accept not only means "this is how I would like the resource", but also "these are the formats I can parse" - so if we send HTML even though plain text would be just as expressive, that's kind of nasty. (I'm not advocating parsing error messages themselves, of course, but the client might want to pass them through to the UI, logs etc., so HTML is just noise there.)

cdent commented 9 years ago

So it appears the outcome of this is that it would be nice if tank only sent text/html on errors some of the time. So I've created #138 as a reminder for that.