SeasideSt / Seaside

The framework for developing sophisticated web applications in Smalltalk.
MIT License

Add option to WAResponse>>#document: to send documents without (re)encoding them #1358

Open eMaringolo opened 1 year ago

eMaringolo commented 1 year ago

In some cases the content of a WADocument needs to be sent without encoding it, even if the MIME type is text-based (i.e. not binary), or if the MIME type specifies a character encoding that matches that of the active codec.

This is needed when the content of a file library is created from a UTF-8 encoded file (e.g. a CSS file), but the content is saved as a String literal, using the String only as a container of the UTF-8 bytes. This causes a double conversion when serving that file.
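The double conversion described above can be reproduced in a playground. This is only a minimal sketch, assuming Pharo's String>>utf8Encoded and ByteArray>>asString (other dialects, including VAST, may name or implement these differently); the variable names are purely illustrative.

```smalltalk
"Sketch (assumes Pharo): a String used as a container of UTF-8 bytes is encoded twice."
| eAcute utf8Bytes container doubleEncoded |
eAcute := String with: (Character value: 233).   "the single character é (U+00E9)"
utf8Bytes := eAcute utf8Encoded.                 "the UTF-8 form of é: bytes 195 169"
container := utf8Bytes asString.                 "one Character per byte, not the character é"
doubleEncoded := container utf8Encoded.          "each of those Characters is encoded again"
doubleEncoded                                    "four bytes (195 131 194 169) instead of two"
```

A client decoding those four bytes as UTF-8 would see two characters (Ã©) where the file contained one.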

Also, a user may have uploaded a text file (text/csv) whose raw contents were saved to disk (it is not possible to know which encoding the uploaded file has), and it is expected to be sent back as it was uploaded, regardless of the codec used (which also avoids a possible double conversion or misconversion).

WAResponse>>#document: aDocument binary: aBoolean
    | document mimeType isBinary |
    document := aDocument seasideMimeDocument.
    self attachmentWithFileName: document fileName.
    mimeType := document mimeType.
    self contentType: mimeType.

    "Treat the content as binary when requested explicitly or when the MIME type is binary."
    (isBinary := aBoolean or: [mimeType isBinary]) ifTrue: [self binary].

    "Let the codec encode only text content whose declared charset differs from the active codec.
     Binary content, or text already encoded with the same codec, is sent as raw bytes."
    self nextPutAll: ((isBinary not
        and: [mimeType charSet ~= self requestContext codec name])
            ifTrue: [document content]
            ifFalse: [document content asByteArray])

Maybe there is a way to refactor WAResponse>>#document: to use the new method.

On some platforms this might require GRCodecStream>>#nextPutAll: to check whether the argument is a ByteArray and, if so, not encode it.

E.g.

GRCodecStream>>#nextPutAll: aStringOrSymbolOrByteArray

    stream nextPutAll: (
        aStringOrSymbolOrByteArray isByteArray
            ifTrue: [aStringOrSymbolOrByteArray]
            ifFalse: [codec encode: aStringOrSymbolOrByteArray])

We tested this in VAST and it works without breaking anything else; we will likely create a pull request to integrate it.

As a side note, I think that GRCodecStream should always produce ByteArrays as output, but I know that'd be a major change.

svenvc commented 1 year ago

I am not a Seaside specialist, let alone on any port, but I don't think it is Seaside's job to fix encoding issues or other trouble.

If you have a String that is actually UTF-8 encoded bytes, then the problem is how you got there, it should be fixed there and then.

Obviously Seaside is capable of serving any binary file correctly (as it does in its file libraries in lots of variations), so technically anything should be possible.

eMaringolo commented 1 year ago

> I am not a Seaside specialist, let alone on any port, but I don't think it is Seaside's job to fix encoding issues or other trouble.

This is exactly the rationale behind the issue, you should be able to send a WADocument with whatever content and encoding you want and Seaside should not alter it. By default it encodes it, but if you want to send it as is you should have an option.

> If you have a String that is actually UTF-8 encoded bytes, then the problem is how you got there, it should be fixed there and then.

I agree with this as well... but if WAFileLibrary compiles files whose extensions it interprets as non-binary MIME types into methods that return a String, then you're in Seaside's hands. Compiling everything read from disk as a ByteArray should have been the initial choice, but here we are...

svenvc commented 1 year ago

> This is exactly the rationale behind the issue, you should be able to send a WADocument with whatever content and encoding you want and Seaside should not alter it. By default it encodes it, but if you want to send it as is you should have an option.

I am pretty sure that already works, after all that is what the file library and handler already do (i.e. taking bytes, as for an image and serving them unaltered with any mime type). At least as far as I can see in Pharo / Seaside 3.

Are you sure this is not related to the VAST Seaside port/implementation? Did you try anywhere else?

eMaringolo commented 1 year ago

I had not, but now I have, and I noticed that GRPharoPlatform>>readFileStreamOn:do:binary: forces the input stream to be valid UTF-8, so anything other than valid UTF-8 cannot be read. So if I want to read an ISO-8859-1 or Windows 949 (Korean) encoded file in Pharo, it doesn't work (I tried).

I guess it is because it forces a "MIME type" to be compiled as a String (or WideString), instead of being a ByteArray. So in Pharo, for UTF-8 encoded files, each character in the literal compiled string will be a Character with a valid Unicode codepoint (without any clustering), which when reencoded in the output will produce the same UTF-8 bytes.
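The round trip described above can be checked in a playground. This is a hedged sketch, assuming Pharo's String>>utf8Encoded and ByteArray>>utf8Decoded; the variable names are illustrative and nothing here is Seaside API.

```smalltalk
"Sketch (assumes Pharo): decoding UTF-8 into real Characters first makes re-encoding lossless."
| fileBytes decoded reencoded |
fileBytes := (String with: (Character value: 233)) utf8Encoded.   "bytes as read from a UTF-8 file"
decoded := fileBytes utf8Decoded.        "a String holding one Character, code point 233"
reencoded := decoded utf8Encoded.        "what the codec produces when serving the String"
reencoded = fileBytes                    "true: the served bytes equal the bytes on disk"
```

This only holds when the file really is UTF-8; bytes in another encoding cannot take the decoding step in the first place.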

I'll think about how to work around this; maybe the changes must be applied ONLY in the VAST adaptor layer, as I don't foresee GRPharoPlatform>>readFileStreamOn:do:binary: being changed to read non-UTF-8 encoded files.

Thanks for the input.

marschall commented 4 months ago

> using the String only as a container of the UTF-8 bytes

Sorry to be that guy but this should be avoided. If you're just sending bytes then String is the wrong abstraction.

> and it is expected to be sent back as it was uploaded, regardless of the codec used

This is very error-prone. You're relying on the downloader magically getting the same encoding as the uploader.

As a workaround you may try something like:

WAResponse new
    binary;
    document: ( 
        WAMimeDocument
            on: aByteArray
            mimeType: (WAMimeType fromString: 'text/csv'))

If it works it's only because we do not yet have an explicit #text mode.

eMaringolo commented 4 months ago

> Sorry to be that guy but this should be avoided. If you're just sending bytes then String is the wrong abstraction.

I agree, but there were historical reasons for it: if you look at GRCodec>>encodedStringClass you'll find it returns String. As a note, in the next version of VAST we changed that to be ByteArray.

As for your workaround, I think it could work perfectly, and it's similar to our recommendation to one of our customers who reported this. Additionally, I believe it would be beneficial and non-disruptive to have an explicit option to send content 'as is' regardless of the MIME type.

marschall commented 4 months ago

> I agree, but there were historical reasons for it

And we have spent years trying to slowly move away from it. It's not behaviour we want to encourage.

> I believe it would be beneficial and non-disruptive to have an explicit option to send content 'as is' regardless of the MIME type.

Yeah, but not by pumping it through the image.