Open pschichtel opened 1 month ago
The POST request uploading the file has the broken name already.
I haven't seen an obvious commit that might have broken this.
Oh, that is very strange! I didn't change anything on the client for some time. I'll need to reproduce it here. What browser are you using?
I did this on Firefox, but I can also test with Chromium.
Here is an example file that reproduces this for me in Firefox and Chromium: äöüÄÖÜß.pdf
It's interesting to see Firefox (left) and Chromium (right) display the characters differently.
So it seems that browsers send the filename UTF-8 encoded in the Content-Disposition MIME multipart header. This results in a wrongly formatted Docspell filename, because http4s will decode the UTF-8 formatted string as ISO-8859-1 (that's what RFC 5987 says, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition).
Here is what happens when doing the same wrong conversion in Python:
>>> bytes("Vertragsübersicht".encode("utf-8")).decode("iso-8859-1")
'VertragsÃ¼bersicht'
The fix would be for the client to send the filename*=UTF-8'' attribute inside the Content-Disposition header. I'm not an Elm programmer myself, but after asking our AI for assistance, there should be a function in the Http package called Http.filePartWithHeaders that should allow Elm to provide a dedicated Content-Disposition header, adding the nowadays default UTF-8 encoding for the filename.
For reference, the AI suggested the following code as an example of providing a dedicated header:
encodeURIComponent : String -> String
encodeURIComponent str =
    String.join ""
        (List.map encodeChar (String.toList str))

encodeChar : Char -> String
encodeChar char =
    case char of
        'ä' ->
            "%C3%A4"

        _ ->
            String.fromChar char

fileParts : List File -> List (Http.Part msg)
fileParts files =
    List.map
        (\f ->
            let
                filename = "ä.txt"
                encodedFilename = encodeURIComponent filename
                contentDisposition = "form-data; name=\"file[]\"; filename*=UTF-8''" ++ encodedFilename
            in
            Http.filePartWithHeaders "file[]" f [ ( "Content-Disposition", contentDisposition ) ]
        )
        files
So somehow parts of this code example must be integrated to
I've currently no Elm/Scala dev environment setup so can't tinker around with the frontend code, but hope it helps to understand the encoding problems.
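The Elm snippet only handles a single character. As a rough illustration of what a general encoder would have to produce for the filename*=UTF-8'' attribute, here is a minimal Python sketch (Python only because it is already used above). The helper name content_disposition is hypothetical and not part of Docspell or any library; it relies on urllib.parse.quote from the standard library to percent-encode the UTF-8 bytes.

```python
from urllib.parse import quote


def content_disposition(field_name: str, filename: str) -> str:
    """Build a form-data Content-Disposition value with a filename*=UTF-8'' attribute.

    Hypothetical helper for illustration only.
    """
    # quote() percent-encodes the UTF-8 bytes of every character that is not
    # an ASCII letter, digit or one of "_.-~"; safe="" also encodes "/".
    encoded = quote(filename, safe="", encoding="utf-8")
    return 'form-data; name="' + field_name + "\"; filename*=UTF-8''" + encoded


print(content_disposition("file[]", "ä.txt"))
# form-data; name="file[]"; filename*=UTF-8''%C3%A4.txt
```

Whether the receiving server honours this attribute is a separate question; the sketch only shows the encoding step.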
PS: (available here).
@nekrondev sounds like a good solution. What I still wonder about: Why did it become an issue now? Especially since @eikek didn't change anything about the client. The http4s version used since right after the 0.41.0 release contains this PR, which seems to perfectly explain the behavior: https://github.com/http4s/http4s/pull/7419
Yea, that PR makes sense, and the maintainers are right that filename*= is forbidden by the newer HTML5 RFCs. That's why they convert the UTF-8 encoded filename= attribute back to the ISO-8859-1 default and allow manual transformation via http4s API methods if a specific conversion is needed. Fixing the Elm web UI by adding filename*= won't work, because it's no longer supported by the http4s framework. The burden will be on Docspell's backend to re-encode the ISO-8859-1 filename string back to UTF-8. The main problem I see here is that you get no reliable information from the browser about which encoding should be used. The legacy filename*= provided that information, but the HTML5 RFC mentioned in that http4s PR tells us that it's forbidden to use it.
I think the re-encoding back to UTF-8 needs to be done here, where the multipart filenames are processed.
The issue that I still see: We don't have any information on what the encoding originally was, right? Would we just blindly reinterpret it as UTF-8 in the hope that things at least don't get worse? The last section of https://datatracker.ietf.org/doc/html/rfc7578#section-4.2 suggests that this could be a reasonable approach. It also seems like that's what http4s' filename method defaults to. I'll send a PR.
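To make the "blindly reinterpret as UTF-8" idea concrete, here is a minimal Python sketch. It is not the code from the PR and not how http4s or Docspell implement it; the function name reinterpret_as_utf8 is made up for illustration. It assumes the filename arrived as a string that was decoded as ISO-8859-1, recovers the original bytes, and tries to read them as UTF-8, falling back to the unchanged name if that is not possible.

```python
def reinterpret_as_utf8(name: str) -> str:
    """Undo a 'UTF-8 bytes read as ISO-8859-1' decode; keep the input if that fails.

    Hypothetical helper for illustration only.
    """
    try:
        # Recover the raw bytes, then decode them as UTF-8.
        return name.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in ISO-8859-1, or the bytes are not valid UTF-8:
        # leave the name untouched so things at least don't get worse.
        return name


broken = "Vertragsübersicht.pdf".encode("utf-8").decode("iso-8859-1")
print(reinterpret_as_utf8(broken))       # Vertragsübersicht.pdf
print(reinterpret_as_utf8("plain.pdf"))  # plain.pdf
print(reinterpret_as_utf8("ü.pdf"))      # ü.pdf (a lone 0xFC byte is not valid UTF-8)
```

The fallback is what keeps the guess reasonably safe: a name that really was ISO-8859-1 and contains non-ASCII characters will usually not form valid UTF-8, so it passes through unchanged.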
@nekrondev @eikek #2853
I noticed this recently, but I'm not sure when exactly this started happening, possibly with the 0.42 update.
All German umlauts are affected by this, but I assume this might be a general UTF-8 encoding issue.
Here it is in the upload form before uploading:
Here is the document after uploading without modifying it:
It feels like a classic case of interpreting UTF-8 as ASCII/ISO 8859-1 on the byte level.