eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with miminal effort.
https://docspell.org
GNU Affero General Public License v3.0
1.66k stars 128 forks source link

Broken Text-Encoding in Filenames after Upload #2825

Open pschichtel opened 1 month ago

pschichtel commented 1 month ago

I noticed this recently, but not sure when exactly this started happening, possibly with the 0.42 update.

Any german Umlaute are affected by this, but I assume this might be a general UTF-8 encoding issue.

Here it is in the upload form before uploading: image

Here is the document after uploading without modifying it: image

I feels like a classic case of interpreting UTF-8 as ASCII/ISO 8859-1 on the byte level.

pschichtel commented 1 month ago

image

The POST request uploading the file has the broken name already.

pschichtel commented 1 month ago

I haven't seen an obvious commit that might have broken this.

eikek commented 1 month ago

Oh, that is very strange! I didn't change anything on the client for some time. I'll need to reproduce it here. What browser are you using?

pschichtel commented 1 month ago

I did this on Firefox, but I can also test with chromium.

pschichtel commented 1 month ago

Here is an example file that reproduces this for me in Firefox and Chromium: äöüÄÖÜß.pdf

pschichtel commented 1 month ago

image

It's interesting to see Firefox (left) and Chromium (right) display the characters differently.

nekrondev commented 3 weeks ago

2842 might be related to this (I can approve this behavior with Firefox and 0.42.0, too).

nekrondev commented 3 weeks ago

So it seems that

Here is what will happen doing the same wrong conversion using Python:

>>> bytes("Vertragsübersicht".encode("utf-8")).decode("iso-8859-1")
'Vertragsübersicht'

For reference AI suggested the following code as an example to provide a dedicated header:

encodeURIComponent : String -> String
encodeURIComponent str =
    String.join ""
        (List.map encodeChar (String.toList str))

encodeChar : Char -> String
encodeChar char =
    case char of
        'ä' ->
            "%C3%A4"

        _ ->
            String.fromChar char

fileParts : List File -> List (Http.Part msg)
fileParts files =
    List.map (\f -> 
        let
            filename = "ä.txt"
            encodedFilename = encodeURIComponent filename
            contentDisposition = "form-data; name=\"file[]\"; filename*=UTF-8''" ++ encodedFilename
        in
        Http.filePartWithHeaders "file[]" f [ ( "Content-Disposition", contentDisposition ) ]
    ) files

So somehow parts of this code example must be integrated to

https://github.com/eikek/docspell/blob/fc3629e3b7efd3e68bf29c9804bac03dfeaab53a/modules/webapp/src/main/elm/Api.elm#L1075C9-L1076C60

I've currently no Elm/Scala dev environment setup so can't tinker around with the frontend code, but hope it helps to understand the encoding problems.

PS:

grafik (gibt's hier).

pschichtel commented 3 weeks ago

@nekrondev sounds like a good solution. What I still wonder about: Why did it a become an issue now? Especially since @eikek didn't change anything about the client. The http4s version used since right after the the 0.41.0 release contains this PR, which seems to perfectly explain the behavior: https://github.com/http4s/http4s/pull/7419

nekrondev commented 3 weeks ago

Yea, that PR makes sense and the maintainers are right that filename*= is forbidden for newer HTML5 RFCs. That's why they convert the UTF-8 encoded filename= attribute back to ISO-8859-1 default and allow manual transformation by http4s API methods if a specific conversation is needed. Fixing the ELM web UI by adding filename*= won't work, because it's no longer supported by http4s framework. The burden it takes will be on Dospells backend to re-encode the ISO8859-1 filename string back to UTF-8. The main problem I see here is that you get no reliable information from the browser which encoding should be used. The legacy filename*= provided that information, but the HTML5 RFC mentioned in that https4 PR tells us otherwise that it's forbidden to do so.

The back to UTF-8 encoding I think needs to be fixed here where the multipart filenames are processed.

pschichtel commented 2 weeks ago

The issue that I still see: We don't have any information on what the encoding originally was, right? Would we just blindly reinterpret it as UTF-8 in the hope that things at least don't get worse? The last section of https://datatracker.ietf.org/doc/html/rfc7578#section-4.2 suggest that could be a reasonable approach. It also seems like that's what http4s' filename method defaults to. I'll send a PR.

pschichtel commented 2 weeks ago

@nekrondev @eikek #2853