Decoding the subject with iconv_mime_decode()

freescout-helpdesk commented 1 year ago

Describe the bug

Currently Webklex/php-imap by default uses 'utf-8' decoder and \imap_utf8() function to decode subject:https://github.com/Webklex/php-imap/blob/abf080eb745e21df7a3070ca356248dda021d4ad/src/Header.php#L424

This approach does not always decode subjects properly.

For example Subject: =?ISO-2022-JP?B?GyRCIXlCaBsoQjEzMhskQjlmISEhViUsITwlRyVzGyhCJhskQiUoJS8lOSVGJWolIiFXQGxMZ0U5JE4kPyRhJE4jURsoQiYbJEIjQSU1JW0lcyEhIVo3bjQpJSglLyU5JUYlaiUiISYlbyE8JS8hWxsoQg==?= will be decoded into garbled string:

$B!yBh(B132$B9f!!!V%,!<%G%s(B&$B%(%/%9%F%j%"!W@lLgE9$N$?$a$N#Q(B&$B#A%5%m%s!!!Z7n4)%(%/%9%F%j%"!&%o!<%/![(B

So it does not make much sense to use by default the approach which is not 100% reliable.

Some time ago we've switched to decoding the subject simply with iconv_mime_decode():

$value = iconv_mime_decode($value, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, "UTF-8");

It works always and properly - https://github.com/freescout-helpdesk/freescout/issues/2965#issuecomment-1534134241.

So maybe it worth making iconv_mime_decode() the default and the only approach for decoding the subject like this: https://github.com/freescout-helpdesk/freescout/blob/master/overrides/webklex/php-imap/src/Header.php#L208

Function: https://github.com/freescout-helpdesk/freescout/blob/dist/app/Misc/Mail.php#L909 (decodeSubject())

Webklex commented 1 year ago

Hi @freescout-helpdesk , many thanks for your report and suggestion. I improved the decoder and added a new test case for the mentioned subject above.

Best regards & happy coding,

nuernbergerA commented 1 year ago

@Webklex Attachment::decodeName() has the same problem

Webklex commented 1 year ago

Hi @nuernbergerA , thanks for your report. I just pushed a new test including your mentioned scenario but can't reproduce the described issue. Are you using the latest release? If not, please update and try again. There's a chance that the problem might have already been fixed in a newer release.

If the issue persists, please head over to https://github.com/Webklex/php-imap/issues/new?assignees=&labels=&projects=&template=bug_report.md and provide as much information as possible. Every little bit, even if it might seem unrelated could help to solve the mystery :)

Best regards & happy coding,

nuernbergerA commented 1 year ago

@Webklex i've opened a PR with a failing test for my issue ;)

freescout-helpdesk commented 1 year ago

@Webklex i've opened a PR with a failing test for my issue ;)

We've checked. This our function https://github.com/freescout-helpdesk/freescout/blob/dist/app/Misc/Mail.php#L883C8-L883C8 also successfully works for decoding headers and attachment names including your example:

=?iso-8859-1?Q?2021=5FM=E4ngelliste=5F0819306.xlsx?= >> 2021_Mängelliste_0819306.xlsx

Webklex commented 1 year ago

@nuernbergerA thanks for your input and your pull request. Please update to v5.5 and give it another try :)

@freescout-helpdesk how are you decoding attachment names / filenames? If you are relying on Attachment::decodeName(), please note that this if statement was reversed: https://github.com/Webklex/php-imap/blob/5.5.0/src/Attachment.php#L296-L300 The fallback was used as default, which works often but can cause unexpected behavior (https://github.com/Webklex/php-imap/pull/421) in some cases :/

Best regards & happy coding,

freescout-helpdesk commented 1 year ago

@freescout-helpdesk how are you decoding attachment names / filenames?

We are using the same our MailHelper::decodeSubject() function to decode attachment names - it works fine.

jackwiy commented 2 months ago

@Webklex

$personal = '=?gb2312?B?z+O42yCDfN14u80=?='; imap_mime_header_decode() The result obtained by this method：�� |�x��

public function testUTF8Content()
    {
        $personal = '=?gb2312?B?z+O42yCDfN14u80=?=';
        $personal1 = imap_mime_header_decode($personal);

        echo '<pre>';
        print_r(['personal1'=>$personal1]);

        $header = new Header('');
        $personalParts = $header->mime_header_decode($personal);
        $personal2 = '';
        foreach ($personalParts as $p) {
            print_r(['pText'=>$p->text]);
            $personal2 .= $header->convertEncoding($p->text, $header->getEncoding($p));
        }

        print_r(['personal2'=>$personal2]);
    }

jackwiy commented 2 months ago

public function testUTF8Content()
    {
        $personal = '=?gb2312?B?z+O42yCDfN14u80=?=';

        $personal2 = imap_utf8($personal);
        print_r(['personal2'=>$personal2]);

        $header = new Header('');
        $personalParts = $header->mime_header_decode($personal);
        $personal3 = '';
        foreach ($personalParts as $p) {
            print_r(['pText'=>$p->text]);
            //$personal3 .= $header->convertEncoding($p->text, $header->getEncoding($p));
            $personal3 .= iconv($header->getEncoding($p), 'UTF-8//IGNORE', $p->text);
        }

        print_r(['personal3'=>$personal3]);
    }

thin-k-design commented 2 weeks ago

I have a related problem. I received an email with an attachment, whose filename is "iso-8859-1''Elk%FCld%F6tt%20felsz%F3l%EDt%E1s%20ELK-F-24011370.pdf"

The filename gets decoded as Elk�ld�tt felsz�l�t�s ELK-F-24011370.pdf

After url decoding the string it should be converted to UTF-8:

            if (str_contains($name, "''")) {
                $parts = explode("''", $name);
                if (EncodingAliases::has($parts[0])) {
                    $name = implode("''", array_slice($parts, 1));
                }
            }
            [...]
            // check if $name is url encoded
            if (preg_match('/%[0-9A-F]{2}/i', $name)) {
                $name = urldecode($name);
            }

It should be something like:

            if (str_contains($name, "''")) {
                $parts = explode("''", $name);
                if (EncodingAliases::has($parts[0])) {
                    $charset = $parts[0];
                    $name = implode("''", array_slice($parts, 1));
                }
            }
            [...]
            // check if $name is url encoded
            if (preg_match('/%[0-9A-F]{2}/i', $name)) {
                $name = urldecode($name);
                if (isset($charset)) {
                    $name = mb_convert_encoding($name, 'UTF-8', $charset);
                }
            }

I am not sure of the implications, so I haven't done a pull request yet.

thin-k-design commented 2 weeks ago

Actually it seems that the first part of the code is not equivalent to the RFC 2231 Specification, where a string can be represented like charset'language'encoded-text, in my case iso-8859-1''Elk%FCld%F6tt%20felsz%F3l%EDt%E1s%20ELK-F-24011370.pdf which translates in:

charset: iso-8859-1
language: empty
encoded-text: Elk%FCld%F6tt%20felsz%F3l%EDt%E1s%20ELK-F-24011370.pdf

So I think, this:

           if (str_contains($name, "''")) {
               $parts = explode("''", $name);
               if (EncodingAliases::has($parts[0])) {
                   $name = implode("''", array_slice($parts, 1));
               }
           }

may be replaced with:

            $parts = explode("'", $name, 3);
            if (EncodingAliases::has($parts[0])) {
                list($charset, $language, $name) = $parts;
            }

thin-k-design commented 2 weeks ago

My pull request did not pass the following test: AttachmentLongFilenameTest.php self::assertEquals("f7b5181985862431bfc443d26e3af2371e20a0afd676eeb9b9595a26d42e0b73", hash("sha256", $attachment->filename));

The assertion in the test assumes that the output filename of iso-8859-15''%30%31%5F%41%A4%E0%E4%3F%3F%3F%3F%40%5A%2D%30%31%32%33%34%35%36%37%38%39%2D%71%77%65%72%74%79%75%69%6F%70%61%73%64%66%67%68%6A%6B%6C%7A%78%63%76%62%6E%6D%6F%70%71%72%73%74%75%76%7A%2D%30%31%32%33%34%35%36%37%38%39%2D%71%77%65%72%74%79%75%69%6F%70%61%73%64%66%67%68%6A%6B%6C%7A%78%63%76%62%6E%6D%6F%70%71%72%73%74%75%76%7A%2D%30%31%32%33%34%35%36%37%38%39%2D%71%77%65%72%74%79%75%69%6F%70%61%73%64%66%67%68%6A%6B%6C%7A%78%63%76%62%6E%6D%6F%70%71%72%73%74%75%76%7A%2E%74%78%74 is in its original encoding and not in utf-8.

Maybe I am missing something, but should this be the intended behaviour (leaving the original encoding as is), how can I get the encoding of the filename in the Attachment class, so I can convert it myself if needed?

Webklex / php-imap

Decoding the subject with iconv_mime_decode() #410