barbushin / php-imap

Manage mailboxes, filter/get/delete emails in PHP (supports IMAP/POP3/NNTP)
MIT License
1.66k stars 459 forks source link

UTF-8 characters getting problems #657

Closed JiriSch closed 2 years ago

JiriSch commented 2 years ago

Thank you for this library, I am using it for years, I was using up to version 3.1.0 but now I wanted to upgrade to 4.5 ang get problems with UTF-8 characters.

Email with UTF-8 characters - these characters are unreadable.

The used code:

    $mailbox = new PhpImap\Mailbox('{xxxxxx:143/imap4/notls/novalidate-cert}INBOX', 'xxxxx', 'xxxx');
    var_dump($mailbox->getServerEncoding()); //---GETTING UTF-8---
    $mailsIds = $mailbox->searchMailbox('ALL');
    if(!$mailsIds) return 0;
    foreach ($mailsIds as $key=>$mailsId)
    {
        $mail = $mailbox->getMail($mailsId);
        //var_dump($mail);

        var_dump($mail->textHtml);
    }

I tried to test any version newer than 3.1.0 but without success, only till 3.1.0 works.

Thank you.

JiriSch commented 2 years ago

I add some more debug info, textPlain downloaded over 3.1.0:

With interpunction: ěščřžýáíé Without: escrzyaie

The same email downloaded over any version newer, for ex 4.5.1:

With interpunction: �������� Without: escrzyaie

In this attachment is the raw email from IMAP server (downloaded from IMAP server by SCP), sended by MS Outlook, recieved by Postfix, IMAP use DOVECOT: EmailTest.txt

JiriSch commented 2 years ago

I have fixed this issue by changing:

foreach ($elements as $element) { $newString .= $this->convertToUtf8($element->text, $element->charset); }

to:

foreach ($elements as $element) { if ($element->charset!='default') $this->decodeMimeStrDefaultCharset = $element->charset; $newString .= $this->convertToUtf8($element->text, $element->charset); }

in Mailbox.php:1490

Not sure if this is the right solution, but it works for me...

JiriSch commented 2 years ago

Just to explain, it looks like

$elements = \imap_mime_header_decode($string);

returns in $elements->charset default in case it's not, so this handle this from another elements...

christianasche commented 2 years ago

I can confirm this bug.

Email parts with i.e. iso-8859-1 (Windows-1252) encoding are not utf-8, but iso-8859-1. The suggested patch works for me.

I looks like there has been a related merge and revert on Nov 16, 2020:

551 #558

More Info in this discussion: https://github.com/barbushin/php-imap/issues/525#issuecomment-690995876

I hope this helps.

christianasche commented 2 years ago

I tested it with quite a lot emails. Most iso-8859-1 emails are fixed with the above patch. However, I still found some iso-8859-1 emails, that get 'default' as charset from imap_mime_header_decode in decodeMimeStr.

Maybe it is better to fix it this way in DataPartInfo.php:111? At least this works for me with all tested emails.

    protected function convertEncodingAfterFetch(): string
    {
        if (isset($this->charset) && !empty(\trim($this->charset))) {
            $this->data = $this->mail->decodeMimeStr(
                (string) $this->data // Data to convert
            );

+            $this->data = $this->mail->convertToUtf8( 
+                $this->data, 
+                $this->charset 
+            );
+            $this->charset = 'utf-8';
        }

        return (null === $this->data) ? '' : $this->data;
    }
hrst commented 2 years ago

I tested it with quite a lot emails. Most iso-8859-1 emails are fixed with the above patch. However, I still found some iso-8859-1 emails, that get 'default' as charset from imap_mime_header_decode in decodeMimeStr.

Maybe it is better to fix it this way in DataPartInfo.php:111? At least this works for me with all tested emails.

I can confirm, this fixes encoding issues for me.

I think this library desperately needs unit tests or similar that load a set of test emails and then checks whether they encoded correctly. During the last versions, encoding issues that were already fixed reappeared multiple times.

Sebbo94BY commented 2 years ago

Thanks for your code improvements, testings and feedback.

I've created a merge request with your changes, but I need to check, if we can implement it like this and if we instead can / need to remove something else as it doesn't make any sense anymore then.

Feel free to comment and provide feedback on the above linked merge request. :)

Yes, I agree, we need some unit tests, but I'm unfortunately not that familiar with writing unit tests, so I'm not really sure, how we can properly test this. Feel free to create a merge request with some or let me know, how we can define such a good unit test, then I'll add it to the above merge request.

Edit: We already have some decoding tests: https://github.com/barbushin/php-imap/blob/master/tests/unit/MailboxTest.php#L696-L731

They are may just not working as expected or we need to test this somehow else? Or we need more test cases? 🤔

Sebbo94BY commented 2 years ago

I've just released a new version 4.5.3 with the above fix and some unit tests.