Webklex / php-imap

PHP-IMAP is a wrapper for common IMAP communication without the need to have the php-imap module installed / enabled. The protocol is completely integrated and therefore supports IMAP IDLE operation and the "new" oAuth authentication process as well.
https://www.php-imap.com
MIT License

$message->getTextBody() retrieves whole source, not just the plain text message #413

Open TonyMarston opened 1 year ago

TonyMarston commented 1 year ago

I have a message containing plain text, no html and no attachments, but when I use ->getTextBody() it returns the entire source code and not just the message text. The source code is as follows:

Return-Path:
Delivered-To: gmx@tonymarston.co.uk
Received: from ion.dnsprotect.com by ion.dnsprotect.com with LMTP id oPy8IzIke2Rr4gIAzEkvSQ (envelope-from ) for ; Sat, 03 Jun 2023 07:29:54 -0400
Return-path:
Envelope-to: gmx@tonymarston.net
Delivery-date: Sat, 03 Jun 2023 07:29:54 -0400
Received: from [::1] (port=48740 helo=ion.dnsprotect.com) by ion.dnsprotect.com with esmtpa (Exim 4.96) (envelope-from ) id 1q5PSF-000nPQ-1F for gmx@tonymarston.net; Sat, 03 Jun 2023 07:29:54 -0400
MIME-Version: 1.0
Date: Sat, 03 Jun 2023 07:29:54 -0400
From: radicore
To: gmx@tonymarston.net
Subject: Test Message
User-Agent: Roundcube Webmail/1.6.0
Message-ID:
X-Sender: radicore@radicore.org
Content-Type: text/plain; charset=US-ASCII; format=flowed
Content-Transfer-Encoding: 7bit
X-From-Rewrite: unmodified, already matched

This is just a test, so ignore it (if you can!)

Tony Marston

I expect it to return just the plain text message, not the entire email.

Webklex commented 1 year ago

Hi @TonyMarston,

Thanks a lot for reporting this issue. I really appreciate it! However, in order to help you out, it would be great if you could provide an anonymized version of the problematic message. Without that, it's quite tough for me to debug the issue accurately.

If you're using an older version of the library, I recommend updating to the latest version and giving it another shot. There's a chance that the problem might have already been fixed in the newer release.

Once again, thanks for taking the time and effort to make this library better! If you have any more questions or need further assistance, feel free to let me know.

Best regards and happy coding!

TonyMarston commented 1 year ago

Here is the email in question

d0ac7f3d4ffc4d3f01ab38e92fc001ed@radicore.org.zip

Webklex commented 1 year ago

Hi @TonyMarston, many thanks for the quick follow-up. Unfortunately, I'm unable to replicate the behavior (see the referenced commit above).

Best regards and happy coding,

TonyMarston commented 1 year ago

I am afraid that your unit test is not following the same path through the code as when I run it. I have stepped through the same message with my debugger several times and it is failing to extract the text message from the raw body in exactly the same place. This is the path through the code that I have observed:

query.php: $query->getMessageByMsgn()
query.php: $query->getMessage()
message.php: $message->__construct()
message.php: $message->parseBody()
message.php: $message->parseRawBody()
structure.php: $structure->parse()
structure.php: $structure->find_parts()

It is in the find_parts() method that the code is failing to separate the text message from the raw body. Does your unit test follow the same path through the code?

Webklex commented 1 year ago

Hi @TonyMarston , the test is pretty similar:

Which version are you currently using?

Best regards & happy coding,

TonyMarston commented 1 year ago

I am using 5.3. I see you have just released version 5.4. I shall install that and try again.

TonyMarston commented 1 year ago

I have just tried 5.4 with the same result. When it gets to Part::find_parts the contents of $this->header is not null, so it sets $body = $this->raw, which then becomes $this->content. It is Part::find_parts which is not extracting the test message out of the raw body.

Webklex commented 1 year ago

I updated the sample to make sure I didn't screw up the initial one, and added a live mailbox test. You could try enabling the debug mode inside your config. It's unlikely, but perhaps this brings some insight. Besides this, I'm out of ideas.

Out of curiosity:

Best regards,

TonyMarston commented 1 year ago

I am using Windows 10 on my local PC; I am not running on a remote host. My PHP version is 8.2.7.

I have stepped through with my debugger again and I see that the problem lies in the Structure class. The constructor calls Structure::parse, which in turn calls Structure::find_parts, but this returns only a single part containing the raw body, as it cannot separate the body text from the raw body. This is because the raw body contains only a single Content-Type, which is "text/plain; charset=US-ASCII; format=flowed". Notice that there is no 'multipart', and as there is no boundary the code cannot use one to extract the message text from the raw body, so it uses the whole of the raw body, which includes the header.
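For readers following the diagnosis: in a non-multipart message there is no boundary, but the header block still ends at the first empty line, so the text can be separated there. The snippet below is only a minimal sketch of that idea, not the library's code. The function name splitHeaderAndBody() is hypothetical, and treating input without an empty line as "all header" is an assumption made for the example.

```php
<?php

/**
 * Hypothetical helper, not part of Webklex/php-imap.
 * Splits a raw message into its header block and its body at the
 * first empty line.
 *
 * @return array{0: string, 1: string} [headers, body]
 */
function splitHeaderAndBody(string $raw): array
{
    // Accept both CRLF and bare LF line endings.
    $parts = preg_split('/\r?\n\r?\n/', $raw, 2);

    if ($parts === false || count($parts) < 2) {
        // Assumption: with no empty line, everything is header.
        return [$raw, ''];
    }

    return [$parts[0], $parts[1]];
}

// Shortened version of the message from this issue.
$raw = "Subject: Test Message\r\n"
     . "Content-Type: text/plain; charset=US-ASCII; format=flowed\r\n"
     . "\r\n"
     . "This is just a test, so ignore it (if you can!)\r\n";

[$headers, $body] = splitHeaderAndBody($raw);
echo $body;
```

If the raw body handed to the parser really does still contain the header, a split like this would leave only the message text in $body.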

The script I use to call your library is attached. scan_email_inbox(batch).zip

Webklex commented 1 year ago

I see, thanks for the code:

If you try the following:

$folder->query()->all()->chunked(function($messages, $page) {
    foreach ($messages as $message) {
        /** @var Message $message */
        var_dump([
            'uid' => $message->uid,
            'subject' => $message->subject,
            'text' => $message->getTextBody()
        ]);
    }
}, 10, 1);

..does this change anything?

On a side note: you can use (string)$message->subject instead of $message->subject->get(), or just treat any message attribute as a string / array. Both are supported :)
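To illustrate why a cast and get() can both work on the same object, here is a small sketch. DemoAttribute is a hypothetical stand-in written for this example; the library's real attribute class may be implemented differently.

```php
<?php

// Hypothetical stand-in for a message attribute. It is NOT the library's
// class; it only shows how get() and a string cast can coexist.
final class DemoAttribute
{
    /** @param string[] $values */
    public function __construct(private array $values) {}

    /** Returns the first stored value, or null if there is none. */
    public function get(): ?string
    {
        return $this->values[0] ?? null;
    }

    /** Lets (string) casts and string interpolation work on the object. */
    public function __toString(): string
    {
        return implode(', ', $this->values);
    }
}

$subject = new DemoAttribute(['Test Message']);

$viaGet  = $subject->get();     // explicit accessor
$viaCast = (string) $subject;   // implicit, via __toString()
```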

Unfortunately, I can't run tests on Windows, but I have tested it with PHP 8.2.7 as well.

Best regards & happy coding,

TonyMarston commented 1 year ago

I have tried changing IMAP::ST_UID to IMAP::ST_MSGN but it makes no difference. I have tried switching from Query::getMessageByMsgn() to Query::getMessageByUid() but it makes no difference. I have enabled debug mode but I cannot see any output. I have tried inserting the code you suggested, but getTextBody() still returns the entire raw body and not just the body text.

I can only repeat what I said in an earlier post: I have stepped through with my debugger again and I see that the problem lies in the Structure class. The constructor calls Structure::parse, which in turn calls Structure::find_parts, but this returns only a single part containing the raw body, as it cannot separate the body text from the raw body. This is because the raw body contains only a single Content-Type, which is "text/plain; charset=US-ASCII; format=flowed". Notice that there is no 'multipart', and as there is no boundary the code cannot use one to extract the message text from the raw body, so it uses the whole of the raw body, which includes the header.

In this particular email the code is incapable of separating out the text body from the raw body as it cannot identify a usable boundary.

TonyMarston commented 1 year ago

I have searched through your code and cannot find anywhere where it extracts text which starts with 'Content-Type: text/plain' and which, because it is not 'multipart', does not have a boundary. I have fixed this myself by amending the contents of the findParts() method inside file structure.php (see attached zip file): Structure.zip
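The contents of the attached fix are not shown in this thread. As a rough illustration of the distinction being described, the sketch below checks whether a header block declares a multipart boundary at all. The function name findMultipartBoundary() and the sample boundary value are hypothetical, and this is not the library's implementation.

```php
<?php

/**
 * Hypothetical helper (not taken from the library): returns the boundary
 * declared by a multipart Content-Type header, or null when the header
 * block does not describe a multipart message.
 */
function findMultipartBoundary(string $headers): ?string
{
    // Unfold continuation lines so a folded header reads as one line.
    $unfolded = preg_replace('/\r?\n[ \t]+/', ' ', $headers) ?? $headers;

    $pattern = '/^Content-Type:[ \t]*multipart\/[^;\r\n]+;[^\r\n]*?\bboundary="?([^";\r\n]+)"?/mi';

    return preg_match($pattern, $unfolded, $m) === 1 ? trim($m[1]) : null;
}

// Header lines as in the message from this issue: no multipart, no boundary.
$plain = "Content-Type: text/plain; charset=US-ASCII; format=flowed\r\n"
       . "Content-Transfer-Encoding: 7bit";

// Made-up multipart header for comparison.
$mixed = "MIME-Version: 1.0\r\n"
       . "Content-Type: multipart/alternative;\r\n"
       . " boundary=\"example-boundary\"";

var_dump(findMultipartBoundary($plain));
var_dump(findMultipartBoundary($mixed));
```

When the result is null, there is no delimiter to split on, so a parser would have to treat the content after the header block as a single part.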