Sicos1977 / MSGReader

C# Outlook MSG file reader without the need for Outlook
http://sicos1977.github.io/MSGReader
MIT License

Extraneous characters inserted after <img> with external src in BodyHtml #320

Status: Closed (nylu closed this issue 1 year ago)

nylu commented 1 year ago

Describe the bug

Since version 4.4.11, inline images with an external source are followed by extraneous characters in BodyHtml. I checked whether the updated OpenMcdf dependency is the cause, but it is not.

To Reproduce

Steps to reproduce the behavior:

  1. Have a .msg e-mail containing HTML with an image pointing to a web source, e.g. <img src="https://github.githubassets.com/images/modules/site/home-campaign/astrocat.png"/>
  2. Open this e-mail with MsgReader
  3. Inspect the BodyHtml and see the extraneous characters somewhere after the <img> tag
  4. Save the BodyHtml to disk
  5. Open this file in Chrome and see the extraneous characters

Screenshots

[Image: defect]

Expected behavior

No extraneous characters after the image.

[Image: expected]

Additional context

.NET 6.0.3 console application

Sicos1977 commented 1 year ago

Seems that part of the RTF is leaking into the HTML

Sicos1977 commented 1 year ago

I still have one day of work left before a 3-week vacation starts; I will then try to solve all the open issues.

nylu commented 1 year ago

> Seems that part of the RTF is leaking into the HTML

Just a question to understand this assumption: I thought the .msg file contains the full HTML body as plain text; you can see it when you view the .msg in a text editor. Do you not use this HTML, and instead always convert the RTF to HTML? Otherwise, how could the RTF leak into the HTML?

> I still have one day of work left before a 3-week vacation starts; I will then try to solve all the open issues.

Thank you for your voluntary work on this topic!

Sicos1977 commented 1 year ago

There is almost never pure HTML inside an MSG file. Microsoft has a very special way of putting the HTML inside the file: it is encoded into RTF (probably to be backwards compatible with very old Exchange systems). So MSGReader needs to parse the HTML out of the RTF.

https://learn.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-oxrtfex/83b224b4-6876-4281-a355-f3ffb42e42e8?redirectedfrom=MSDN
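
To illustrate how RTF can end up in the HTML, here is a minimal sketch of the de-encapsulation idea described in the linked specification. It is written in Python purely for illustration and is not MSGReader's C# implementation. The control words `\fromhtml`, `\*\htmltag` and `\htmlrtf` come from the specification; the function name `deencapsulate_html`, the `SKIPPED_DESTINATIONS` set and the sample string are hypothetical, and the parser is heavily simplified (single-byte code page only, `\htmlrtf` treated as a global flag, no Unicode escapes).

```python
import re

# Simplified RTF tokenizer: hex escapes, control words, control symbols,
# group braces, and plain text runs.
TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"        # hex-escaped byte, e.g. \'e9
    r"|\\([a-zA-Z]+)(-?\d+)? ?"   # control word, optional number, optional delimiter space
    r"|\\([^a-zA-Z])"             # control symbol, e.g. \{ \} \\ \*
    r"|([{}])"                    # group delimiters
    r"|([^\\{}]+)"                # plain text run
)

# Destinations whose text must never reach the HTML (assumed, non-exhaustive list).
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}


def deencapsulate_html(rtf: str, encoding: str = "cp1252") -> str:
    """Recover HTML from RTF-encapsulated HTML (simplified sketch)."""
    if "\\fromhtml" not in rtf:
        raise ValueError(r"not RTF-encapsulated HTML (no \fromhtml control word)")

    out = []
    stack = []          # saved 'skip' flag for each open group
    skip = False        # inside a destination that is ignored
    in_htmlrtf = False  # between \htmlrtf and \htmlrtf0: RTF-only content
    star = False        # previous token was \* (ignorable destination marker)

    for m in TOKEN.finditer(rtf):
        hexbyte, word, param, symbol, brace, text = m.groups()
        emitting = not skip and not in_htmlrtf

        if brace == "{":
            stack.append(skip)
        elif brace == "}":
            skip = stack.pop() if stack else False
        elif symbol == "*":
            star = True
            continue  # keep the marker for the next control word
        elif word == "htmlrtf":
            in_htmlrtf = param != "0"
        elif word is not None:
            if (star and word != "htmltag") or word in SKIPPED_DESTINATIONS:
                skip = True  # ignore everything up to the end of this group
            elif emitting and word in ("par", "line"):
                out.append("\r\n")
            elif emitting and word == "tab":
                out.append("\t")
        elif hexbyte is not None:
            if emitting:
                out.append(bytes.fromhex(hexbyte).decode(encoding, errors="replace"))
        elif symbol is not None:
            if emitting and symbol in "{}\\":
                out.append(symbol)  # escaped literal character
        elif text is not None:
            if emitting:
                # Raw line breaks in RTF source carry no meaning.
                out.append(text.replace("\r", "").replace("\n", ""))
        star = False

    return "".join(out)


# Hypothetical sample, not taken from a real .msg file.
sample = (
    r"{\rtf1\ansi\ansicpg1252\fromhtml1 \deff0"
    r"{\fonttbl{\f0\fswiss Arial;}}"
    r"{\*\htmltag64 <p>}\htmlrtf \f0 \htmlrtf0 Caf\'e9 "
    r'{\*\htmltag84 <img src="https://example.com/cat.png">}'
    r"\htmlrtf {\pict placeholder}\htmlrtf0 "
    r"{\*\htmltag72 </p>}}"
)

print(deencapsulate_html(sample))
```

In this sketch, the text inside the `\htmlrtf ... \htmlrtf0` span right after the `<img>` tag is exactly the kind of RTF-only content that would show up as extraneous characters if the suppression step were missed.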

Sicos1977 commented 1 year ago

This should be solved in the latest NuGet version.