andyedinborough / aenetmail

C# POP/IMAP Mail Client
Charset in imap doesn't work correctly #48

Open paolosanchi opened 12 years ago

paolosanchi commented 12 years ago

I think something should be done in the internal void SetBody(string value) method, because the value assigned to Body contains wrong characters: accented Latin characters like 'à' and 'ò' are converted to '?'.

paolosanchi commented 12 years ago

OK, I analyzed the problem and studied a solution (this article explains how charsets and text encoding work: http://www.joelonsoftware.com/articles/Unicode.html). The general problem is that the email is decoded from the stream as if it were always encoded in UTF-8. See ImapClient.cs, line 422:

    while (remaining > 0) {
      read = _Stream.Read(buffer, 0, Math.Min(remaining, buffer.Length));
      body.Append(System.Text.Encoding.UTF8.GetString(buffer, 0, read));
      remaining -= read;
    }

This should be true for the headers, but the email content is encoded using the charset specified in the Content-Type header of the email, like this: Content-Type: text/html; charset=UTF-8

That's not all, because the content could be of this type: Content-Type: multipart/alternative; which means that the body can have different representations, such as text/plain or text/html, and each part can use a different encoding, such as ISO-8859-1:

Content-Type: text/plain; charset="ISO-8859-1"

The real problem is that if we decode content encoded in ISO-8859-1 using the UTF-8 decoder, we lose information: if the body contains culture-specific characters (like òèàùàè), the decoder turns them into '?'.
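To make the information loss concrete, here is a minimal sketch that decodes the same ISO-8859-1 bytes twice, once with each charset. The class and method names (CharsetLossDemo, DecodeAs, SampleLatin1Bytes) are made up for this illustration and are not part of aenetmail.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of aenetmail.
public static class CharsetLossDemo
{
    // Decodes the given bytes using the charset with the given name.
    public static string DecodeAs(byte[] bytes, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(bytes);
    }

    // The word "perch\u00E9" as ISO-8859-1 bytes: the accented
    // letter is the single byte 0xE9, which is not valid UTF-8 on its own.
    public static byte[] SampleLatin1Bytes()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetBytes("perch\u00E9");
    }
}
```

Calling DecodeAs(SampleLatin1Bytes(), "ISO-8859-1") gives back the original word, while DecodeAs(SampleLatin1Bytes(), "UTF-8") replaces the accented letter with the Unicode replacement character U+FFFD, which many consoles and fonts show as '?'. Once that substitution has happened the original byte cannot be recovered from the string.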

Storing the RawBody as a string is not bad in itself, since C# strings use 16 bits per char (they are Unicode), but just before the mail.Load(body.ToString(), headersonly); call in the GetMessages() method we have to use the right decoder for each part, so that no character comes out wrong.

At this point there is another problem: the implicit operator that casts to a MailMessage does not take the encoding into account at all. The Attachment.GetData() method is wrong, and attachment.ContentType is wrong too, because they ignore the original encoding of the various parts.

For my own purposes I found a working solution (a workaround). It was simple because UTF-8 has the characters of my language.

I hope that these considerations may help someone find a smarter solution, because unfortunately I do not have time to do it, now.

piher commented 12 years ago

So you're saying you found a way to handle accented characters?

meehi commented 12 years ago

reporcello, you are right.

this line is wrong: body.Append(System.Text.Encoding.UTF8.GetString(buffer, 0, read));

it should look something like this: body.Append(System.Text.Encoding.GetEncoding(charset).GetString(buffer, 0, read));

charset is a string variable and should take its value from the body's Content-Type. I have tested this and compiled the component again, and now Latin1 characters (such as accented letters) look fine.
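One possible way to obtain that charset value, as a rough sketch: pull the charset parameter out of a Content-Type header value with a regular expression and fall back to a default when it is missing or unknown. ContentTypeCharset and ResolveEncoding are hypothetical names for this example, not existing aenetmail members, and the regex is a simplification of real MIME parameter syntax.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of aenetmail.
public static class ContentTypeCharset
{
    // Matches charset=value or charset="value" (simplified, case-insensitive).
    static readonly Regex CharsetPattern =
        new Regex(@"charset\s*=\s*""?([^"";\s]+)""?", RegexOptions.IgnoreCase);

    // Returns the encoding named in the Content-Type value,
    // or the fallback if no usable charset is present.
    public static Encoding ResolveEncoding(string contentType, Encoding fallback)
    {
        Match m = CharsetPattern.Match(contentType ?? string.Empty);
        if (!m.Success) return fallback;
        try
        {
            return Encoding.GetEncoding(m.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // Charset name not known to this runtime: keep the fallback.
            return fallback;
        }
    }
}
```

For example, ResolveEncoding("text/plain; charset=\"ISO-8859-1\"", Encoding.UTF8) returns the ISO-8859-1 encoding, which could then be used in place of the hard-coded Encoding.UTF8. Note that decoding chunk by chunk with GetString can still split a multi-byte character across two buffer reads, so collecting all bytes first and decoding once is safer.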

nakhli commented 12 years ago

This seems to be related to closed issue 49. Do you still have this problem with the latest version?

meehi commented 12 years ago

I still have this problem with the latest version. Issue #49 does not fix it. In my previous comment I added sample code showing how it should work. You might want to check it out.

meehi commented 12 years ago

And I think I have duplicated the problem here: #54

paolosanchi commented 12 years ago

I made some changes in my local version that solved the problem for Western European languages, because utf-8 is compatible with that. My solution is pretty brutal: I read the email twice, the first time just to search for the string ISO-8859-1; if I find it I use the utf-8 decoder, otherwise I use ISO-8859-1 (from pages). The email shouldn't be read using just one decoder; we should be able to switch to the proper one when we find the "Content-Type:" label. Let me know.

meehi commented 12 years ago

reporcello: I use the same approach as you do, but with a little tuning. I don't hard-code the code page; instead I search for it in the body and use it dynamically. Here you can find the complete solution that I use locally: https://github.com/andyedinborough/aenetmail/issues/54

piher commented 12 years ago

Maybe we could start by reading the bytes as ASCII; then, when we encounter a "=?something?" or a "charset=" (or any other header specifying an encoding), we switch to the specified encoding and read the bytes. We could do some sort of byte matching, since we know the bytes representing the end of line in headers and the bytes representing "charset=".
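A rough sketch of that idea, under strong simplifying assumptions: the whole message is already in a byte array, it is a single-part message (no multipart boundaries, no transfer encoding such as quoted-printable or base64), and the first "charset=" found in the headers applies to the body. ByteLevelDecode and DecodeBody are hypothetical names for this example only.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of aenetmail.
public static class ByteLevelDecode
{
    // Returns the index of the first CR LF CR LF (end of headers), or -1.
    static int FindHeaderEnd(byte[] data)
    {
        for (int i = 0; i + 3 < data.Length; i++)
        {
            if (data[i] == 13 && data[i + 1] == 10 &&
                data[i + 2] == 13 && data[i + 3] == 10)
                return i;
        }
        return -1;
    }

    // Decodes headers as ASCII, then decodes the body bytes
    // with the charset named in the headers.
    public static string DecodeBody(byte[] message)
    {
        int end = FindHeaderEnd(message);
        if (end < 0) return Encoding.ASCII.GetString(message);

        string headers = Encoding.ASCII.GetString(message, 0, end);
        Encoding bodyEncoding = Encoding.UTF8; // assumed default for this sketch

        int pos = headers.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if (pos >= 0)
        {
            string name = headers.Substring(pos + "charset=".Length)
                                 .Split(';', '\r', '\n', ' ')[0]
                                 .Trim('"');
            try { bodyEncoding = Encoding.GetEncoding(name); }
            catch (ArgumentException) { /* unknown charset: keep the default */ }
        }

        int bodyStart = end + 4; // skip the CR LF CR LF separator
        return bodyEncoding.GetString(message, bodyStart, message.Length - bodyStart);
    }
}
```

The point of the sketch is only that the body bytes are never passed through a decoder until the charset is known. Real messages with multiple parts would need this decision made per part.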

meehi commented 12 years ago

This is a working solution: https://github.com/andyedinborough/aenetmail/issues/54#issuecomment-4591205

I have tested on many Latin1 and UTF8 character encoded mails and it has decoded all of them without problem.

It needs further testing and some adjustment.

jstedfast commented 10 years ago

The only way to truly solve issues like this is to write a parser that doesn't require the message data to be converted into a unicode string first. In other words, the MIME parser needs to parse byte arrays.

See MimeKit for an example of a MIME parser that does this.