dinhvh / libetpan

Mail Framework for C Language
www.etpan.org

mailmime_encoded_phrase_parse could support additional wrong encoding #173

Closed 77tb closed 9 years ago

77tb commented 9 years ago

To: "=?UTF-8?B?5ZOO5ZGA5oiR5Y676IGU57O75Lq65ZCN5a2X6L+Y6IO96L+Z5LmI6ZW/5ZGi?= =?UTF-8?B?5ZOO5ZGA5oiR5Y676IGU57O75Lq65ZCN5a2X6L+Y6IO96L+Z5LmI6ZW/5ZGi?=" 260919069@qq.com

dinhvh commented 9 years ago

It parses to that result for me: 哎呀我去联系人名字还能这么长呢哎呀我去联系人名字还能这么长呢" 260919069@qq.com

Is it correct?

yimingtang commented 9 years ago

@dinhviethoa :100:

dinhvh commented 9 years ago

Therefore, it's already fixed.

77tb commented 9 years ago

libetpan v1.6 has not fixed this issue.

Here is the test data (\r, \n, and \t are shown explicitly):

To: \r\n\t"=?UTF-8?B?5ZOO5ZGA5oiR5Y676IGU57O75Lq65ZCN5a2X6L+Y6IO96L+Z5LmI6ZW/5ZGi?= =?UTF-8?B?5ZOO5ZGA5oiR5Y676IGU57O75Lq65ZCN5a2X6L+Y6IO96L+Z5LmI6ZW/5ZGi?="\r\n

dinhvh commented 9 years ago

Could you show the complete headers instead? Thanks.

77tb commented 9 years ago

Here is the whole message

Received: from m13-14.163.com (unknown [220.181.13.14])
    by newmx31.qq.com (NewMx) with SMTP id 
    for <260919069@qq.com>; Fri, 14 Nov 2014 09:23:29 +0800
X-QQ-FEAT: N4pdkPNmNYfWkSDVicpofM700uUqoAb56P/Tpx3TB8A=
X-QQ-MAILINFO: ODvMQTwIxEi3mvh/zEzWeVm+NhiJRbUpSit7Ss4nCkyE5RWSMTyy5ghjp
    7QpwHryTzkDlf/JCoayaEC7PlusAzS+LT0BFO0/m2aP2Tdro2XOsDPq0xZG3Hc=
X-QQ-SSF: 0031000000000011016000010200601
X-QQ-mid: mx31t1415928209tmcfswu7v
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=163.com;
    s=s110527; h=Date:From:Subject:MIME-Version:Message-ID; bh=b6/f8
    fuX3tAZGapV19ag1zeQFe685+uVfo/ARWqFS8A=; b=FlmrRRnrNh7l/cWqOid1I
    269jo57XJXwyf+45LcWNiUDs7pGkkSOs1F3hykiYGuUtRN5UGJDKAsZwBRfMqSlT
    +IlYVfgKP2nLR6CjIdDgGrj4LnU0oJ23jbZpSfsrjiyd+jd8RlfNWpMRgo4FMYeV
    9doIx6HosFLPknPWIYVgEo=
Received: from lxtestfor1$163.com ( [223.252.194.101, 123.58.177.191] ) by
 ajax-webmail-wmsvr14 (Coremail) ; Fri, 14 Nov 2014 09:23:27 +0800 (CST)
X-Originating-IP: [223.252.194.101, 123.58.177.191]
Date: Fri, 14 Nov 2014 09:23:27 +0800 (CST)
From: lxtestfor1  <lxtestfor1@163.com>
To: 
    "=?UTF-8?B?5ZOO5ZGA5oiR5Y676IGU57O75Lq65ZCN5a2X6L+Y6IO96L+Z5LmI6ZW/5ZGi?= =?UTF-8?B?5ZOO5ZGA5oiR5Y676IGU57O75Lq65ZCN5a2X6L+Y6IO96L+Z5LmI6ZW/5ZGi?=" <260919069@qq.com>
Subject: Desert
X-Priority: 3
X-Mailer: Coremail Webmail Server Version SP_ntes V3.5 build
 20140915(28949.6690) Copyright (c) 2002-2014 www.mailtech.cn 163com
X-CM-CTRLDATA: lsbi22Zvb3Rlcl9odG09MTM2MToyODg=
Content-Type: multipart/alternative; 
    boundary="----=_Part_298085_1819449493.1415928207804"
MIME-Version: 1.0
Message-ID: <79453ade.120e4.149abe5d9bd.Coremail.lxtestfor1@163.com>
X-CM-TRANSID:DsGowABnhsKQWWVU1LYwAA--.5508W
X-CM-SenderInfo: 5o0wv2xwir2ii6rwjhhfrp/1tbiJxdEulEAJdPvGwABsc
X-Coremail-Antispam: 1U5529EdanIXcx71UUUUU7vcSsGvfC2KfnxnUU==

------=_Part_298085_1819449493.1415928207804
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: base64

CgoKCgoKLS0K77u/6L+Z5piv57qi6Imy55qEIO+7v+i/meaYr+m7keS9kyDov5nmmK/mlpzkvZPv
u78gIOi/meacieS4i+WIkue6vwoKCui/mei/mOacieW9k+WJjeaXpeacnzIwMTTlubQxMeaciDE0
5pelCgoKCgoKCgo=
------=_Part_298085_1819449493.1415928207804
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: base64

PGRpdiBzdHlsZT0ibGluZS1oZWlnaHQ6MS43O2NvbG9yOiMwMDAwMDA7Zm9udC1zaXplOjE0cHg7
Zm9udC1mYW1pbHk6QXJpYWwiPjxkaXYgc3R5bGU9ImxpbmUtaGVpZ2h0OjEuNztjb2xvcjojMDAw
MDAwO2ZvbnQtc2l6ZToxNHB4O2ZvbnQtZmFtaWx5OkFyaWFsIj48ZGl2IHN0eWxlPSJsaW5lLWhl
aWdodDoxLjc7Y29sb3I6IzAwMDAwMDtmb250LXNpemU6MTRweDtmb250LWZhbWlseTpBcmlhbCI+
PGJyPjxicj48YnI+PGJyPjxicj48ZGl2Pi0tPGJyPjxzcGFuIHN0eWxlPSJjb2xvcjogcmdiKDEz
NiwgMCwgMCk7Ij7vu7/ov5nmmK/nuqLoibLnmoQmbmJzcDs8c3BhbiBzdHlsZT0iY29sb3I6IHJn
YigxMjgsIDAsIDEyOCk7Ij48Yj7vu7/ov5nmmK/pu5HkvZMgPGkgc3R5bGU9ImNvbG9yOiByZ2Io
MCwgMCwgMCk7IGJhY2tncm91bmQtY29sb3I6IHJnYigyNTUsIDI1NSwgMjU1KTsiPjxzcGFuIHN0
eWxlPSJjb2xvcjogcmdiKDAsIDAsIDApOyI+6L+Z5piv5pac5L2TPC9zcGFuPjxzcGFuIHN0eWxl
PSJjb2xvcjogcmdiKDAsIDAsIDApOyI+77u/ICZuYnNwOzx1Pui/meacieS4i+WIkue6vzwvdT48
L3NwYW4+PC9pPjwvYj48L3NwYW4+PC9zcGFuPjxkaXY+PHNwYW4gc3R5bGU9ImNvbG9yOiByZ2Io
MTM2LCAwLCAwKTsiPjxzcGFuIHN0eWxlPSJjb2xvcjogcmdiKDEyOCwgMCwgMTI4KTsiPjxiPjxp
IHN0eWxlPSJjb2xvcjogcmdiKDAsIDAsIDApOyBiYWNrZ3JvdW5kLWNvbG9yOiByZ2IoMjU1LCAy
NTUsIDI1NSk7Ij48c3BhbiBzdHlsZT0iY29sb3I6IHJnYigwLCAwLCAwKTsiPjx1Pjxicj48L3U+
PC9zcGFuPjwvaT48L2I+PC9zcGFuPjwvc3Bhbj48L2Rpdj48ZGl2PjxzcGFuIHN0eWxlPSJjb2xv
cjogcmdiKDEzNiwgMCwgMCk7Ij48c3BhbiBzdHlsZT0iY29sb3I6IHJnYigxMjgsIDAsIDEyOCk7
Ij48Yj48aSBzdHlsZT0iY29sb3I6IHJnYigwLCAwLCAwKTsgYmFja2dyb3VuZC1jb2xvcjogcmdi
KDI1NSwgMjU1LCAyNTUpOyI+PHNwYW4gc3R5bGU9ImNvbG9yOiByZ2IoMCwgMCwgMCk7Ij48dT7o
v5nov5jmnInlvZPliY3ml6XmnJ88L3U+PC9zcGFuPjwvaT48L2I+PC9zcGFuPjwvc3Bhbj4yMDE0
5bm0MTHmnIgxNOaXpTwvZGl2PjxkaXY+PGJyPjwvZGl2PjxkaXY+PGEgdGFyZ2V0PSJfYmxhbmsi
IGhyZWY9Imh0dHA6Ly8xNjMuY29tIj48aW1nIHNyYz0iaHR0cDovL2ltZzQucGljYmVkLm9yZy91
cGxvYWRzLzIwMTQvMDQvaW1hZ2VzKDIpLmpwZyI+PC9hPjwvZGl2PjwvZGl2PjwvZGl2Pjxicj48
YnI+PHNwYW4gdGl0bGU9Im5ldGVhc2Vmb290ZXIiPjxzcGFuIGlkPSJuZXRlYXNlX21haWxfZm9v
dGVyIj48L3NwYW4+PC9zcGFuPjwvZGl2Pjxicj48YnI+PHNwYW4gdGl0bGU9Im5ldGVhc2Vmb290
ZXIiPjxzcGFuIGlkPSJuZXRlYXNlX21haWxfZm9vdGVyIj48L3NwYW4+PC9zcGFuPjwvZGl2Pjxi
cj48YnI+PHNwYW4gdGl0bGU9Im5ldGVhc2Vmb290ZXIiPjxzcGFuIGlkPSJuZXRlYXNlX21haWxf
Zm9vdGVyIj48aHIvPgo8ZGl2IHN0eWxlPSJmb250LXNpemU6MTRweDtjb2xvcjojNjY2O2xpbmUt
aGVpZ2h0OjEuNjY2Ij7mnIDlpb3orrDnmoTpgq7nrrHvvJrmiYvmnLrlj7fnoIFAMTYzLmNvbTxi
ci8+5peg6ZyA5rOo5YaM77yM55+l6YGT5omL5py65Y+35bCx6IO957uZ5LuW5Y+R6YKu5Lu2Cjxh
IGhyZWY9Imh0dHA6Ly9zaG91amkuMTYzLmNvbS9tb2JpbGVtYWlsL2hvbWUuZG8/ZnJvbT1xbWFp
bCIgdGFyZ2V0PSJfYmxhbmsiPuS6huino+ivpuaDhSZndDsmZ3Q7PC9hPgo8L2Rpdj4KPC9zcGFu
Pjwvc3Bhbj4=
------=_Part_298085_1819449493.1415928207804--

pitiphong-p commented 9 years ago

I think I found the cause of the bug. The algorithm in the mailmime_encoded_phrase_parse function is wrong. Currently the function extracts the encoded data, decodes it line by line, and concatenates the decoded results at the end. I think the correct way is to extract the encoded data from every line, concatenate the encoded data, and only then decode it.

dinhvh commented 9 years ago

Could you show me the result of libetpan and what you're expecting? Thanks.

pitiphong-p commented 9 years ago

Subject: =?UTF-8?B?4Lij4Liw4LmA4Lia4Li04LiU4LiE4Lin4Liy4Lih4Lih4Lix4LiZ4Liq4LmM?= =?UTF-8?B?4LmA4LiV4LmH4Lih4Lie4Li04LiB4Lix4LiUIFRSQU5TRk9STUVSUyA0IOC4?= =?UTF-8?B?oeC4seC4meC4quC5jOC4hOC4o+C4muC4l+C4uOC4geC4o+C4sOC4muC4miDg?= =?UTF-8?B?uJfguLXguYjguYDguJTguLXguKLguKfguYPguJnguYDguKHguLfguK3guIfg?= =?UTF-8?B?uYTguJfguKI=?=

Expected result: ระเบิดความมันส์เต็มพิกัด TRANSFORMERS 4 มันส์ครบทุกระบบ ที่เดียวในเมืองไทย

libetpan result: ระเบิดความมันส์เต็มพิกัด TRANSFORMERS 4 ?ันส์ครบทุกระบบ ??ี่เดียวในเมือง??ทย

dinhvh commented 9 years ago

I think the MIME RFC (RFC 2047) describes each item as an "encoded word", and therefore each word should be decoded separately. So mailmime_encoded_phrase_parse has the correct behavior.

Of course, it could be improved to support wrong encoding such as the one you showed here.

pitiphong-p commented 9 years ago

Each Thai character needs 3 bytes in UTF-8. I think some characters might be split across two lines, so when you decode the data line by line, you get incomplete UTF-8 character data. So I think we should combine the encoded data into one line first and decode it at the end.
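To make the split visible, here is a minimal sketch in C of a check for a dangling UTF-8 sequence at the end of a decoded word. The function name `utf8_incomplete_tail` is hypothetical and is not part of libetpan; it only looks at sequence lengths and does not validate the bytes fully.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return how many bytes at the end of buf begin a multi-byte UTF-8
 * sequence whose remaining bytes are missing, or 0 if the buffer ends
 * on a character boundary. Only lengths are checked, not full validity.
 */
static size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back = 0;
    size_t need;
    unsigned char lead;

    assert(buf != NULL || len == 0);

    /* Walk back over at most 3 continuation bytes (10xxxxxx). */
    while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == len)
        return 0; /* empty, or nothing but continuation bytes */

    lead = buf[len - 1 - back];
    if ((lead & 0x80) == 0x00)
        need = 1; /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3; /* Thai characters fall in this range */
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return 0; /* stray continuation byte or invalid lead byte */

    /* back + 1 bytes are present; report them if the sequence is short. */
    return (back + 1 < need) ? back + 1 : 0;
}
```

The second encoded word of the Subject header earlier in the thread ends with the base64 group `IOC4`, which decodes to the bytes 20 E0 B8. The check reports 2 dangling bytes there, which is consistent with the first `?` in the libetpan result.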

dinhvh commented 9 years ago

Sure. That's what happens with wrong MIME encoding. Proper MIME encoding won't split a character across two lines.

And of course, a good improvement would be to handle those wrong encodings.
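One possible shape for such an improvement, sketched in C under stated assumptions: decode each base64 payload to raw bytes, append the bytes to one shared buffer, and run charset conversion once over the whole buffer. This is not libetpan's code. The helper names `b64_value`, `b64_append`, and `decode_b64_words` are hypothetical, and the sketch assumes the payloads were already extracted from their `=?charset?B?...?=` wrappers and share one charset. Appending decoded bytes, rather than concatenating the base64 text itself, is a deliberate variation on the proposal above: it keeps working when a word in the middle ends with `=` padding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not valid. */
static int b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/*
 * Decode in[0..in_len) and append the bytes to out, starting at *out_len.
 * Decoding stops at the first '=' padding character.
 * Returns 0 on success, -1 on an invalid character or a full buffer.
 */
static int b64_append(const char *in, size_t in_len,
                      unsigned char *out, size_t out_cap, size_t *out_len)
{
    unsigned int acc = 0; /* bit accumulator */
    int bits = 0;         /* number of valid bits in acc */
    size_t i;

    assert(in != NULL && out != NULL && out_len != NULL);
    for (i = 0; i < in_len && in[i] != '='; i++) {
        int v = b64_value(in[i]);
        if (v < 0)
            return -1;
        acc = (acc << 6) | (unsigned int)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            if (*out_len >= out_cap)
                return -1;
            out[(*out_len)++] = (unsigned char)((acc >> bits) & 0xFF);
            acc &= (1u << bits) - 1u; /* keep only the unread low bits */
        }
    }
    return 0;
}

/* Decode every payload in order into one contiguous byte buffer. */
static int decode_b64_words(const char *const *payloads, size_t count,
                            unsigned char *out, size_t out_cap, size_t *out_len)
{
    size_t i;

    *out_len = 0;
    for (i = 0; i < count; i++) {
        if (b64_append(payloads[i], strlen(payloads[i]),
                       out, out_cap, out_len) != 0)
            return -1;
    }
    return 0;
}
```

With the two payload fragments `IOC4` and `oeC4`, taken from the boundary between the second and third encoded words of the Subject header earlier in the thread, the buffer holds 20 E0 B8 A1 E0 B8. The bytes E0 B8 A1 form one complete Thai character that straddles the two words, so a single conversion pass over the merged buffer would see it whole.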

pitiphong-p commented 9 years ago

I just read the RFC section on encoded-word in detail. That's great. Thank you for the explanation.

dinhvh commented 9 years ago

I don't. If you have an implementation in mind for fixing it, you could send a pull request. Keep in mind that the encoding of the words could differ.

pitiphong-p commented 9 years ago

I'm not sure what you mean by "encoding for the words could be different"?

dinhvh commented 9 years ago

I meant the charset encoding: (=?ISO-8859-1?Q?a?= =?ISO-8859-2?Q?_b?=)
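A rough sketch in C of the guard this implies: adjacent encoded words would only be candidates for merging when they name the same charset. The helpers `encoded_word_charset` and `same_charset` are hypothetical names, not libetpan functions, and the sketch assumes the plain `=?charset?encoding?text?=` form from RFC 2047. The comparison ignores case because RFC 2047 treats charset names as case-independent.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Locate the charset token of an encoded-word of the form
 * =?charset?encoding?text?=.
 * On success, store its start and length and return 0.
 * Return -1 if the word does not start with a non-empty "=?charset?".
 */
static int encoded_word_charset(const char *word,
                                const char **charset, size_t *len)
{
    const char *end;

    assert(word != NULL && charset != NULL && len != NULL);
    if (strncmp(word, "=?", 2) != 0)
        return -1;
    end = strchr(word + 2, '?');
    if (end == NULL || end == word + 2)
        return -1;
    *charset = word + 2;
    *len = (size_t)(end - (word + 2));
    return 0;
}

/* Return 1 if both encoded-words name the same charset, else 0. */
static int same_charset(const char *a, const char *b)
{
    const char *ca;
    const char *cb;
    size_t la;
    size_t lb;
    size_t i;

    if (encoded_word_charset(a, &ca, &la) != 0)
        return 0;
    if (encoded_word_charset(b, &cb, &lb) != 0)
        return 0;
    if (la != lb)
        return 0;
    for (i = 0; i < la; i++) {
        /* Cast to unsigned char before tolower to avoid undefined behavior. */
        if (tolower((unsigned char)ca[i]) != tolower((unsigned char)cb[i]))
            return 0;
    }
    return 1;
}
```

For the pair in the comment above, `same_charset("=?ISO-8859-1?Q?a?=", "=?ISO-8859-2?Q?_b?=")` returns 0, so those two words would still be decoded and converted separately.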

pitiphong-p commented 9 years ago

I see. I don't have any idea for the fix yet. I need some time to think about it.