SpamScope / mail-parser

Tokenizer for raw mails
https://pypi.python.org/pypi/mail-parser
Apache License 2.0

headers with the same name get clobbered #58

Closed dfeinzeig closed 3 years ago

dfeinzeig commented 5 years ago

some headers, such as Authentication-Results, can occur multiple times in a message. the current code clobbers previous values.

it seems like the following places need to be updated so that a header can hold a list of the values found in a message. either every header could have a list value, or only the headers that occur more than once.

the current code uses email.message.Message.get(), which returns a single value per header name, but needs to use email.message.Message.get_all(). https://docs.python.org/3/library/email.message.html#email.message.EmailMessage.get_all
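
For illustration, here is a minimal sketch of the difference using only the standard library, not mail-parser itself. The message text is made up for the example, with X-Test-Key repeated.

```python
import email

# A made-up message in which X-Test-Key occurs twice.
raw = (
    "X-Test-Key: value1\n"
    "X-Test-Key: value2\n"
    "Subject: demo\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(raw)

first_only = msg.get("X-Test-Key")      # a single value; the other one is lost
all_values = msg.get_all("X-Test-Key")  # every value, in message order

print(first_only)
print(all_values)
```

Any code that builds a dict of headers from get() ends up with one value per name, which is the clobbering described above.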

netr0m commented 5 years ago

:+1: Came here to say this. This applies for parsed .msg files (parse_from_file_msg) as well

netr0m commented 5 years ago

In the meantime, you can work around this for Received headers with:

headers = message.headers
headers['Received'] = [recv.replace('\n', '') for recv in message.received_raw]

For the other headers, I don't think that's currently possible, since they have no _raw counterpart.

dfeinzeig commented 5 years ago

@mortea15 are you seeing this happen for received headers? The receiveds property looks like it uses message.get_all().

fedelemantuano commented 3 years ago

I need to have a raw mail to test it.

If you still have this problem, open another issue.

dfeinzeig commented 3 years ago

you just need to duplicate a header with a different value...

Return-Path: <suvorov.s@nalg.ru>
Delivered-To: kinney@noth.com
Received: (qmail 11769 invoked from network); 22 Aug 2016 14:23:01 -0000
Received: from smtprelay0207.b.hostedemail.com (HELO smtprelay.b.hostedemail.com) (64.98.42.207)
  by smtp.server.net with SMTP; 22 Aug 2016 14:23:01 -0000
Received: from filter.hostedemail.com (10.5.19.248.rfc1918.com [10.5.19.248])
    by smtprelay06.b.hostedemail.com (Postfix) with ESMTP id 2CC378D014
    for <kinney@noth.com>; Mon, 22 Aug 2016 14:22:58 +0000 (UTC)
Received: from DM6PR06MB4475.namprd06.prod.outlook.com (2603:10b6:207:3d::31)
 by BL0PR06MB4465.namprd06.prod.outlook.com with HTTPS id 12345 via
 BL0PR02CA0054.NAMPRD02.PROD.OUTLOOK.COM; Mon, 1 Oct 2018 09:49:22 +0000
Received: from DM3NAM03FT035.eop-NAM03.prod.protection.outlook.com
 (2a01:111:f400:7e49::205) by CY4PR0601CA0051.outlook.office365.com
 (2603:10b6:910:89::28) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384) id 15.20.1185.23 via Frontend
 Transport; Mon, 1 Oct 2018 09:49:21 +0000
X-Session-Marker: 6A64617A657940616C6578616E646572736D6974682E636F6D
X-Spam-Summary: 69,4.5,0,,d41d8cd98f00b204,suvorov.s@nalg.ru,:,RULES_HIT:46:150:152:379:553:871:967:989:1000:1254:1260:1263:1313:1381:1516:1517:1520:1575:1594:1605:1676:1699:1730:1747:1764:1777:1792:1823:2044:2197:2199:2393:2525:2560:2563:2682:2685:2827:2859:2911:2933:2937:2939:2942:2945:2947:2951:2954:3022:3867:3872:3890:3934:3936:3938:3941:3944:3947:3950:3953:3956:3959:4425:5007:6001:6261:6506:6678:6747:6748:7281:7398:7688:8599:8824:8957:9009:9025:9388:10004:10848:11604:11638:11639:11783:11914:12043:12185:12445:12517:12519:12740:13026:14149:14381:14658:14659:14687:21080:21221:30054:30055:30065:30066,0,RBL:none,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:5,LUA_SUMMARY:none
X-HE-Tag: print38_7083d7fd63e24
X-Filterd-Recvd-Size: 64695
X-Test-Key: value1
X-Test-Key: value2
Received: from computer_3436 (unknown [43.230.105.145])
    (Authenticated sender: jdazey@alexandersmith.com)
    by omf06.b.hostedemail.com (Postfix) with ESMTPA
    for <kinney@noth.com>; Mon, 22 Aug 2016 14:22:52 +0000 (UTC)
From: =?UTF-8?B?0YHQu9GD0LbQsdCwINCk0J3QoSDQlNCw0L3QuNC40Lsg0KHRg9Cy0L7RgNC+0LI=?= <suvorov.s@nalg.ru>
To: kinney@noth.com
Subject: =?UTF-8?B?0L/QuNGB0YzQvNC+INGD0LLQtdC00L7QvC3QtQ==?=

fedelemantuano commented 3 years ago

This was an issue in the headers function; the mail itself is correct.

def get_header(message, name):
    """
    Gets an email.message.Message and a header name and returns
    the mail header decoded with the correct charset.

    Args:
        message (email.message.Message): email message object
        name (string): header to get

    Returns:
        str if the header occurs once (empty string if it is missing)
        list if it occurs more than once
    """

    headers = message.get_all(name)
    log.debug("Getting header {!r}: {!r}".format(name, headers))
    if headers:
        headers = [decode_header_part(i) for i in headers]
        if len(headers) == 1:
            # in this case return a string
            return headers[0].strip()
        # in this case return a list
        return headers
    return six.text_type()

get_header can return either a string or a list, but I didn't use it in headers. That was the bug. Thanks a lot for reporting the issue.
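
As a rough sketch of the behaviour discussed in this thread (not the actual mail-parser code), a headers mapping that keeps repeated values could look like the following. The name collect_headers is hypothetical, and the standard library's decode_header/make_header stand in for the library's own decode_header_part.

```python
import email
from email.header import decode_header, make_header


def collect_headers(message):
    """Map each header name to a str (one value) or a list (several values).

    Hypothetical helper for illustration; not part of mail-parser.
    """
    result = {}
    # message.keys() repeats a name once per occurrence, so de-duplicate
    # while keeping the original order.
    for name in dict.fromkeys(message.keys()):
        values = [
            str(make_header(decode_header(value))).strip()
            for value in message.get_all(name, [])
        ]
        result[name] = values[0] if len(values) == 1 else values
    return result


raw = (
    "X-Test-Key: value1\n"
    "X-Test-Key: value2\n"
    "To: kinney@noth.com\n"
    "\n"
)
headers = collect_headers(email.message_from_string(raw))
print(headers)
```

This follows the same convention as get_header: a plain string for a header that occurs once, a list for one that repeats. Callers that want a uniform type could always return a list instead.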