apache / incubator-ponymail

Apache Pony Mail (Incubating) - Email for Ponies & People
http://ponymail.incubator.apache.org/
Other
80 stars 30 forks source link

Add DKIM style ID generation #517

Open sbp opened 4 years ago

sbp commented 4 years ago

This PR adds DKIM style Ponymail ID generation.

Why?

There are a number of existing Ponymail ID generators, two of which are currently recommended: full for a single node, and cluster for multiple nodes. The purpose of the latter in particular is to generate an ID based on the hash of the email once elements that may vary across cluster nodes have been excluded.

There are other situations in which elements of an email may vary from environment to environment. Once such scenario is that archives of a mailing list collected by different people or organisations in different locations may contain different actual email sources. The Received headers, for example, will be different depending on the routes of emails sent out by the mailing list software.

Variability in emails is a problem in another area too: guaranteeing email authenticity. Authenticity has been studied and gradually solved over the years by technologies such as SPF, DKIM, and ARC. Since DKIM involves signing emails, the designers had to solve email variability in such a way that signatures would be consistent. They did this by allowing signers to (a) subset the original headers before signing, and (b) apply different canonicalisation algorithms to the headers and to the body.

This PR solves the problem of robust ID generation by leveraging the existing DKIM mechanism for robust signing.

What?

To generate a DKIM style Ponymail ID, we first parse the email using a superset of the algorithm used by the popular Python dkimpy package. We then subset the headers to an RFC recommended subset, and apply DKIM relaxed/simple canonicalisation. Finally, we hash the canonicalised subset message using SHA3-256, and encode the first 80 bits of the digest using a custom base32 alphabet. The encoded digest prefix is the resulting Ponymail ID.

How?

We first implemented a superset of the RFC 822 parser in dkimpy (the reference parser) from scratch.

Why did we use this reference parser instead of following an algorithm in an RFC? Neither RFC 822 nor RFC 6376 (the DKIM RFC) nor any other RFC that we found gives a parsing algorithm for inputs which are broken. Ponymail must generate an ID no matter what the form of the input is. Therefore we followed an RFC 6376 implementation as a reference because it already covers some broken inputs.

But we went further. Our version shares no code with the original parser, and has other advantages in that it:

We subset the email using headers recommended by RFC 4871, the precursor to RFC 6376. We do this because the former had a more comprehensive set of recommendations on which headers to include. We sent the following email, which explains the situation in detail, to the authors of RFC 6376:

In RFC 6376 you removed several entries from the list of headers in
_Recommended Signature Content_ (5.4.1) that had been present in RFC
4871. The (Resent-)Sender and (Resent-)Message-ID headers were
removed, as well as all MIME headers.

In an _INFORMATIVE OPERATIONS NOTE_ (5.4) earlier on in RFC 6376,
though, it is "highly advised" that both Sender and all MIME headers
are present. This earlier NOTE is unchanged from RFC 4871.

Later on in RFC 6376 the removal of Message-ID is discussed, and its
new status is explained to be contextual. But what was the rationale
for removing the other headers, and what is their new status? Was the
NOTE accidentally not updated when the other headers were removed from
the list? Or are (Resent-)Sender and the MIME headers still
recommended for inclusion?

RFC 6376, whilst contradicting itself, does at least say that ultimately the choice is up to the implementer. We use the more comprehensive recommendations of RFC 4871 as proving a better baseline. We also add the DKIM-Signature header itself so that signed and unsigned messages cannot have the same ID.

We hash the result with SHA3-256. RFC 6376 and its successors do not yet provide for signatures using SHA3-256. We chose SHA3-256 as it is more likely to be resistant to cryptanalysis than SHA2, and the sponge construction has more practical uses such as STROBE, and therefore SHA3 is more likely to be robust. We could have used SHAKE for a shorter digest to truncate from, but SHA3-256 is more ubiquitous. We use only the first 80 bits, to make IDs easier to share; Ponymail IDs are IDs, not hashes. Although 80 bytes is enough to make collisions in normal use unlikely, collision attacks are still possible. To avoid collision attacks, we allow configuration of a nonce which is added as a header to the email before canonicalisation.

Finally, we encode the output using base32, but with the alphabet [0-9b-df-hj-tv-z] in place of the normal base32 alphabet. We position 0-9 first by analogy with base16. We use a-z without a, e, i, u to minimise the probability that cultural taboo words will appear in generated IDs. We call this the pibble encoding, which is short for "pony nibble".

sbp commented 4 years ago

Please note that this PR also adds an extra parameter passing the bytes of the original message to the compute_updates and generator functions in archiver.py and generators.py respectively. This is because the DKIM style ID generator works on the original bytes, and recovering these from msg.as_bytes() is not only slower but also more fragile in that the behaviour of that method depends on the upstream implementation in the Python standard library.

sebbASF commented 4 years ago

Thanks very much. This looks very good.

However I think the hash won't be sufficiently unique, given that PonyMail uses the hash both for the Permalink and for storing the copy of the email.

It is possible that two separate messages sent to a mailing list can end up with the same hash. AFAICT, the sender has control over all the fields that are checked (assuming the MTA does not generate DKIM headers), so can send mails that are duplicates as far as the DKIM calculation is concerned.

But list software such as ezmlm generally does not care what an email contains and would treat this as a new message, so would generate a new sequence number for the message. This could result in gaps in the PonyMail record, since that uses the hash as a unique id for the message.

There would be no way to know what caused the missing sequence number -- was it such a duplicate, or did the email get lost in transit? -- without access to a separate archive.

I think this problem can be avoided by including some of the last received headers in the hash. The ones after any List-xxx headers added by the list software will relate to the journey the mail took to reach the mail server, so should not be affected by subsequent deliveries to mail archivers.

Note: this is only an issue because PonyMail uses the same hash for Permalinks and the database id.

Humbedooh commented 4 years ago

Isn't that just trading one potential risk (if you will) for another? If you keep any received headers in the permalink ID, you risk losing reproducibility in permalinks for emails that either don't have a list-id, or are sent via multiple lists.

I think the DKIM list makes sense, and would trade the stable generated hash over whether or not you can debug ezmlm with it. ezmlm's return path only applies if your email address is the direct recipient after the list was invoked. If it goes past that, you may lose it, and the index number will be lost anyway. I'm also unsure whether it's Pony Mail's job to debug whether postfix/etc received an email or not.

Having said all that, what I could envision is for the next generation pony mail to use this generator for the permalink, but store multiple copies of the source, as they come in, and referencing them in a metadata table (or just storing the pibble in the source as well, but not use it for the document ID). When viewing the source, you could be presented with all existing options. Thus, you would have just the one copy of the email when viewing the list (which I think makes the most sense in any case), but multiple options when viewing the source.

That is, to sum up, I like this generator and what it accomplishes - I think it looks far better than our previous/current solutions, and if not for this version, I'd be interested in pulling this PR into the next generation of pony mail.

Humbedooh commented 4 years ago

To expand upon my previous comment:

Email A comes in. It gets "pibbled" to abcdefg1234. A SHA3 digest is 123412341234 Email B comes in, identical to A but with a different route. It gets pibbled as abcdefg1234, and the SHA3 digest is now 432143214321 The permalink for BOTH emails would then be abcdefg1234 (as we only ever need that one copy of the contents for basic search/viewing), but because that pibble is stored in both sources (which have different SHA3 digests), you would be able to see both options when wanting to view the source.

sebbASF commented 4 years ago

AFAICT, so long as one only takes into account the Received headers that relate to the hops before arrival at ezmlm, all recipients of the email will be able to generate the same hash. If for some reason the email does not have a list-id or another way of determining whether the Received header was added before arrival at the list server, then ignore the header.

Note that this is not a question of 'debugging' ezmlm. It is a question of knowing why Ponymail does not have a particular email sent by the mailing list. Was it lost in transit, or was it a duplicate?

For example, can you tell me why dev-return-1032-archive-asf-public=cust-asf.ponee.io@ponymail.incubator.apache.org is missing from Ponymail without looking at some other archive?

As to your example, if email A and email B are identical, I don't see how they will get different pibbles/digests unless at least some of the received headers are taken into account. Please explain.

Humbedooh commented 4 years ago

They would have the same pibble, but different SHA3 if the SHA3 is done using the full message source. What headers are/aren't in the source wouldn't matter, as it would refer to the same pibble at any time because of how the DKIM generator works.

As for the return path you mention, that is specific to a specific installation. I don't think we should allow/deny PRs based on just that. I for one don't get such return paths in my mbox source, as it goes through another alias before it hits my inbox - the same could be true for the archiver, in which case it wouldn't matter what the original return path was.

Humbedooh commented 4 years ago

Email A: pibble is abcdefg1234, SHA3 of full message is 123412341234 Email B: pibble is still abcdefg1234, SHA3 of full message is 432143214321 Both have the same pibble, both could show up as variants when you went to look at the source.

When importing from a third source (email C) into a DB from scratch, pibble would again be abcdefg1234, and SHA3 could perhaps be 111222333444, it wouldn't matter as the pibble metadata is the same, so a search would find it in the DB.

Humbedooh commented 4 years ago

Furthermore, blue-skying here, this could be made backwards compatible with older databases easily. For all new sources, store the source document with a pibble field inside (next to the source field). For all old sources, the doc ID is the permalink.

Thus, to figure out if we need to access directly via permalink ID or via a pibble keyword search, we'd just assess whether the ID of the email being accessed is a pibble or not.

Things to consider for later:

sebbASF commented 4 years ago

On closer examination, I see that the DKIM generator has several options as to how the hash is generated.

This means that the generated hash will depend on which options are chosen, rather than just on the mail content.

If for some reason emails have to be reloaded either in the same installation or another, it is vital that the same hashes are generated.

My conclusion is that options cannot be allowed if Permalinks are to be truly permanent.

Humbedooh commented 4 years ago

Color me stupid, but...you would manually have to go in and change those settings to get a different result, would you not? Whether there are options set or not would then not matter, as you could change the behavior however you see fit in any case. What matters is that the defaults, and what is used, stay the same, IMHO.

The nonce is the only option I could see anyone purposefully changing, and that's not defined in the generator itself, but in the pony mail config.

sebbASF commented 4 years ago

Yes, you would have to change the options. However if different instances have different settings, then their hashes won't be the same.

I see the DKIM hash being used as a more reliable Message-Id. This means it must be immutable. Changing the options is akin to changing a Message-Id.

sebbASF commented 4 years ago

The optional nonce makes things worse as there are effectively infinite values it can take. At least with boolean options it would be possible to generate all the different hashes.

Humbedooh commented 4 years ago

If different instance have different settings, then that is the problem of the person that set that up, not us, not the generator's fault. Having options make it easier for others to reuse the code in my view. I don't see it as a fault or a bad thing.

As for the other comment, infinite values is exactly the point of the nonce as I read it. It is to prevent collision attacks from being possible. In my mind, this is a great addition to the generator. If I set up an instance with a nonce, only I can reasonably collide IDs (with a lot of effort still), no one else. If you don't want that extra security, don't set a nonce.

sbp commented 4 years ago

@sebbASF

An 80 bit truncated hash provides 80 bits of preimage resistance, but only 40 bits of collision resistance. In terms of Ponymail, preimage resistance prevents forgery of new messages with existing IDs. Collision resistance prevents a forger generating two different messages with a single new ID.

Back of the envelope calculations, which I am not claiming as the actual security of this PR code (i.e. this must be independently verified), using SUPERCOP benchmarks show that 40 bits enables a forgery under that model in a matter of hours. 64 bit security would be possible only by a nation-state actor. Forgery under 80 bit security is infeasible now, and is likely to remain so for a reasonable duration.

In other words, 80 bit encoded DKIM style IDs are not suitable for use in the way that you suggest, as a nonce-free identifier for abstract emails. They would be far too easily forged, in a matter of hours even with standard hardware. This is why the option for adding a nonce exists.

General forgeries could be mitigated by storing the actual DKIM signature (if present) as metadata within Ponymail, and this would probably be a good idea in any case. Identifier forgeries can be partially mitigated in Ponymail by canonically showing the first message to have been received by the system. But this only works for currently hosted mailing lists; it does not work for message archive imports.

A minimum of 128 bit security is standard for security in general scenarios. That is why 256 bit hashes have become widespread, because they provide 128 bits of collision resistance security. But for Ponymail preimage resistance is the primary concern, and collision resistance less so.

For your use case I would suggest 128 bit identifiers, providing 64 bits of collision resistance. The code could still be written with the idea that collision attacks would be feasible, providing further mitigations even in this very unlikely case. Another advantage of using 128 bits is that SHAKE-128 output could be used untruncated.

The alternative base32 pibble encoding proposed in this PR has the advantages that it is case insensitive and avoids a wide range of cultural taboo substrings, but it is less compact than urlsafe base64, which would be a potential alternative if using longer IDs. Here is an example of a 16 character pibble encoded 80 bit identifier:

t3wfqdtjor8kghwn

And here is a 22 character base64 encoded 128 bit identifier:

MTIzNDU2Nzg5MDEyMzQ1Ng

The hashes of original messages should use 256 bit hashes. They could use SHA3-256 and be stored using urlsafe base64 encoding. Specifically, it is not secure to use 128 bit identifiers because the threat model is different. The DKIM style identifer is a one-to-many mapping, where the many values are preserved. Collisions in that context can be mitigated. The message source identifier is a one-to-one mapping, and so a collision would result in lost data, which is not acceptable.

In summary: if the nonce option is preserved, 80 bit truncated cryptographically secure hash (CSH) identifiers may be suitable for generic messages but not for message sources. If the nonce option is removed, 128 bit CSH identifiers may be suitable for generic messages, but not for message sources. Only 256 bit CSH identifiers are suitable for message sources.

I apologise that you were led astray by the ignis fatuus of the dkim options. They are not user facing options. They are only there for:

The options can even be safely removed if dkim is standardised.

Humbedooh commented 4 years ago

@sbp This sounds like we have two options here then: 1) pibble with 80 bits if nonce is set, 128 bits if no nonce? 2) always use 128 bits for pibbling?

As for using the complete 256 bit CSH for sources, I am fully on board there, but I think this is best saved for the next gen. That doesn't mean the dkim generator isn't suitable for this gen (from what I can tell, it is far superior for clustered environments compared to any other algorithm), but rather that we'd save using the 256 bit CSH for sources for the next generation. Does that make sense?

sebbASF commented 4 years ago

@sbp In the case of the nonce, does the additional security rely on using a variable nonce, or would a fixed nonce be sufficient?

==

Not all emails in archives have List- headers. For example email aliases, and early emails before list managers were common. I think this means that the Permalink must contain the list id separately from the message hash

Humbedooh commented 4 years ago

AIUI, the destination list-id (not the one in the origin, which may not exist) is appended in the generator with:

        headers.append([b"X-Archive-List-ID", xali_value])

the destination list ID is thus both in the hash and also in the mbox document (as it always is). I'm unsure what you mean by the permalink containing the list id. I don't see that as a must, it's in the document itself, and the pibble would change if the destination list ID changed, thus two identical emails for the two separate lists would have different pibbles for each list.

sebbASF commented 4 years ago

I mean that the Permalink should include the list id, as it does at present. For example: aabbcc@<lid>

sbp commented 4 years ago

@sebbASF

One nonce can be used for all messages archived by a host, but it must never be disclosed. It is more accurately called a pepper, which is a secret salt. Once it is set in the configuration it does not have to be changed, but disclosure is catastrophic.

Humbedooh commented 4 years ago

I don't think the ID should include the list name by default, I like it short and neat - makes life easier for people using links :) It could perhaps be an option to append, but I don't really see a need to always have the list ID in the permalink. having a long permalink leads to a worse user experience, and using <@> chars etc often leads to encoding bugs.

sbp commented 4 years ago

@Humbedooh

Collision forgery would require control over entire input messages, unless the source identifier algorithm uses a subset. It also does not enable attacks against the identifiers of existing messages. If a Received header was added and is used to compute the identifier, this increases the difficulty of the attack. If an unpredictable header is added by Ponymail and used, this thwarts attacks even against imported archives. But using a 256 bit CSH of the whole message means you get all the security of the hash and no longer have to threat model such collision forgeries. A 256 bit CSH is cheap and currently reliable security.

It is reasonable to save this feature for the next generation as long as at least one kind of existing message source identifier has enough collision resistance to make these attacks impractical. Since a range of identifiers are available, their security levels could be noted in the documentation so that implementers can understand the security consequences and decide.

sebbASF commented 4 years ago

If an externally provided list-id is included in the hash, then the has will change if the lid changes. Suppose there is an mbox to be imported. If the individual messages all have list-ids there is no need to specify the lid on the command-line. However if you do provide one (which is advisable in case there are any missing ids), then the hashes will be different.

It is vital that the hash depends only on the message source (possibly plus a fixed pepper), otherwise reloading the messages may generate a different hash and thus Permalink.

As to characters such as @<> being a problem, I agree, but there are other characters that could be used to separate the lid from the hash. For example one could use hash_dev.ponymail.apache.org.

I agree that it would be nice to have shorter links, but that should not be at the cost of unstable Permalinks. Now it already looks like it should be possible to use a shorter hash by using base64 so a Permalink of the form:

r3358b63557d3a40e179f6ca498a38b9aaf0b2532aba48bfc03c7a1a0@<dev.project.apache.org> might become MTIzNDU2Nzg5MDEyMzQ1Ng_dev.project.apache.org

Humbedooh commented 4 years ago

Arguably, if you use a custom list ID different from what's in the source, you are going to potentially 404 your permalink in any case if you change it or forget what it was when reimporting, no matter what generator you use. All current generators use the list ID in their input/output. The only difference of real importance, in my view, is that the list ID is visible in current generated IDs, and hidden in the dkim generator to get as short an ID as is reasonably possible (by hidden I mean it's used inside the hash, instead of being plaintext outside the hash).

I will agree that having the list ID in the generated ID can be very helpful for administrators if they need to reimport and had a lot of custom overrides they can't remember or don't have a backed-up configuration of. What if we make this an option for the administrator to decide on? Thus, we could accept this as is, and then have a second, dkim_long or such, where the list-id is appended to the generated ID instead.

the current dkim generator appeals to a certain group of people, and your suggestion of appending the list ID will appeal to other groups of people. Neither solution will be 100.00% stable against all edge cases (take for instance losing your database and re-importing from gmail mbox sources, that would not work).

Let the administrators running PM decide between a shorter ID for neater URLs, or a longer ID that could make recovery easier. But let that be their decision, I don't think we should be imposing one or the other option.

sebbASF commented 4 years ago

"re-importing from gmail mbox sources, that would not work" - why not?

It would certainly work with the mod_mbox software, as that relies on an intrinsic part of the message (Message-Id)

That is surely one of the main ideas behind the dkim hash - generate an id that is the same for all instances of the same original message? It would not be needed if Message-Id were universal and unique, but unfortunately that's not the case.

Humbedooh commented 4 years ago

gmail does some (nasty) normalization of header values, such as lower-casing email addresses, which is not standard practice, so you cannot reliably generate the same hash for all emails if your previous import was based off non-gmail mbox files. I've run into this problem a few times, where the mbox address had caps in it, and gmail removed those caps, hence why I know.

sebbASF commented 4 years ago

In which case maybe dkim should do the same normalisation as GMail to avoid the issue?

Humbedooh commented 4 years ago

I am not aware of an RFC with the changes that GMail employs - in other words, I don't know what they do in addition to lowercasing the sender address, so I can't really say for sure what should happen.

Humbedooh commented 4 years ago

(furthermore, we shouldn't be beholden to proprietary changes outside RFCs)

Humbedooh commented 4 years ago

I am inclined to merge this PR with the fixes required to do so. We can then add a longer version of the output ID for those that wish to include the list ID in the generated document ID.

sebbASF commented 4 years ago

Sorry, but I don't think the discussion is yet complete. We need more reviews by other interested parties. And we need more tests using the same emails from multiple sources.

Humbedooh commented 4 years ago

I'm okay with giving it a few more days for others to review, but at some point we should acknowledge that we have hit the number of reviewers we are going to get. As for testing, I'll see if I can put together something using the same email from different sources. It's a bit out of the current test framework, but shouldn't be too difficult.

Humbedooh commented 4 years ago

I've run 2124 tests (1062 different emails) on dkim using two different sources, and it matches on every single one. I've added it to the unit tests repo.

rbowen commented 4 years ago

I, for one, love having the ID be shorter. The current permalinks are long, cumbersome, and frequently get broken by linebreaks in mail clients, leading me to use link shorteners almost every time I need to use a Ponymail permalink - which kinda defeats the purpose.

I expect I'm missing something but I don't understand the strenuous resistance to this improvement - and it is an improvement.

Definitely +1 to the change.

sebbASF commented 4 years ago

I agree that the generator looks very promising. However, I think it has some issues that need to be solved.

For example, the current implementation includes the lid override in the hash. The override is not an intrinsic part of the source message, so unless the same override is used when reloading, a different id will be generated. This is going to cause problems if lists are renamed. e.g. if one imports an archive for dev@httpd.apache.org, re-importing using the lid 'dev@httpd.apache.org' changes the ids even though it is for the same list. It is advisable to use the lid override, especially when importing older archives which may not have any list headers.

It needs to be tested to see whether it properly distinguishes duplicates. Sometimes MTAs hiccup, and cause the same message to appear twice on a mailing list. These should have different ids.

Does the generator work well with mails that don't have list headers? For example, aliases and older mail archives.

Also the hash strength has yet to be decided; is the current length sufficient?

Humbedooh commented 4 years ago

If lists are renamed, then the permalinks would change on a re-import in any case, so I consider the argument moot. It's not the generator's job to secure against bad backup planning. We expect the same input to generate the same output - if that input changes, we are in no way guaranteeing the same output, and that holds true for all our generators. If you rename a list, the permalink doesn't change, but it would if you re-imported with the new list ID.

As for the same message appearing twice, I can see cases where it could cause duplicates (if included headers differ), but that isn't a desired feature for the end user. In my opinion, it should de-dup messages where the contents are the same, but allow for both sources to be present in the system. This needs to be solved either outside the scope of this PR, or by some general changes to all generators, wherein the source ID is distinct from the permalink. It shouldn't create two different permalinks for the same message.

Humbedooh commented 4 years ago

As for Does the generator work well with mails that don't have list headers?, I see no indication that it would make any difference if a header is present or not.

The pibble length can be debated, but we have to balance the requirement of the administrators and that of the end users. The current length is good enough that random collisions are unlikely to happen (1:10^24 chance of that). It is short enough that people will use the permalinks and not use a URL shortener. I'm not knowledgeable enough to know about the precise preimage attack chances here.

I am however satisfied that this generator works as intended, and improves upon what we have to offer. If that means going from 16 bytes in the pibble to 24, for added preimage security, I'd be okay with that. What I would dislike is doing a full length digest, as that discourages using permalinks in their raw form, which is counterproductive for permalinks.

Humbedooh commented 4 years ago

@sbp what are your thoughts on the pibble length? Is it safe as is, does it need to be longer?

rbowen commented 4 years ago

So ... it sounds like the concerns have been considered, addressed, and tested for?

Like I said, I really look forward to more manageable permalinks.

sbp commented 4 years ago

@sebbASF

It is not yet documented why the command line list ID would need to be present in the permalink. Am I right in thinking that the following is the only use case?

Consider an mbox archive whose messages contain no list IDs in common with the command line list ID imposed by the administrator. All of its messages in Ponymail are later lost, but the original mbox archive file is still available. Since the messages in Ponymail were lost, the command line list ID is also lost. But since the command line list ID was present in permalinks, if a user of the list has that any permalink available to them then the command line list ID can be recovered.

I can think of no other use case.

There are far better data recovery strategies available. One could, for example, maintain a mapping of command line list IDs to any individual DKIM IDs only contained within that list. This is suitable in the case where an entire archive is expected to be recovered. Such a mapping file would be extremely small, on the order of KiB, and would therefore be easily replicated across many systems.

If only individual messages are expected to be recovered, then the mapping of command line list IDs to all DKIM IDs would be necessary. This would only require storing sixteen bytes for every email in the system, so even an archive with a million emails would only require a mapping file of about 15 MiB.

Even in the original suboptimal strategy, it is not necessary to make the command line list ID a mandatory part of a permalink. It could instead be made optional, like labels used in Amazon URLs, some weblog software, and on some news sites, as the following examples demonstrate:

https://www.amazon.com/Apache-Definitive-Guide-Ben-Laurie/dp/0596002033
https://www.amazon.com/Anything-Can-Go-Here/dp/0596002033
https://www.amazon.com/dp/0596002033

https://lobste.rs/s/j7p2ow/what_are_you_doing_this_week
https://lobste.rs/s/j7p2ow/anything_can_go_here
https://lobste.rs/s/j7p2ow

https://www.reuters.com/article/apache-moves-on-traffic-server-machine-learning-projects-idUS57202199920100504
https://www.reuters.com/article/anything-can-go-here-idUS57202199920100504
https://www.reuters.com/article/idUS57202199920100504

Amazon and Reuters use an infix pattern, whereas Lobsters uses a suffix pattern. Users could strip the Ponymail list ID, whether command line or archive metadata derived, from the permalink:

https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng/dev.project.apache.org
https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng/anything.can.go.here
https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng

Or if the malleability of anything.can.go.here is undesirable, the UI software could ensure that the message actually appears in the list ID in the optional part of the URL. But I think that, as @rbowen noted, the first thing any user wants to do with a URL that's too long to easily share is to shorten it, either by taking out optional components or by submitting it to a link shortener.

Links are themselves UI, and they ought to be designed in a user friendly way.

Links which are too long are not user friendly, and this is why sites use IDs like 0596002033, j7p2ow, and idUS57202199920100504, to recapitulate the actual examples mentioned above. They don't use mandatory IDs like MTIzNDU2Nzg5MDEyMzQ1Ng_dev.project.apache.org. Even MTIzNDU2Nzg5MDEyMzQ1Ng could be regarded as too long, but unlike Amazon, Lobsters, and Reuters we have the constraint that we would like to be able to generate the ID again from the content, which means using a hash, which means considering the hash security; and indeed I provided an informal analysis earlier in this thread.

I would very much like wider review and more discussion of this pull request. I notice, however, that the 40 or so messages, from four contributors, currently in this thread compares rather favourably to the following number of messages in the threads of all previous PRs on Ponymail:

0, 0, 0, 5, 2, 3, 0, 2, 1, 0, 3, 3, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 2, 0, 2, 2, 1, 0, 3, 2, 10, 4

Combined, this is 54 messages across every single PR, merged or unmerged. I count that 17 out of 35 PRs were merged. I also counted the number of participants in the threads of only the merged PRs, giving the following figures:

1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 3, 3, 2, 3, 2, 2

I notice, therefore, that the current PR already far exceeds the amount of review of all existing PRs, almost surpassing their combined number of messages, and that the number of contributors to the thread already surpasses that of every existing merged PR.

Despite this, I repeat the call for wider review. Clearly this is a substantial contribution, and many of the prior PRs were trivial. I would especially like, for example, somebody to audit the behaviour of my algorithm vs the reference algorithm in the dkimpy package, and to provide a more formal analysis of the security parameters of the hash.

It is also clear that this PR needs to be modified before it can be accepted. As I understand it, the following modifications could aid consensus:

It would also be useful if objecting participants would concisely state all of their remaining objections.

sebbASF commented 4 years ago

This PR affects various different aspects of generation:

In order to determine its suitability, it first needs to be decided whether the hash is to be used for Permalinks and database ids or just Permalinks, as this will affect the fields that must be taken into account.

If it is to be used only as a Permalink, then it does not matter if a few non-identical messages generate the same hash, so long as the system can display all matches. Obviously the chance of any duplicates should be minimised. I think the current design of the PR largely fulfils that requirement. [There are some edge cases where a single message is sent to a list twice that still need to be investigated. This can occur anywhere between the originator and the mailing list software e.g. if a retransmission occurs (I hope to add some examples to the test corpus soon). There are also some lists with aliases and mails have been copied to the alias. These appear as a separate emails on the list.]

However if it is also used as a database id, then the hash must be unique for different messages.

As to the hash presentation, shorter is better for Permalinks, and the charset is important. However for the internal database id those are not really a concern.

sebbASF commented 4 years ago

The following test: https://github.com/apache/incubator-ponymail-unit-tests/blob/master/yaml/gens-ponymail-dev-1079-1080.yaml shows that currently the generator produces the same hash for 2 different entries on the list.

Admittedly they are the same original message, but it would be better if the generator could distinguish them.

If the dkim generator is to be used for the database id then it is essential that it produces different results.

Humbedooh commented 4 years ago

Assuming the emails are the same with all the same elements according to the DKIM rules, I disagree that it's essential with two IDs - I think they should be the same if the origin email is the same unless it's sent to different lists. I believe de-duplicating this is the best choice for the end user.

I'm not interested in duplicates of the same message in any place other than sources for "cyber-archeological digging". To me this is a matter of choice, I don't see it as a technically broken or incorrect measure. The return path should not be a part of this algorithm.

sbp commented 4 years ago

@sebbASF

The test that you linked to also produces equal hashes when modified to use the current Ponymail recommended cluster generator. The modified test is included below for independent verification. Given that cluster and dkim exhibit the same behaviour, why is cluster included and explicitly recommended but dkim regarded as unsuitable?

generators:
  # Identical mails, different archivers
  # N.B. Should be different from below
  corpus/ponymail-dev-1079-mailarchiesao.mbox:
  corpus/ponymail-dev-1079-mboxvm.mbox:
    cluster:
    - index: 0
      message-id: <CAOwvvydHMY2oa8t4m+UWi8N4o+3JgdqSxujqRueP+OoiNW1EQg@mail.gmail.com>
      generated: ra504b0e636a1ac3de84a3372744b5738bc6a9e34a28645cf94855d5f@<dev.ponymail.apache.org>
  # Identical mails, different archivers
  # N.B. Should be different from above
  corpus/ponymail-dev-1080-mailarchiesao.mbox:
  corpus/ponymail-dev-1080-mboxvm.mbox:
  corpus/ponymail-dev-1080-listsao.mbox:
    cluster:
    - index: 0
      message-id: <CAOwvvydHMY2oa8t4m+UWi8N4o+3JgdqSxujqRueP+OoiNW1EQg@mail.gmail.com>
      generated: ra504b0e636a1ac3de84a3372744b5738bc6a9e34a28645cf94855d5f@<dev.ponymail.apache.org>
sebbASF commented 4 years ago

I do not personally recommend the cluster generator. Its output depends on the parsed message and how attachments are processed. This is not stable over time.

The dkim generator is better than any existing one we have seen so far. It's more a question of whether it can be improved further. If not, so be it, but I think this needs to be determined before proceeding with what may be a temporary solution.

sbp commented 4 years ago

@sebbASF

There are several improvements in the pipeline, including:

These improvements are very nearly complete, waiting to be pushed.

The dkim generator can therefore indeed be improved, but if there are any further suggestions as to what could be better about the approach or the algorithm then it would be helpful if they were noted in this thread.

The aim of the improvements is partly to make the algorithm less idiosyncratic, transforming it into a consistent method which could hypothetically be written up as an RFC or other standalone specification for canonicalised message identifiers.

sebbASF commented 4 years ago

Thanks very much. As already indicated, I think the approach is very good.

There are some areas that may need tweaking, for example which headers are included in the hash, and how the hash is turned into a string for use as a Permalink. These should not affect the basic algorithm.

sbp commented 4 years ago

Commit dfd18eb contains the improvements previously mentioned.

The generator has been renamed to DKIM-ID with the function name dkimid, and most of its code has been moved to a new file called dkim_id.py. This file is not only used as a module by generators.py, but can also be used as a command line script to generate the DKIM-ID of its input.

Moving the code to a new file also facilitated the creation of the extensive test suite in dkim_id_test.py. This test suite uses a combination of doctest and hypothesis to cover many properties of the DKIM-ID code. Although hypothesis is a useful way to test code, it is not a property prover, only a fuzzer. Moreover, properties can still be missed or specified incorrectly. In short, hypothesis is not magic and bugs may remain.

The code has been formatted using isort and black, and checked using flake8 and mypy --strict, in addition to the doctest and hypothesis tests. The doctest documentation in dkim_id_test.py is formatted in reStructuredText, and exported to HTML in the file dkim_id_test.html.

The custom base32 encoding is the same, but X-Archive-List-ID and X-Archive-Nonce have been removed. The DKIM style algorithm has been changed in a variety of minor ways for greater compliance to the relevant RFCs. The reference code that it is based on is now libopendkim, written in C, rather than dkimpy as the former has greater RFC compliance and is also likely more widely used. The DKIM-ID output is now 128 bits, up from 80 bits.

Despite there being many more lines of code, mostly due to improved modularity and documentation, the DKIM-ID algorithm is in many ways simplified compared to the initial version in this pull request.

Thanks to all the commenters who helped to motivate the improvement of this code.

sebbASF commented 4 years ago

I'm just working my way through the new documentation, which seems very thorough.

There is a section in dkim_id_test.html which states that a mbox From line may be confused with a header. It says that:

From MAILER-DAEMON Fri Jul 8 12:08:34 2011

could be interpreted as a header with field name:

From MAILER-DAEMON Fri Jul 8 12

According to the original rfc822 and current rfc5322, header field-names cannot contain spaces, so it's not possible to confuse a line that starts with "From text...:" for a valid header.

[There is an obsolete syntax that allows trailing white-space just before the colon, but that could not match a valid mbox header which must have some printable chars before the colon]

sebbASF commented 4 years ago

We should consider canonicalising 'From ' lines in the message body.

This is because it's not possible to ensure that an email imported from an mbox file will have 'From ' lines in its body correctly unprefixed such that it matches the original.

I think the only way to handle this is to remove all the '>' prefixes before 'From ', i.e. something like

s/\n>+From /\nFrom /

The result may not be the same as the original email, but it would remove all instances of prefixing.

Note: this is only intended to apply to the hash input, not the raw source as stored when archiving or importing.

Humbedooh commented 4 years ago

While I like the idea, I think that's moving beyond the scope of this PR - it sounds like something you could add a flag for, potentially a second generator that normalizes this....but ultimately something we shouldn't force upon this PR.