bbottema / outlook-message-parser

A Java parser for Outlook messages (.msg files)
76 stars 35 forks source link

Further improve X500 addresses detection, don't overwrite existing address but do retain X500 address if available #73

Closed sanastasiadis closed 5 months ago

sanastasiadis commented 5 months ago

To detect X500 addresses, the regular expression is used: "/o=[^/]+/ou=[^/]+(?:/cn=[^/]+)?" This regular expression accepts only one appearance of "cn=".

It is possible to have X500 addresses with multiple "cn=" parts, eg: "/o=Test/ou=Unit/cn=Recipients/cn=anassta". In this case the above regular expression will not work.

Proposed solution is to change the regular expression to search for "one or more" "cn=" parts, by changing the final "?" into "+". Eg: "/o=[^/]+/ou=[^/]+(?:/cn=[^/]+)+".

I tested with the proposed regular expression and it works for both cases:

  1. "/o=Test/ou=Unit/cn=Recipients/cn=anassta"
  2. "/o=Test/ou=Unit/cn=anassta"
bbottema commented 5 months ago

Hi, thanks for helping out!

To detect X500 addresses, the regular expression is used: "/o=[^/]+/ou=[^/]+(?:/cn=[^/]+)?" This regular expression accepts only one appearance of "cn=".

To be specific, it means it accepts at most only one. So it's optional.

Proposed solution is to change the regular expression to search for "one or more" "cn=" parts, by changing the final "?" into "+". Eg: "/o=[^/]+/ou=[^/]+(?:/cn=[^/]+)+".

This, however, makes it mandatory (one or more).

I guess, the correct way to handle both the case you found and to be backwards compatible, is to make it zero or more. So

/o=[^/]+/ou=[^/]+(?:/cn=[^/]+)*

What do you think?

sanastasiadis commented 5 months ago

Yes, you are right. Using "*" was an afterthought for me as well. I will align my PR.

bbottema commented 5 months ago

Change released in 1.13.1

bbottema commented 5 months ago

I may have to revert this change, as it caused a test case in Simple Java Mail to fail.

java.lang.AssertionError: Expecting actual: [Recipient{name='Sven Sielenkemper', address='sielenkemper@otris.de', type=To}, Recipient{name='sven.sielenkemper@gmail.com', address='sven.sielenkemper@gmail.com', type=To}] to contain exactly (and in same order): [Recipient{name='Sven Sielenkemper', address='/o=otris/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=ad5e65c5e53d40e2825d738a39904682-sielenkemper', type=To}, Recipient{name='sven.sielenkemper@gmail.com', address='sven.sielenkemper@gmail.com', type=To}] but some elements were not found: [Recipient{name='Sven Sielenkemper', address='/o=otris/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=ad5e65c5e53d40e2825d738a39904682-sielenkemper', type=To}] and others were not expected: [Recipient{name='Sven Sielenkemper', address='sielenkemper@otris.de', type=To}]

For the .eml and .msg files in this zip: #486 TestValidSigned messages.zip (from #40)

Perhaps you can have a look at it?

sanastasiadis commented 5 months ago

It seems this error comes from the test class EmailConverterTest of simple-java-mail, and it is the test _Github486InvalidSignedOutlookMessage.

This test compares the results of parsing between the .eml version of the file and the .msg version of the file. The parsed eml version of the file is represented as the "actual" here, and the parsed msg version of the file is the "expected" one.

The "actual" list of recipients comes from simple-java-mail: org.simplejavamail.converter.internal.mimemessage.MimeMessageParser.

The "expected" list of recipients comes from outlook-message-parser:

org.simplejavamail.outlookmessageparser.OutlookMessageParser.

Apparently the "expected" list of recipients is as expected (:smiley:) and here follows an explanation:

In the file "#486 TestInvalidSignedOutlookMessage.msg" It looks that the entry for

[Recipient{name='Sven Sielenkemper', address='sielenkemper@otris.de', type=To}]

has both properties:

0x3003: /o=otris/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=ad5e65c5e53d40e2825d738a39904682-sielenkemper

and

0x39fe: sielenkemper@otris.de

However the code handles it correctly because:

  1. The code in OutlookRecipient.handleNameAddressProperty, seems it doesn't overwrite the address value if it is not null.
  2. The properties are encountered and parsed in the following order: first the 0x3003 and then the 0x39fe. So, the value of 0x3003 is not overwritten (see pt1).
  3. The regular expression is correct to match multiple "cn" occurrences.

As I see the error produced by the unit test in your comment, the "actual" result could have been produced with the previous version of the regular expression (with the ?). I reproduced the issue parsing the msg file locally.

Using the previous version of the regular expression (with the ?), the X500 address is not recognised and the address field is left null to be filled with the address property of 0x39fe which comes next when parsing. Output:

[name: Sven Sielenkemper, address: sielenkemper@otris.de] [name: sven.sielenkemper@gmail.com, address: sven.sielenkemper@gmail.com]

On the other hand, the suggested version of the regular expression (with the *) matches correctly the given X500 address, and as the 0x3003 property comes first, it fills the address field. Then when parsing the 0x39fe, it is skipped as the address field is not null anymore. Output as expected:

[name: Sven Sielenkemper, address: /o=otris/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=ad5e65c5e53d40e2825d738a39904682-sielenkemper] [name: sven.sielenkemper@gmail.com, address: sven.sielenkemper@gmail.com]

I'm now thinking that since both properties 0x3003 and 0x39fe exist, maybe the latter should overwrite the first one, because the email address has "priority" over the X500 address? What do you think?

bbottema commented 5 months ago

Thank you so much for having a good look, you explained it well.

Regarding multiple apparent recipients (I'm still getting used to X500, as it strikes me as totally alien), I'm not sure myself. For easy reproduction, I've added the email and test case in this project in the branch bugfix/#73-x500-recipient-issue.

This test turns ✔ green:

@Test
public void testGithubIssue73_AttachmentNameWithSpecialCharacters()
        throws IOException {
    OutlookMessage msg = parseMsgFile("test-messages/OutlookMessage with X500 dual address.msg");

    // just assert what currently comes back from the parser, not what the parser should do, because I'm not sure yet what the parser should do
    OutlookMessageAssert.assertThat(msg).hasOnlyRecipients(
        createRecipient("Sven Sielenkemper", "/o=otris/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=ad5e65c5e53d40e2825d738a39904682-sielenkemper"),
        createRecipient("sven.sielenkemper@gmail.com", "sven.sielenkemper@gmail.com")
    );
}

I'm not sure myself what we should be expecting here.

Thoughts? Perhaps we should just compare it to what Outlook does:

Sven Sielenkemper sielenkemper@otris.de sven.sielenkemper sven.sielenkemper@gmail.com

image

sanastasiadis commented 5 months ago

Yes, apparently Outlook has a place for every property to accommodate them. :-) However, in the case of outlook-message-parser there is only a single "address" field for every recipient.

The most "proper" solution maybe would be to add an extra field to the OutlookRecipient to accommodate the X500 address in case it exists? The client then would pick-up whatever serves better to him. What do you think?

bbottema commented 5 months ago

Indeed, that may be the best way forward. Then the primary behaviour mimics Outlook, while preserving X500 data if available.

bbottema commented 5 months ago

I created PR https://github.com/bbottema/outlook-message-parser/pull/75 for this. This seems to work well. Perhaps you can have a look as well.

bbottema commented 5 months ago

New fix released in 1.13.2. Thanks again!