bbottema / email-rfc2822-validator

The world's only Java-based rfc2822-compliant email address validator and parser
64 stars 13 forks source link

Brackets and Parens not parsed properly and API documentation / usage needs improvement #19

Closed distributev closed 4 years ago

distributev commented 4 years ago

Hello, Based on the documentation I get the below JUnit failures when I would actually expect these tests to pass.

image

Attached you find a simple java maven project to replicate the above Junit test failures. My understanding is that I followed the documentation and, as per the documented examples, the JUnit tests should pass, however they all fail.

email-rfc2822-validator-junit-examples-fail.zip

How these tests should be modified in order to pass?

chconnor commented 4 years ago

Don't quote me on any of this; maybe @bbottema will correct me.

I think anything with <> around the email or "" around the proper name will need ALLOW_QUOTED_IDENTIFIERS to be in the address criteria, or it will fail.

E.g. instead of

                 // these email address validations should normally fail
                assertFalse(EmailAddressValidator.isValid("Kayaks.org <kayaks@kayaks.org>"));

                // but with a configuration they could be allowed
                assertTrue(EmailAddressValidator.isValid("Kayaks.org <kayaks@kayaks.org>",
                                EnumSet.of(EmailAddressCriteria.ALLOW_DOT_IN_A_TEXT)));

I think you want

                // these email address validations should normally fail
                assertFalse(EmailAddressValidator.isValid("Kayaks.org <kayaks@kayaks.org>",
                                EnumSet.of(EmailAddressCriteria.ALLOW_QUOTED_IDENTIFIERS)));

                // but with a configuration they could be allowed
                assertTrue(EmailAddressValidator.isValid("Kayaks.org <kayaks@kayaks.org>",
                                EnumSet.of(EmailAddressCriteria.ALLOW_DOT_IN_A_TEXT,
                                                    EmailAddressCriteria.ALLOW_QUOTED_IDENTIFIERS)));

Also it looks like line 48 of AppTest.java should be using ALLOW_PARENS_IN_LOCALPART, not ALLOW_SQUARE_BRACKETS_IN_A_TEXT.

See EmailAddressCriteria.DEFAULT for the default set of criteria.

bbottema commented 4 years ago

That's correct, but it only fixes the first test (allow dot), the others seem to be bugged.

It seems both the documentation is confusing in this regard and the code is broken. I'll have to dive deep with the debugger to see where it is going awry.


The last test is incorrect: invoking isValid without passing a criteria defaults to EmailAddressCriteria.DEFAULT (which is clearly stated in the javadoc), so the assertion there is incorrect. Passing EnumSet.noneOf(EmailAddressCriteria.class) makes the test pass.

So of these (fixed) tests, brackets and parens criteria don't seem to work properly:

public class AppTest {
    @Test // passes
    public void allowDotInaText() throws Exception {
        // these email address validations should normally fail
        assertFalse(EmailAddressValidator.isValid("Kayaks.org <kayaks@kayaks.org>"));

        // but with a configuration they could be allowed
        assertTrue(EmailAddressValidator.isValid("Kayaks.org <kayaks@kayaks.org>",
                EnumSet.of(ALLOW_DOT_IN_A_TEXT, ALLOW_QUOTED_IDENTIFIERS)));

    }

    @Test // fails
    public void allowSquareBracketsInaText() throws Exception {
        // these email address validations should normally fail
        assertFalse(EmailAddressValidator.isValid("[Kayaks] <kayaks@kayaks.org>"));

        // but with a configuration they could be allowed
        assertTrue(EmailAddressValidator.isValid("[Kayaks] <kayaks@kayaks.org>",
                EnumSet.of(ALLOW_SQUARE_BRACKETS_IN_A_TEXT, ALLOW_QUOTED_IDENTIFIERS)));

    }

    @Test // fails
    public void allowParensInLocalPart() throws Exception {

        // these email address validations should normally fail
        assertFalse(EmailAddressValidator.isValid("\"bob(hi)smith\"@test.com"));

        // but with a configuration they could be allowed
        assertTrue(EmailAddressValidator.isValid("\"bob(hi)smith\"@test.com",
                EnumSet.of(ALLOW_PARENS_IN_LOCALPART, ALLOW_QUOTED_IDENTIFIERS)));

    }

    @Test // passes
    public void allowQuotedIdentifiers() throws Exception {
        // these email address validations should normally fail
        assertFalse(EmailAddressValidator.isValid("\"John Smith\" <john.smith@somewhere.com>",
                EnumSet.noneOf(EmailAddressCriteria.class)));

        // but with a configuration they could be allowed
        assertTrue(EmailAddressValidator.isValid("\"John Smith\" <john.smith@somewhere.com>",
                EnumSet.of(ALLOW_QUOTED_IDENTIFIERS)));
    }
}
distributev commented 4 years ago

Hello,

Thank you for your effort with this library. This kind of library is really needed in the java space - all the other ways to validate email addresses are semi adhoc and come short in many situations.

While using the library I was a bit confused with the existence / meaning of the EmailAddressCriteria.DEFAULT - I don't know if my points make any sense but I'll try.

  1. My understanding is that if I don't pass any criteria to the API would be simpler and, behind the scene, the API could treat such calls with missing criteria as having the EmailAddressCriteria.DEFAULT (whatever 'default' means to the API, for me it is a bit unclear what DEFAULT means to rfc2822)

  2. this becomes even more obvious here

assertFalse(EmailAddressValidator.isValid("\"John Smith\" <john.smith@somewhere.com>", EnumSet.noneOf(EmailAddressCriteria.class)));

why a parameter like EnumSet.noneOf(EmailAddressCriteria.class) would be required when the absence of any parameter would make the API to be more concise / clear and inside the API implementation can interpreted the absence of any criteria parameters as being equivalent with EnumSet.noneOf(EmailAddressCriteria.class) or EmailAddressCriteria.DEFAULT or however the API needs this to be.

  1. What is the semantic / meaning difference between EnumSet.noneOf(EmailAddressCriteria.class) and EmailAddressCriteria.DEFAULT (I can read the code to see what is included in EmailAddressCriteria.DEFAULT but I am asking about the semantic / meaning difference not about the technical implementation inside the API)

  2. For instance (the same would be true for all validations) I know that " characters are not allowed and I need to verify that this is true in the Junit - I do

assertFalse(EmailAddressValidator.isValid("\"John Smith\" <john.smith@somewhere.com>"));

I don't add any criteria because this is a simple API call to validate an email address and I'm not looking to "allow" anything.

Then I know that I need to make the validation a little bit less restrictive and "allow" for Quoted Identifiers - so it makes sense to add the EnumSet.of(EmailAddressCriteria.ALLOW_QUOTED_IDENTIFIERS to "allow" what needs to be allowed (but it does not make sense to add any other extra criteria which is not semantically related with what I need to allow).

This call makes sense

assertTrue(EmailAddressValidator.isValid("\"John Smith\" <john.smith@somewhere.com>",
                EnumSet.of(EmailAddressCriteria.ALLOW_QUOTED_IDENTIFIERS))); 
  1. What is the meaning of EmailAddressCriteria.DEFAULT in the context of the rfc2822 standard? Is there something like "default" rfc2822 and then there is also a "non-default" version of rfc2822 which is different from the "default" rfc2822?

More specifically, the following calls from the Readme file - How the below validations differ one from the other?

// what type of RFC2822 is being used here?
boolean isValid = EmailAddressValidator.isValid(emailaddress); 

// this is the "default" version of the RFC2822?  (but not "compliant")
boolean isValid = EmailAddressValidator.isValid(emailaddress, EmailAddressCriteria.DEFAULT);

// and this is the "compliant" version of the RFC2822? (but not "default")
boolean isValid = EmailAddressValidator.isValid(emailaddress, EmailAddressCriteria.RFC_COMPLIANT);

Why in allowParensInLocalPart() test the ALLOW_QUOTED_IDENTIFIERS was not added and in all others tests it was added?

I prefer to have the API in the simplest form possible and keep the number of mandatory arguments at minimum. However, if ALLOW_QUOTED_IDENTIFIERS is required in allowDotInaText then it should be required in all other similar places.

bbottema commented 4 years ago

Why in allowParensInLocalPart() test the ALLOW_QUOTED_IDENTIFIERS was not added and in all others tests it was added?

I prefer to have the API in the simplest form possible and keep the number of mandatory arguments at minimum. However, if ALLOW_QUOTED_IDENTIFIERS is required in allowDotInaText then it should be required in all other similar places.

Yep, that was a typo I updated my comment. Unfortunately the test still fails.

Hello,

Thank you for your effort with this library. This kind of library is really needed in the java space - all the other ways to validate email addresses are semi adhoc and come short in many situations.

While using the library I was a bit confused with the existence / meaning of the EmailAddressCriteria.DEFAULT - I don't know if my points make any sense but I'll try.

(..)

Indeed, this has always been the weakest part of this library. I have struggled in the past on how to deal with and document this.

Pure COMPLIANT is there more for formality, because often people are not interested in fully compliance tests. DEFAULT is there because that's what people generally use most (anecdotal). The idea here is that when you use the library with as little as input possible, it defaults to what most people would find useful, which is the DEFAULT collection of criteria. If the library would assume that not passing any criteria explicitly means don't allow the other options, then that's an assumption on the library part and one that results in the most restrictive behavior, which is often not what the users want... or so we think.

distributev commented 4 years ago

Does it make sense to have the following two API calls to have identical behavior?

boolean isValid = EmailAddressValidator.isValid(emailaddress); 

boolean isValid = EmailAddressValidator.isValid(emailaddress, EmailAddressCriteria.DEFAULT);

meaning that, if no criteria is provided explicitly then the EmailAddressCriteria.DEFAULT will be assumed by the library.

If this is already happening then you should be good to go because from the jUnits tests I learned that all the different "ALLOW" options are disallowed when doing the simple

boolean isValid = EmailAddressValidator.isValid(emailaddress); 

API call without any criteria. Having all fail with the simplest API call is a good start because after that every time an individual ALLOW criteria is passed the corresponding email address instances should be allowed. for instance

assertTrue(EmailAddressValidator.isValid("Kayaks.org <kayaks@kayaks.org>",
                                EnumSet.of(EmailAddressCriteria.ALLOW_DOT_IN_A_TEXT)));

should do what is says, specifically EmailAddressCriteria.ALLOW_DOT_IN_A_TEXT

Does that make any sense?

bbottema commented 4 years ago

If this is already happening then you should be good to go

This is the case.

assertTrue(EmailAddressValidator.isValid("Kayaks.org <kayaks@kayaks.org>",
                                EnumSet.of(EmailAddressCriteria.ALLOW_DOT_IN_A_TEXT)));

should do what is says, specifically EmailAddressCriteria.ALLOW_DOT_IN_A_TEXT

Does that make any sense?

I guess this can be accomplished by making ALLOW_DOT_IN_A_TEXT imply (or extend) ALLOW_QUOTED_IDENTIFIERS, so that the library automatically adds the latter. Without the latter, text is not allowed at all. That might be a design flaw, but changing that is probably too difficult with the current regex setup. It's called Dragons for a reason :)

@chconnor, what's your take on this?

distributev commented 4 years ago

I believe it might be good to add ALLOW_QUOTED_IDENTIFIERS internally implied everywhere inside the API implementation since seeing email addresses with quoted identifiers in real life is something common.

After that it might also make sense to remove ALLOW_QUOTED_IDENTIFIERS references altogether from the docs so that users will not get confused (what sense it makes to allow something which is already allowed? but how can I disallow quoted identifiers? and all sorts of questions which might be triggered when reading about ALLOW_QUOTED_IDENTIFIERS criteria in the docs).

Could ALLOW_QUOTED_IDENTIFIERS be added to the existing EmailAddressCriteria.DEFAULT?

bbottema commented 4 years ago

Btw, it seems in the repo I already fixed the brackets problem, but it was never released. The test passes in the repo itself, so that update is coming...

The only test not passing then is the parens one and I have no clue how that regex is supposed to work:

String localPartqtext = format("[%s%s", noWsCtl,
            criteria.contains(EmailAddressCriteria.ALLOW_PARENS_IN_LOCALPART) ? "!#-\\[\\]-~]" : "!#-'\\*-\\[\\]-~]");

image

How is the parens criteria related to apostrophe and * and -?

chconnor commented 4 years ago

Re: regex -- that regex just excludes parens from the range of characters in the second form. (To answer your question directly, apostrophe and * surround the two parens in the ASCII table, and - is used for the range.) The only difference I see from my original code is that I escaped everything (unnecessarily, I assume, since you removed some of that and I tend to over-escape :-) ) and that I had noWsCtl + regex and you have it regex + noWsCtl, but that shouldn't matter... Assuming all the escaping is working as expected with the format() style string construction, I see no relevant difference between my original code and the latest code, so the bug either predates your version (i.e. I messed up somehow) or lies elsewhere in the code?

(Relevant RFC section.)

Re: semantics -- pesonally I'm OK with it as-is. I didn't parse all the comments above, but: part of the confusion seems to be the assumption that if you pass a criteria object to the method that you are extending the defaults as opposed to overriding them, which seems a questionable assumption to me under this paradigm. Perhaps the documentation for the two-parameter isValid call can have some language added: "By passing a criteria in to this method, you are overriding the criteria entirely, not extending it. E.g. you may want to include both ALLOW_QUOTED_IDENTIFIERS and ALLOW_PARENS_IN_LOCALPART in addition to whatever additional criteria you desire."

To make it more explicit, it could be required that a criteria object be created and passed in, which would by default include those two criteria unless otherwise modified, etc, but it seems a little overkill to me.

Alternately we could rename the "DEFAULT" criteria to something that doesn't imply that those settings are passively active. Like "RECOMMENDED" or something along those lines?

Could ALLOW_QUOTED_IDENTIFIERS be added to the existing EmailAddressCriteria.DEFAULT?

It is there already:

public static final EnumSet<EmailAddressCriteria> DEFAULT = of(ALLOW_QUOTED_IDENTIFIERS, ALLOW_PARENS_IN_LOCALPART);

bbottema commented 4 years ago

Alternately we could rename the "DEFAULT" criteria to something that doesn't imply that those settings are passively active. Like "RECOMMENDED" or something along those lines?

That makes sense to me.

bbottema commented 4 years ago

For now I've released 2.2.0 with the following fixes:

  1. Brackets should work now (see added junit test in the project)
  2. I've added Casey's clarification on use the Criteria parameter
  3. I've renamed DEFAULT to RECOMMENDED
  4. I've fixed all javadoc warnings as well

And I've condensed this comment thread a little bit as you might have noticed.

@chconnor, will you be able to look into the parens issue? Or do you have some lead for me to pursue.

chconnor commented 4 years ago

If you're still using: assertFalse(EmailAddressValidator.isValid("\"bob(hi)smith\"@test.com")); ...then that should actually be assertTrue... So, just to confirm what is still failing:

EmailAddressValidator.isValid("\"bob(hi)smith\"@test.com"); ...should be true, because it's using default criteria, which (properly) allows parens in localpart.

EmailAddressValidator.isValid("\"bob(hi)smith\"@test.com",
                EnumSet.of(ALLOW_PARENS_IN_LOCALPART, ALLOW_QUOTED_IDENTIFIERS)));

...should also be true, and:

EmailAddressValidator.isValid("\"bob(hi)smith\"@test.com",
                EnumSet.of(ALLOW_QUOTED_IDENTIFIERS)));

...should be false.

Specifically where does it go wrong, if that's not what you're seeing?

bbottema commented 4 years ago

Oh, my bad, I didn't read it properly. The test is actually green now.

Well that's that, then.