Open RohanNagar opened 3 years ago
Thanks Rohan -- first, I've been out of the email world for a long time, so I'm probably stuck in 2822 and haven't paid attention to the new RFCs, so apologies in advance if any of this is wrong:
Nice to see someone taking email validation seriously! Your results do seem to show a number of problems with our library. I'm not involved in maintaining it these days, so maybe @bbottema will have time to take a look at some of it.
Some things I noticed or wondered about in the list of demo addresses:
Did you implement with a lexer?
Thanks for reaching out.
@chconnor Thanks for taking a look! I'll try to answer each of these:
I don't know why "Joe Smith" is validated by our library. That surprised me.
That surprised me as well, seems like it should have been caught.
Your library (and others) failed 1234567890123456789012345678901234567890123456789012345678901234+x@example.com -- what am I missing? I would have thought that valid.
Not sure if this was defined in RFC 2822, but in RFC 5321, the local part cannot be more than 64 octets (this email is 66).
it does look like our library doesn't parse domains very well (too permissive). It should be noted that the library wasn't originally developed to be a strict 2822 enforcer, but rather a way to clean up and extract usable addresses... it can serve both roles, but I think we're seeing the weaknesses here.
Definitely makes sense, and strictly enforcing any of these email address RFCs is hard!
I'm unclear why your library failed email@[3.256.255.23] -- AFAICT the "dtext" of 2822 allows almost anything inside the braces? Same question for many such addresses, such as first.last@[IPv6::], but maybe there are new RFCs that enforce the contents of the domain-literal? Our library will validate x@[anything in here], which explains a lot of the discrepancies
This may be true. My library failed it because the IPv4 address was invalid, specified in RFC 5321 here. However, I wasn't able to find anything explicitly denying or allowing anything within brackets in the domain. I'll have to dig into that further.
x@x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456 looks valid to me, but your library failed it?
I believe this is also because of RFC 5321, restricting the size of the domain to 255 octets.
test@123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345 678901234567890123456789012345678901234567890123456789012.com also looks valid to me? Is there a hostname size limit? I notice a couple others that you fail (that we don't) that have long TLDs or SLDs -- is that RFC?
Yes, again with the octet limit in RFC 5321 each domain label can only be 63 octets.
""@test.org -- looking at quoted-string it does seem like qcontent is allowed to be empty?
Thanks, after looking into it I think you're actually right and I should fix this.
We don't implement the obs- tokens, which I think explains why very.unusual."@".unusual.com@example.com and much."more\ unusual"@example.com are considered valid by your library but not ours, as well as many others in the list... I believe that addr-spec allows either dot-atom or quoted-string, which don't support those addresses, but apparently obs-local-part does via the "word" token. Kudos for going to distance on the obs- tokens. :-)
That makes sense as to the difference there. Thanks!
there are several addresses that have spaces in the right-hand side... IIRC there were some decisions made to be non-2822 compliant on a couple of points regarding whitespace in the domain; some info here. One of those "2822 allows it but nobody wants it" situations, so I think your library is more correct in that way.
Got it! I agree that nobody should want that :)
your library validates cal(woo(yay)hoopla)@iamcal.com (and similar addresses) and ours fails; I think it is in fact invalid? () not allowed unless quoted, no?
Comments should be allowed outside of quoted strings. See RFC 2822: https://datatracker.ietf.org/doc/html/rfc2822#section-3.2.3
Excerpt:
White space characters, including white space used in folding
(described in section 2.2.3), may appear between many elements in
header field bodies. Also, strings of characters that are treated as
comments may be included in structured field bodies as characters
enclosed in parentheses. The following defines the folding white
space (FWS) and comment constructs.
Strings of characters enclosed in parentheses are considered comments
so long as they do not appear within a "quoted-string", as defined in
section 3.2.5. Comments may nest.
In RFC 5322 there are clear examples of comments allowed outside of quotes.
There seem to be a lot of cases where non-ascii is present in the left-hand side... at least in my days I believe that was not allowed unless they were encoded (RFC-2047 last time I was involved; I'm sure it's different now). Perhaps that explains the differences between libraries in those cases.
Yes that definitely explains the difference in those cases. With internationalized email addresses, non-ascii characters can be allowed. See RFC 6530
he last seven examples on the list all look like a bug in our library (apparently something wrong with quoted-pair not getting parsed correctly?). I glanced through our Dragons.java and didn't see anything obviously wrong, but again, it's been a long time and I don't unfortunately have time to really debug it.
👍🏽
Did you implement with a lexer?
Yes, I did! Regex for email address validation seemed really ugly and difficult to understand for me, so I wanted to give it a shot.
I essentially iterate through the email address once, invalidating if I encounter something invalid, and building a meaningful Email
object that allows users of the library to easily access things like localPart()
, domain()
, topLevelDomain()
, etc.
Also, going through this made me realize I want to add descriptions for each of these demo addresses on the website so it's easier to figure out why something should be valid/invalid 🙂
@chconnor Thanks for taking a look! I'll try to answer each of these: Not sure if this was defined in RFC 2822, but in RFC 5321, the local part cannot be more than 64 octets (this email is 66).
Ah yeah, that looks to be a new (and welcome) restriction with the later RFCs. AFAIK 2822 only restricts line length (e.g. for email headers) so you could argue that, in practice, email addresses had to be under <~990 characters, but we didn't implement that, AFAIK.
your library validates cal(woo(yay)hoopla)@iamcal.com (and similar addresses) and ours fails; I think it is in fact invalid? () not allowed unless quoted, no?
Comments should be allowed outside of quoted strings. See RFC 2822: https://datatracker.ietf.org/doc/html/rfc2822#section-3.2.3
(EDIT: see following comment first.)
I don't think they are allowed in local-part, though, even accounting for the obs-local-part? I.e. I think you have to do "blah(comment)blah"@blah.com. You can do (comment)blah@blah.com or (comment)blah(comment)@blah.com, but I think if local-part itself has parens in it it has to be quoted or escaped?
In RFC 5322 there are clear examples of comments allowed outside of quotes.
(...just noting that all those examples have comments outside the local-part.)
Did you implement with a lexer?
Yes, I did! Regex for email address validation seemed really ugly and difficult to understand for me, so I wanted to give it a shot.
Yeah the regex is definitely a grind. +1 for a lexer. Glad someone finally did it. :-)
Also, going through this made me realize I want to add descriptions for each of these demo addresses on the website so it's easier to figure out why something should be valid/invalid slightly_smiling_face
Yeah, given that 99.999999% of people have no earthly idea the crazy stuff that is allowed by email specs I think documentation is always a good idea. :-)
Oops, just noticing that your demo emails don't have comments in the local-part! Sorry. I think the issue is that we can't handle nested comments because of regex recursing issues... lemme double check that...
@bbottema Could this line be why "John Smith" matched? My java regex chops are long gone, but does that line allow for matching mailboxName with an empty uniqueAddrSpec? I can't otherwise see how "John Smith" would validate as an address, since afaict we always require (correctly) at least the @ to be present.
I'd have to do some debugging to verify that. Maybe I'll get some time to spend it this week.
Sorry, I actually just took a further look on "John Smith" and it looks like a problem with how my website displays results.
The actual address being tested is Joe Smith <email@example.com>
, which should actually be valid (quoted identifiers). I'll need to fix that in my library.
Sorry for the false alarm on that one!
Oh, good. I was wondering why javax was validating it also. :-)
Looks like you have a great library developing. There have been a few alternative libraries developed over the years but this is the first one that really seems to go the distance on the RFCs and which I would consider using instead of our own. Nice work.
@bbottema we might need to finally remove the "The world's only more-or-less-2822-compliant Java-based email address extractor / verifier" tagline. :-) And maybe we could add a "see also" link or something.
I would even considering archiving this project, as a lexer is a far better way of doing things (including documentation and debugging). By a landslide. Also this particular library not only seems to be correcter compared to our own validations, but more up to date with newer RFC's as well. I'd say jmail supersedes this library on all accounts.
I would like to do a performance comparison though, but I suspect a lexer to be a lot faster. I think the artifact footprint is about the same, both having no 3rd party dependencies.
Thank you both for the kind words!
In terms of performance, JMail is on average almost 3x faster than email-rfc2822
. You can scroll down to the bottom of https://www.rohannagar.com/jmail/ to see the average times.
Implementation | Average Time |
---|---|
JMail | 9201.171867088608 ns |
Apache Commons | 19448.029620253164 ns |
Javax Mail | 19917.172215189872 ns |
email-rfc2822 | 29966.67430379747 ns |
To calculate those, I run validate on each email address 100 times, averaging all of the times together.
// test 100 times to get an avg
int repetitions = 100;
List<Long> times = new ArrayList<>(repetitions);
for (int i = 0; i < repetitions; i++) {
long start = System.nanoTime();
predicate.test(email); // this runs the validate method
long end = System.nanoTime();
times.add(end - start);
}
double avg = times.stream().mapToDouble(n -> n).average().orElse(0.0);
Quick update that I published version 1.2.0 of JMail which fixes issues that came up during this discussion - emails with quoted identifiers are now correctly handled, as well as empty quoted strings in the local-part.
Hello! I wrote my own Java email address validation library, JMail, and I wanted to compare it to other implementations (including this one) to see how it stood up.
I tested a wide variety of email addresses using this library with the following line of code:
During this comparison I found some correctness issues with this library, as you can see on the table at this website: https://www.rohannagar.com/jmail/
For example,
Joe Smith
is considered valid with this library. Similarly, domain parts that start or end with the-
character are considered valid when they should not be.I wanted to bring this to your attention so you could potentially make any fixes or improvements, or maybe you can point out something wrong in my interpretation of the RFCs 🙂