Closed troyhunt closed 1 month ago
A trailing period in a domain is valid. It represents the root hierarchy, and is usually assumed and not printed.
Is that the case for either a publicly addressable domain accessed over HTTP or the domain part of an email address? Happy to be proven wrong, but I don't think there's a single website address that has a period immediately after the TLD, is there?
It is one of those things that may have been required in the early days of the internet when subdomains were used without the TLD being specified, but has become so common to use the full domain and omit the root hierarchy that it is now the default. It may be a better approach to remove it if present before reporting.
Yeah, thought that might be the case. The decision we keep running into with this project is strict adherence to RFCs versus filtering out junk. If I was to look at the occurrences of periods at the end of domains in everything from domain searches to breach parsing to people signing up to notifications, I bet 100% of them will be parsing or user input errors rather than compliance with an RFC edge case! I'm also of the view that if you could legitimately put a period at the end of say, an address, you'd be blocked in so many places you'd quickly stop doing that!
That is a valid view, and related to the other issues where emails may start with { or /, but in real life the users impacted by a data breach will never have that. I would be in favour of stripping them per how we use them on the internet today rather than strict RFC compliance.
@troyhunt is this still an issue? I just created a dummy file with a period at the end - user@123.com. - and the extractor removed the period as it wasn't present in the output file.
Yep, still an issue. In fact, this is our only failing test right now: https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/7bf5312f5600cda5e4c746c6ece4c4168fbbbef3/test/LegacyTests.cs#L168
Interesting, OK I'll investigate further.
The extractor was already stripping the trailing period and if the other checks were passed we are left with presumably a valid email address. I would also be in favour of this behaviour in case a system exported email addresses with a trailing period (is it likely that an export would append or maybe even store an address with a trailing period? - I assume for the majority this would not be the case).
Have there been examples where the other checks allow invalid email addresses through when there is a trailing period?
I've created a PR which invalidates email addresses with a trailing period.
I think this works perfectly for now, I can't see any other places it would cause issues, certainly all the tests pass successfully. Thanks!
I just added a failing test for this (DomainEndingInPeriodIsInvalid) after finding some junk that had made its way through.