HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License
68 stars 23 forks source link

Domains should not be able to end with a period #65

Closed troyhunt closed 1 month ago

troyhunt commented 1 year ago

I just added a failing test for this (DomainEndingInPeriodIsInvalid) after finding some junk that had made its way through.

KonajuGames commented 1 year ago

A trailing period in a domain is valid. It represents the root hierarchy, and is usually assumed and not printed.

troyhunt commented 1 year ago

Is that the case for either a publicly addressable domain accessed over HTTP or the domain part of an email address? Happy to be proven wrong, but I don't think there's a single website address that has a period immediately after the TLD, is there?

KonajuGames commented 1 year ago

It is one of those things that may have been required in the early days of the internet when subdomains were used without the TLD being specified, but has become so common to use the full domain and omit the root hierarchy that it is now the default. It may be a better approach to remove it if present before reporting.

troyhunt commented 1 year ago

Yeah, thought that might be the case. The decision we keep running into with this project is strict adherence to RFCs versus filtering out junk. If I was to look at the occurrences of periods at the end of domains in everything from domain searches to breach parsing to people signing up to notifications, I bet 100% of them will be parsing or user input errors rather than compliance with an RFC edge case! I'm also of the view that if you could legitimately put a period at the end of say, an address, you'd be blocked in so many places you'd quickly stop doing that!

KonajuGames commented 1 year ago

That is a valid view, and related to the other issues where emails may start with { or /, but in real life the users impacted by a data breach will never have that. I would be in favour of stripping them per how we use them on the internet today rather than strict RFC compliance.

Johno-ACSLive commented 2 months ago

@troyhunt is this still an issue? I just created a dummy file with a period at the end - user@123.com. - and the extractor removed the period as it wasn't present in the output file.

troyhunt commented 2 months ago

Yep, still an issue. In fact, this is our only failing test right now: https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/7bf5312f5600cda5e4c746c6ece4c4168fbbbef3/test/LegacyTests.cs#L168

Johno-ACSLive commented 2 months ago

Interesting, OK I'll investigate further.

Johno-ACSLive commented 2 months ago

The extractor was already stripping the trailing period and if the other checks were passed we are left with presumably a valid email address. I would also be in favour of this behaviour in case a system exported email addresses with a trailing period (is it likely that an export would append or maybe even store an address with a trailing period? - I assume for the majority this would not be the case).

Have there been examples where the other checks allow invalid email addresses through when there is a trailing period?

I've created a PR which invalidates email addresses with a trailing period.

troyhunt commented 1 month ago

I think this works perfectly for now, I can't see any other places it would cause issues, certainly all the tests pass successfully. Thanks!