HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License
64 stars, 23 forks

Domains should not be able to end with a period #65

Status: Open. troyhunt opened this issue 9 months ago

troyhunt commented 9 months ago

I just added a failing test for this (DomainEndingInPeriodIsInvalid) after finding some junk that had made its way through.

KonajuGames commented 9 months ago

A trailing period in a domain is valid. It represents the root hierarchy, and is usually assumed and not printed.

troyhunt commented 9 months ago

Is that the case for either a publicly addressable domain accessed over HTTP or the domain part of an email address? Happy to be proven wrong, but I don't think there's a single website address that has a period immediately after the TLD, is there?

KonajuGames commented 9 months ago

It is one of those things that may have been required in the early days of the internet when subdomains were used without the TLD being specified, but has become so common to use the full domain and omit the root hierarchy that it is now the default. It may be a better approach to remove it if present before reporting.

troyhunt commented 9 months ago

Yeah, thought that might be the case. The decision we keep running into with this project is strict adherence to RFCs versus filtering out junk. If I were to look at the occurrences of periods at the end of domains in everything from domain searches to breach parsing to people signing up for notifications, I bet 100% of them would be parsing or user input errors rather than compliance with an RFC edge case! I'm also of the view that if you could legitimately put a period at the end of, say, an email address, you'd be blocked in so many places you'd quickly stop doing that!

KonajuGames commented 9 months ago

That is a valid view, and related to the other issues where emails may start with { or /, but in real life the users impacted by a data breach will never have that. I would be in favour of stripping them per how we use them on the internet today rather than strict RFC compliance.
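As a rough illustration of the "strip it before reporting" approach, here is a minimal Python sketch. It is not the project's code (the extractor itself is written in C#), and the function name `strip_root_dot` is hypothetical. It assumes that at most one trailing period, the DNS root label, should be removed from the domain part.

```python
def strip_root_dot(address: str) -> str:
    """Drop a single trailing period from the domain part of an address.

    Hypothetical helper for illustration only; not part of the project.
    """
    # Split on the last "@" so the local part is left untouched.
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" present: not an address, so return it unchanged.
        return address
    if domain.endswith("."):
        # Remove only one period (the root label); further periods stay.
        domain = domain[:-1]
    return f"{local}@{domain}"


if __name__ == "__main__":
    print(strip_root_dot("user@123.com."))
```

Stripping only one period is a design choice in this sketch: a single trailing period has a defined meaning in DNS, whereas several in a row are more likely junk that a separate validity check should reject.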

Johno-ACSLive commented 2 days ago

@troyhunt is this still an issue? I just created a dummy file with a period at the end - user@123.com. - and the extractor removed the period as it wasn't present in the output file.
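One possible explanation for this kind of result, sketched in Python under loud assumptions: if an extraction pattern requires the domain to end in a label rather than a period, a trailing period is simply left out of the match. The pattern below is a deliberately simplified, hypothetical expression and not the one the project uses.

```python
import re

# Hypothetical, simplified pattern for illustration only.
# The domain is one or more "label." groups followed by a final label,
# so a match can never end in a period.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9-]+")


def extract(text: str) -> list[str]:
    """Return every substring of text that matches the simplified pattern."""
    return EMAIL.findall(text)


if __name__ == "__main__":
    print(extract("Contact user@123.com. Thanks"))
```

Under this assumption, extracting from free text and validating a candidate address are different operations: the first silently drops the period, while a test that expects the address to be rejected as invalid could still fail.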

troyhunt commented 10 hours ago

Yep, still an issue. In fact, this is our only failing test right now: https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/7bf5312f5600cda5e4c746c6ece4c4168fbbbef3/test/LegacyTests.cs#L168

Johno-ACSLive commented 10 hours ago

Interesting, OK I'll investigate further.