JoshData / python-email-validator

A robust email syntax and deliverability validation library for Python.
The Unlicense

Add TLD validation #111

Closed danielendresz closed 12 months ago

danielendresz commented 1 year ago

I have a list of old email addresses which I would like to validate. However, since many of the domains used for those addresses are no longer in use and no longer have DNS records, I get an error if I keep check_deliverability set to True.

However, after I disabled the deliverability check, a lot of addresses passed validation that could never have been valid. The only problem with them is the top-level domain.

>>> import email_validator
>>> email = "office@github.random"
>>> email_validator.validate_email(email, check_deliverability=False)
<ValidatedEmail office@github.random>

.random has never been an official top-level domain recognized by ICANN, so I should never have been able to send an email to that address. However, since I can create my own mail server in my internal network with a custom top-level domain, this check should not be mandatory.

So, in addition to check_deliverability, a check_tld option would be great, set to True by default. As a source, I think this file from IANA (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) would work well. However, I am not 100% sure whether every TLD that has ever been active is present in it.
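To make the idea concrete, here is a rough sketch of how such a check could work against that file. The names `parse_iana_tlds`, `fetch_iana_tlds` and `has_listed_tld` are hypothetical and not part of email_validator. The sketch assumes the file consists of `#` comment lines followed by one upper-case TLD per line, with internationalized TLDs in their `XN--` ASCII form, so the domain should be passed in its ASCII form as well.

```python
import urllib.request

IANA_TLD_URL = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"


def parse_iana_tlds(text):
    """Return the TLDs found in `text` as a set of lowercase strings.

    Blank lines and lines starting with '#' are skipped.
    """
    tlds = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tlds.add(line.lower())
    return tlds


def fetch_iana_tlds(url=IANA_TLD_URL):
    """Download and parse the list. Needs network access; nothing is cached here."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_iana_tlds(response.read().decode("ascii"))


def has_listed_tld(domain, tlds):
    """True if the last label of `domain` appears in `tlds`."""
    last_label = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return last_label in tlds


# Made-up sample in the assumed file format, so the sketch runs offline.
sample_tlds = parse_iana_tlds("# sample header line\nCOM\nORG\n")
print(has_listed_tld("github.com", sample_tlds))
print(has_listed_tld("github.random", sample_tlds))
```

In real use the set would come from `fetch_iana_tlds()` once and be reused, rather than downloaded per address.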

JoshData commented 1 year ago

Hi.

The list of available TLDs changes regularly, so it would be a challenge to ensure the library has an up-to-date list on each call.

danielendresz commented 1 year ago

One option could be to perform an NS DNS query and check whether there are name servers associated with that particular TLD. This would always be up to date, but wouldn't work for revoked TLDs.
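A minimal sketch of that NS lookup, assuming the third-party dnspython package (version 2.0 or later, for `dns.resolver.resolve`) is installed. The helper names `tld_of` and `tld_has_nameservers` are hypothetical. Timeouts and other resolver errors are deliberately left to propagate, since they say nothing about whether the TLD exists.

```python
def tld_of(domain):
    """Return the last label of a domain name, lowercased."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def tld_has_nameservers(tld, timeout=5.0):
    """True if the DNS returns NS records for `tld`.

    Requires dnspython; imported lazily so the helper above works without it.
    """
    import dns.resolver

    try:
        # The trailing dot makes the name fully qualified.
        dns.resolver.resolve(tld + ".", "NS", lifetime=timeout)
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False
    return True


# Usage (performs a real DNS query, so it is left commented out):
# tld_has_nameservers(tld_of("github.random"))
```

Since many addresses share a TLD, caching the result per TLD would keep the number of queries small for a bulk run.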

Another option could be to save the TLDs from https://www.iana.org/domains/root/db into a list shipped with your package, which could then be used to check even revoked TLDs. If I understand it correctly, that list would only need to be updated when a new TLD gets added. The last time a new TLD was added was in April 2022; see https://www.iana.org/reports and search for "Delegation", or see https://newgtlds.icann.org/en/program-status/delegated-strings.

JoshData commented 1 year ago

An NS check is an interesting idea that could speed up bulk validations (instead of full DNS checks), but I'm not sure how helpful it really is, since it will miss a lot of invalid domains below the TLD level. And as you said, it doesn't solve the original problem. It's hard for me to see a meaningful use case for it.

The second option makes sense, but keeping a file up to date just doesn't sound fun for me as a maintainer, and it would be complex to keep 100% accurate all the time.

I'd be more open to adding an option for the caller to supply a list of valid TLDs and a utility function for retrieving the current list.

I know this isn't very helpful but it might make sense to just do a TLD check on your side. Just lowercase the address and check the ending.
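Something along these lines, as a rough sketch only. `split_by_tld` is a hypothetical helper, not something in the library, and the allowed set below is purely illustrative; the caller would supply whatever list they trust.

```python
def split_by_tld(addresses, allowed_tlds):
    """Split addresses into (kept, rejected) by their final dot-separated label.

    `allowed_tlds` may be given in any case, with or without a leading dot.
    """
    allowed = {tld.lower().lstrip(".") for tld in allowed_tlds}
    kept, rejected = [], []
    for address in addresses:
        # Lowercase the address and look at whatever follows the last dot.
        ending = address.strip().lower().rsplit(".", 1)[-1]
        if ending in allowed:
            kept.append(address)
        else:
            rejected.append(address)
    return kept, rejected


kept, rejected = split_by_tld(
    ["office@github.com", "office@github.random"],
    {"com", "org", "net"},  # illustrative, caller-supplied
)
print(kept)
print(rejected)
```

An address with no dot in the domain (for example one at localhost) ends up in the rejected list, which may or may not be what you want for internal mail servers.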

nhairs-lumin commented 12 months ago

> The list of available TLDs changes regularly, so it would be a challenge to ensure the library has an up-to-date list on each call.

Consider using tldextract which uses Mozilla's Public Suffix List. Although it does require network access on the first call (at least if you haven't manually cached the list), it does avoid the need for DNS lookups on every request.

JoshData commented 12 months ago

That's a great idea. Since it would be easy to do outside of this library, I'm inclined to not try to put this into this library. The main validate_email method returns an object that holds the domain portion of the email address, so it could be passed to tldextract easily enough.
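Roughly like this, as a sketch rather than anything this library ships. It assumes both email_validator and the third-party tldextract package are installed, and relies on tldextract returning an empty `suffix` when the domain does not end in anything on the Public Suffix List. The names `has_public_suffix` and `check_address` are hypothetical, and the optional `extract` parameter exists only so the logic can be exercised without network access.

```python
def has_public_suffix(domain, extract=None):
    """True if `domain` ends in a suffix that the extractor recognizes.

    By default this uses tldextract.extract, which may fetch the Public
    Suffix List on first use.
    """
    if extract is None:
        import tldextract  # third-party, imported lazily

        extract = tldextract.extract
    return bool(extract(domain).suffix)


def check_address(address):
    """Run the syntax check first, then look up the domain's suffix."""
    from email_validator import validate_email

    validated = validate_email(address, check_deliverability=False)
    # The returned object holds the domain part of the address.
    return has_public_suffix(validated.domain)


# Usage (needs both packages, and network access on the first call):
# check_address("office@github.random")
```

`validate_email` still raises on addresses with invalid syntax, so `check_address` only returns a boolean for addresses that pass the syntax check.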

nhairs-lumin commented 11 months ago

It feels like it should be implemented as part of the globally_deliverable argument (aside: this argument is missing from the README) - link to source.

JoshData commented 11 months ago

I get what you're saying but I don't think it's a good fit for a syntax check.