Removed IDNA-incompatible chars

elceef / dnstwist

Domain name permutation engine for detecting homograph phishing attacks, typo squatting, and brand impersonation

https://dnstwist.it

Apache License 2.0


Removed IDNA-incompatible chars #87

Closed: thisismyrobot closed this 4 years ago

thisismyrobot commented 4 years ago

Hey @elceef, I've been doing an uplift to make dnstwister 100% IDNA 2008 compatible, and I wanted to share back some changes based on that, plus some new characters I've found while working on the service.

My additional chars are at the end of each line to help with the diff and I removed 5 chars that wouldn't encode with the idna pip package: 'Ƅ', 'Þ', 'ṫ', 'ṭ' and 'ẋ'.

This is a snippet of the Python 3 code I ran to check them all and to identify the five that needed removal:

import idna

glyphs = {
'a': [u'à', u'á', u'â', u'ã', u'ä', u'å', u'ɑ', u'ạ', u'ǎ', u'ă', u'ȧ', u'ą', u'а', u'ӓ'],
...
'z': [u'ʐ', u'ż', u'ź', u'ᴢ', u'ƶ', u'ẓ', u'ẕ', u'ⱬ']
}

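# Encode each candidate as a domain label; idna.encode() raises
# idna.IDNAError for anything IDNA 2008 does not allow.
for k, chars in glyphs.items():
    for i, c in enumerate(chars):
        try:
            idna.encode(f'{c}.com')
        except idna.IDNAError:
            print(f'In glyphs for {k}, index {i} ({c} / {c.encode()}) is not IDNA2008 compatible.')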

elceef commented 4 years ago

Hi Robert, I've been working on something similar: expanding and verifying the set of Unicode homoglyphs. I hadn't considered IDNA 2008 compatibility, though, which is stricter than IDNA 2003, which dnstwist currently uses. Regarding your change, the characters you removed are in fact IDNA2008-incompatible, and this is something I'd like to improve.

However, the new characters you want to add to the homoglyph set come from non-Latin Unicode scripts such as Cyrillic. Unfortunately, although they are perfectly IDNA2008-compatible, in practice a name that mixes Unicode scripts like Latin and Cyrillic won't be registered. Keep in mind that many domain registrars lack proper validation and will report such domains as available, but the TLD authority will reject them anyway during the registration process.

As mentioned at the beginning, I've been working on expanding the list of homoglyphs, so in the near future it will be expanded with some Latin ones. To sum up, I like the idea of filtering out IDNA2008-incompatible characters, but adding new ones from non-Latin Unicode scripts is impractical.
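For illustration, here is a minimal sketch of how a mixed-script name could be flagged before it is offered as a candidate. It is not code from dnstwist or dnstwister. The helper names `scripts_in_label` and `is_mixed_script` are hypothetical, and because Python's standard library does not expose the Unicode Script property, the sketch approximates it with the first word of each character's Unicode name, which is only a heuristic.

```python
import unicodedata


def scripts_in_label(label):
    """Approximate the set of scripts used by the letters in a label.

    Heuristic (an assumption of this sketch): the first word of a letter's
    Unicode name, e.g. 'LATIN' or 'CYRILLIC', is treated as its script.
    """
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # digits and hyphens are shared across scripts
        name = unicodedata.name(ch, '')
        if name:
            scripts.add(name.split(' ')[0])
    return scripts


def is_mixed_script(label):
    """Return True if the label's letters come from more than one script."""
    return len(scripts_in_label(label)) > 1


# 'p\u0430ypal' replaces the Latin 'a' with CYRILLIC SMALL LETTER A (U+0430).
for candidate in ['paypal', 'p\u0430ypal']:
    print(candidate, sorted(scripts_in_label(candidate)), is_mixed_script(candidate))
```

A check like this says nothing about what a particular TLD authority actually accepts; it only separates single-script candidates from ones that mix, for example, Latin and Cyrillic letters.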

thisismyrobot commented 4 years ago

Thanks Marcin, that's great context about where you're going with things. I'm more than happy to rework this PR to be just about the removal of the non-IDNA2008 characters. I'd also missed the subtleties around mixing scripts; I'd totally assumed the idna module would throw if a domain was unregistrable.

elceef commented 4 years ago

Thanks for the updated PR. How did you validate the homoglyph dictionary against IDNA 2008? I've used idna.encode() and only three characters are invalid: ẋ Ƅ Þ

elceef commented 4 years ago

Removed the three IDNA2008-incompatible characters mentioned earlier.

thisismyrobot commented 4 years ago

@elceef sorry for the delay, I couldn't remember why they were added initially - 100% my own fault for doing too many different changes in the initial PR. The 't' changes were due to the last 2 being duplicates and the 'b' changes were because the last two chars were the same but with different casing. Totally up to you as to whether that's what you want in there or not :)
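The two kinds of redundancy described above (exact duplicates, and entries that differ only by case) can be detected mechanically. The following is a rough sketch under stated assumptions: `find_duplicates` and `find_case_variants` are hypothetical helper names, and the `sample` dictionary is made-up illustrative data, not the actual dnstwist glyph table.

```python
from collections import Counter


def find_duplicates(glyphs):
    """Return {letter: [characters listed more than once for that letter]}."""
    report = {}
    for letter, chars in glyphs.items():
        counts = Counter(chars)
        dupes = [c for c in counts if counts[c] > 1]
        if dupes:
            report[letter] = dupes
    return report


def find_case_variants(glyphs):
    """Return {letter: [(char, lowercased char)]} where both forms are listed."""
    report = {}
    for letter, chars in glyphs.items():
        present = set(chars)
        pairs = sorted(
            (c, c.lower()) for c in present
            if c != c.lower() and c.lower() in present
        )
        if pairs:
            report[letter] = pairs
    return report


# Hypothetical sample data: 'b' lists U+0185 and its capital form U+0184,
# and 't' lists U+1E6B twice.
sample = {
    'b': ['d', '\u0185', '\u0184'],
    't': ['\u0163', '\u1e6b', '\u1e6b'],
}

print(find_duplicates(sample))
print(find_case_variants(sample))
```

Since domain names are case-insensitive, a case-variant pair contributes only one distinct candidate, which is why it is worth reporting separately from exact duplicates.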

elceef commented 4 years ago

Thank you for spotting the duplicates. This is a valuable contribution. The different-case characters have already been fixed.