elceef / dnstwist

Domain name permutation engine for detecting homograph phishing attacks, typo squatting, and brand impersonation
https://dnstwist.it
Apache License 2.0
4.81k stars 764 forks

Using homoglyph module for string proximity? #88

darcosion closed 4 years ago

darcosion commented 4 years ago

I've briefly used dnstwist and I've seen the basic homoglyph transpositions here: https://github.com/elceef/dnstwist/blob/master/dnstwist.py#L304

I know of another project named homoglyphs: https://github.com/life4/homoglyphs

Would it be interesting to use the homoglyphs module instead of the static homoglyph transposition?

Regards, Darcosion.

elceef commented 4 years ago

The static homoglyph mapping has been carefully verified to be IDNA compatible and, most importantly, registerable, as we're dealing with domain names. There are policies governing domain name registration (each TLD authority has slightly different ones) and you can't just pick and use any look-alike Unicode character you wish. Nonetheless, thanks for pointing out this project. It looks interesting.
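To illustrate the "IDNA compatible" part of this point, here is a minimal sketch, not dnstwist's actual code. The helper name `to_ace` is hypothetical. It relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so passing this check is at most a necessary condition: it says nothing about whether a given TLD registry would accept the name.

```python
from typing import Optional


def to_ace(domain: str) -> Optional[str]:
    """Return the ASCII-compatible (punycode) form of a domain.

    Returns None when Python's built-in IDNA 2003 codec rejects the name.
    Assumption: this only tests encodability, not registry policy.
    """
    try:
        return domain.encode("idna").decode("ascii")
    except UnicodeError:
        return None


if __name__ == "__main__":
    # "ex\u0430mple.com" replaces the Latin "a" with Cyrillic "а" (U+0430).
    for candidate in ("example.com", "ex\u0430mple.com", "bad..label.com"):
        print(repr(candidate), "->", to_ace(candidate))
```

A candidate that encodes cleanly can still be refused at registration time, which is why a hand-verified static table is a reasonable design choice here.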

darcosion commented 4 years ago

Thanks for the explanation. If I understand correctly, without a specific TLD (a Chinese TLD, for example), it's difficult to get an IDN that is useful for a homoglyph attack. 👍

Well, maybe a "TLD authority detector" that builds a per-TLD set of homoglyph mappings could be a good enhancement for this tool? For example, here: https://github.com/elceef/dnstwist/blob/master/dnstwist.py#L320 the Greek letter "ο" (omicron, U+03BF) isn't used, yet with some resolvers it could work, if I refer to the EURid list of accepted homoglyphs: https://eurid.eu/media/filer_public/7a/3b/7a3baaeb-cbf6-4840-8ddc-097a83a91b79/idna2008and_homoglyph_bundling_tables.pdf

Anyway, thanks again for the explanation. I think we can close this issue, and if you want to discuss a per-TLD homoglyph set, I'm available by mail. 😊

elceef commented 4 years ago

In general, mixing characters from different Unicode scripts (like Latin and Greek) is forbidden.
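A rough way to see this rule in action is sketched below. The Python standard library does not expose the Unicode Script property, so this sketch approximates a character's script by the first word of its Unicode name ("LATIN", "GREEK", "CYRILLIC"). That heuristic and both function names are assumptions made for illustration. Real registries apply their own, more detailed rules.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the set of scripts used in a single domain label.

    Heuristic: the first word of the Unicode character name stands in
    for the script, since the stdlib has no Script property lookup.
    """
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphen are shared across scripts
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True when the label draws letters from more than one script."""
    return len(scripts_in_label(label)) > 1


if __name__ == "__main__":
    # "g\u03bfogle" swaps one Latin "o" for Greek omicron (U+03BF).
    for label in ("google", "g\u03bfogle"):
        print(repr(label), sorted(scripts_in_label(label)), is_mixed_script(label))
```

Under this check, a single omicron dropped into an otherwise Latin label is flagged as mixed, which matches the restriction described above.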

darcosion commented 4 years ago

You're right, I found a unicode.org reference about that: https://www.unicode.org/reports/tr36/idn-chars.html

So maybe building a set of homoglyphs for every script could be interesting work for the future? I'm going to investigate that, see you soon. ;)

elceef commented 4 years ago

I've considered that already. The problem is that since you can't mix Unicode scripts, you would have to replace every Latin character with a Greek, Cyrillic or other homoglyph, which is usually not possible or results in domains with poor visual similarity.
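The limitation can be sketched as follows. The `LATIN_TO_CYRILLIC` table is a small illustrative set of well-known look-alikes, not dnstwist's verified mapping, and `whole_script_variant` is a hypothetical helper. It succeeds only when every letter in the label has a counterpart in the target script.

```python
from typing import Optional

# Illustrative Latin -> Cyrillic look-alikes (an assumption for this
# sketch, NOT dnstwist's verified mapping).
LATIN_TO_CYRILLIC = {
    "a": "\u0430",  # а
    "c": "\u0441",  # с
    "e": "\u0435",  # е
    "o": "\u043e",  # о
    "p": "\u0440",  # р
    "x": "\u0445",  # х
    "y": "\u0443",  # у
}


def whole_script_variant(label: str) -> Optional[str]:
    """Replace every Latin letter with a Cyrillic look-alike.

    Returns None as soon as one letter has no counterpart, because a
    partial replacement would produce a mixed-script label.
    """
    out = []
    for ch in label:
        if ch.isdigit() or ch == "-":
            out.append(ch)  # shared characters pass through unchanged
        elif ch in LATIN_TO_CYRILLIC:
            out.append(LATIN_TO_CYRILLIC[ch])
        else:
            return None
    return "".join(out)


if __name__ == "__main__":
    for label in ("pay", "google"):
        print(label, "->", whole_script_variant(label))
```

With this table, "pay" converts fully while "google" does not, because "g" and "l" have no entry. Most real brand names fall into the second case.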