elceef / dnstwist

Domain name permutation engine for detecting homograph phishing attacks, typo squatting, and brand impersonation
https://dnstwist.it
Apache License 2.0

Feature Request: untwist #141

Closed mgrant0 closed 1 year ago

mgrant0 commented 2 years ago

I'd like to have a way to do the reverse of dnstwist. I'd like to be able to give it a domain name like tes1a.com and have it tell me this domain is similar to tesla.com (and perhaps other registered domains that it could be similar to).

I did try running tes1a.com through dnstwist https://dnstwister.report/ but it did not come up with tesla.com as one of the twists of tes1a.com.
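
The requested "untwist" direction could be sketched roughly as follows. This is only an illustration of the idea, not how dnstwist works: the `REVERSE_GLYPHS` table, the `untwist` function, and the list of known domains are all hypothetical names made up for this example, and the table is deliberately tiny.

```python
from itertools import product

# Hypothetical, deliberately small table: a look-alike character mapped to
# the ASCII letters it may be imitating. A real table would be much larger.
REVERSE_GLYPHS = {
    "1": ["l", "i"],
    "0": ["o"],
    "5": ["s"],
    "3": ["e"],
}

def untwist(domain, known_domains):
    """Return the known domains that `domain` could be imitating."""
    name, dot, tld = domain.partition(".")
    # Each position keeps its own character and adds any letters it may imitate.
    options = [[ch] + REVERSE_GLYPHS.get(ch, []) for ch in name]
    candidates = {"".join(combo) + dot + tld for combo in product(*options)}
    candidates.discard(domain)  # the input itself is not a match
    return sorted(candidates & set(known_domains))

print(untwist("tes1a.com", ["tesla.com", "example.com"]))
```

The number of candidates grows multiplicatively with every ambiguous character, so a practical version would need a bounded search or a comparison against a fixed list of brands rather than full enumeration.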

elceef commented 2 years ago

I think the built-in homoglyph mapping works in both directions for ASCII characters, except for this particular case you reported. This is something I can fix quickly.

elceef commented 2 years ago

Added homoglyph in commit 6bf6689d45f2790af04adc76d2342941deb81f0a

mgrant0 commented 2 years ago

Interesting. This isn't what the feature request was about, but I'm glad it helped fix a bug!

I looked at that area of the code and there are quite a few other homoglyphs that I could think of off the top of my head. I'm surprised there isn't a much longer list of these.

Þ for p and maybe b is missing. And there are many upper-case variants of letters that come into play when you use the upper-case version of the domain name. Looking at your table, I suspect quite a lot of homoglyphs are missing.

elceef commented 2 years ago

Is Þ a homoglyph of p or b, or both? If you have more homoglyphs, I will be more than happy to take a look. The homoglyph mapping requires a review anyway. Keep in mind that not every domain can be registered, despite the fact that you can punycode it.
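
To illustrate the last point: converting a name to punycode is purely mechanical and says nothing about whether a registry will accept it. The sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules; the helper names `to_punycode` and `from_punycode` and the sample domain are assumptions made for this example.

```python
def to_punycode(domain):
    # Encode each label to its ASCII-compatible ("xn--") form where needed.
    return domain.encode("idna").decode("ascii")

def from_punycode(domain):
    # Decode "xn--" labels back to their Unicode form.
    return domain.encode("ascii").decode("idna")

encoded = to_punycode("þorn.com")
print(encoded)                 # an "xn--..." label followed by ".com"
print(from_punycode(encoded))  # round-trips to the Unicode spelling
```

Whether such a name can actually be registered depends on the rules of the TLD's registry, which this conversion does not check.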

mgrant0 commented 2 years ago

I would say both! Þ is quite an interesting character! It's the th sound in Icelandic and was in use in Old English but has fallen into disuse. Such a shame, a very useful letter!

I will see if I can find a larger list for you. Surely someone must have compiled a decent list. It'd take hours to comb through the unicode list to recreate that. And then someone would have to know which characters are valid punycode domains.

I managed to find this site: https://util.unicode.org/UnicodeJsps/confusables.jsp

It's linked to from the references on this page http://www.unicode.org/reports/tr39/#Confusable_Detection

Which I got to from this page: http://www.unicode.org/reports/tr39/#Restriction_Level_Detection

These groups have gone to some lengths to discuss and enumerate the problem. It seems like it would be a good idea to go through each ASCII letter (both lower case and upper case), each digit, and the dash (-), enter them into the confusables website (first link above), and use the resulting list as that character's homoglyphs in your table.
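
Instead of entering characters one by one, the same data could in principle be read from the `confusables.txt` file that accompanies UTS #39. Below is a minimal sketch, assuming the file's semicolon-separated layout of source code points, target code points, and a type field; the `SAMPLE` lines are illustrative entries written in that layout, not a verbatim excerpt, and the function names are hypothetical.

```python
import string
from collections import defaultdict

# Illustrative lines in the confusables.txt layout (not a verbatim excerpt).
SAMPLE = """\
0031 ; 006C ; MA  # DIGIT ONE -> LATIN SMALL LETTER L
0430 ; 0061 ; MA  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0441 ; 0063 ; MA  # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
0072 006E ; 006D ; MA  # LATIN SMALL LETTERS R, N -> LATIN SMALL LETTER M
"""

def parse_confusables(text):
    """Map each target string to the set of source strings that resemble it."""
    table = defaultdict(set)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[target].add(source)
    return table

def ascii_homoglyphs(table):
    """Keep only targets that are a single ASCII letter, digit, or hyphen."""
    allowed = set(string.ascii_letters + string.digits + "-")
    return {t: sorted(sources) for t, sources in table.items() if t in allowed}

print(ascii_homoglyphs(parse_confusables(SAMPLE)))
```

The result would still need filtering against what registries actually permit, since the confusables data covers far more characters than can appear in a registrable domain.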

Incidentally the Þ character doesn't show up as confusing for p or b in their database! But it has quite a few others that I didn't notice.

elceef commented 2 years ago

I'm very familiar with these lists. Using all of these characters in my tool would be extremely inefficient. As I mentioned in the README file, what's the point in searching for domain names which can't even be registered?