Closed mgrant0 closed 1 year ago
I think the built-in homoglyph mapping works in both directions for ASCII characters, except for this particular case you reported. This is something I can fix quickly.
Added homoglyph in commit 6bf6689d45f2790af04adc76d2342941deb81f0a
Interesting, this isn't what the feature request was but glad it helped fix a bug!
I looked at that area of the code and there are quite a few other homoglyphs that I could think of off the top of my head. I'm surprised there isn't a much longer list of these.
Þ for p and maybe b is missing. There are also many upper case variants of letters when you use the upper case version of the domain name. Looking at your table, I feel there are probably quite a lot of missing homoglyphs.
Is Þ a homoglyph of p or b, or both? If you have more homoglyphs, I will be more than happy to take a look. The homoglyph mapping requires a review anyway. Keep in mind that not every domain can be registered, despite the fact that you can punycode it.
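To illustrate the point about punycode, here is a minimal sketch using Python's built-in "idna" codec. The lookalike domain is a made-up example, not one taken from this thread. It only shows that a name containing a homoglyph can be encoded; it says nothing about whether a registry would accept it.

```python
# Hypothetical lookalike: "tesla.com" with Cyrillic "а" (U+0430) in place of the Latin "a".
lookalike = "tesl\u0430.com"

# The standard library's "idna" codec converts each non-ASCII label
# to its ASCII-compatible ("xn--") punycode form.
ascii_form = lookalike.encode("idna").decode("ascii")

# Decoding the ASCII form gives back the original Unicode name.
round_trip = ascii_form.encode("ascii").decode("idna")

print(ascii_form)
print(round_trip == lookalike)
```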
I would say both! Þ is quite an interesting character! It's the th sound in Icelandic and was in use in Old English but has fallen into disuse. Such a shame, a very useful letter!
I will see if I can find a larger list for you. Surely someone must have compiled a decent list. It'd take hours to comb through the Unicode list to recreate that. And then someone would have to know which characters are valid in punycode domains.
I managed to find this site: https://util.unicode.org/UnicodeJsps/confusables.jsp
It's linked to from the references on this page http://www.unicode.org/reports/tr39/#Confusable_Detection
Which I got to from this page: http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
These groups have gone to some lengths to discuss and enumerate the problem. It seems like it would be a good idea to go through each ASCII letter (both lower case and upper case), the digits, and the dash (-), enter them into the confusables website (first link above), and use the resulting list for that character's homoglyphs in your table.
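As a rough sketch of that idea: given (lookalike, target) pairs such as a confusables lookup might return, they could be grouped into a per-character table. The pairs below are a few well-known lookalikes picked by hand for illustration, not copied from the Unicode data, and `build_homoglyph_table` is a hypothetical helper, not part of dnstwist.

```python
import string
from collections import defaultdict

# Hand-picked (lookalike, ASCII target) pairs, for illustration only.
SAMPLE_PAIRS = [
    ("\u0430", "a"),  # Cyrillic small a
    ("\u0435", "e"),  # Cyrillic small ie
    ("\u043e", "o"),  # Cyrillic small o
    ("\u0440", "p"),  # Cyrillic small er
    ("\u0441", "c"),  # Cyrillic small es
    ("\u03bf", "o"),  # Greek small omicron
]

# Characters the comment proposes to cover: ASCII letters, digits, and the dash.
DOMAIN_CHARS = set(string.ascii_letters + string.digits + "-")


def build_homoglyph_table(pairs):
    """Group lookalike characters under the ASCII character they resemble."""
    table = defaultdict(list)
    for lookalike, target in pairs:
        if target in DOMAIN_CHARS:
            table[target].append(lookalike)
    return dict(table)


table = build_homoglyph_table(SAMPLE_PAIRS)
for ascii_char, lookalikes in sorted(table.items()):
    print(ascii_char, [f"U+{ord(ch):04X}" for ch in lookalikes])
```

A real table would still need filtering, since many confusable characters cannot appear in a registrable domain name.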
Incidentally the Þ character doesn't show up as confusing for p or b in their database! But it has quite a few others that I didn't notice.
I'm very familiar with these lists. Using all of these characters in my tool would be extremely inefficient. As I mentioned in the README file, what's the point in searching for domain names which can't even be registered?
I'd like to have a way to do the reverse of dnstwist. I'd like to be able to give it a domain name like tes1a.com and have it tell me this domain is similar to tesla.com (and perhaps other registered domains that it could be similar to).
I did try running tes1a.com through dnstwist (https://dnstwister.report/), but it did not come up with tesla.com as one of the twists of tes1a.com.
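One way the requested reverse lookup could work is sketched below. This is not an existing dnstwist feature. The fold table, the list of known domains, and the function names are all assumptions made up for illustration. The idea is to fold lookalike characters to a canonical form and compare the result against domains you already care about.

```python
# Hypothetical fold table: lookalike character -> canonical ASCII character.
FOLD = {
    "1": "l",
    "0": "o",
    "\u0430": "a",  # Cyrillic small a
    "\u043e": "o",  # Cyrillic small o
}

# Hypothetical list of legitimate domains to compare against.
KNOWN_DOMAINS = ["tesla.com", "google.com", "paypal.com"]


def skeleton(domain):
    """Lower-case the name and replace each lookalike with its canonical character."""
    return "".join(FOLD.get(ch, ch) for ch in domain.lower())


def find_similar(domain, known=KNOWN_DOMAINS):
    """Return known domains whose skeleton matches the given domain's skeleton."""
    target = skeleton(domain)
    return [k for k in known if k != domain and skeleton(k) == target]


print(find_similar("tes1a.com"))
print(find_similar("g00gle.com"))
```

This only catches character substitutions. Typos such as omitted or swapped letters would need a different comparison, for example an edit-distance check against the same list.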