elceef / dnstwist

Domain name permutation engine for detecting homograph phishing attacks, typo squatting, and brand impersonation
https://dnstwist.it
Apache License 2.0
4.76k stars 760 forks source link

Tweaks #127

Closed monoidic closed 3 years ago

monoidic commented 3 years ago
elceef commented 3 years ago

Let me refer to your proposed changes briefly.

I have nothing against type hints, but I would prefer to wait until they are widely adopted and thus importing additional modules or objects won't be required. Especially considering the fact type hints do not bring any actual functionality. Please correct me if I'm wrong.

In practice, most of the data structures that you moved to the global namespace are specific and dedicated to the DomainFuzz class. I don't actually see the point in this change.

When it comes to optimizations, unfortunately the introduced changes significantly reduced the speed of the code within the generation of domain name permutations (see below). Mainly because it does not take advantage of python-idna module anymore, although other changes may also have a marginal impact.

$ time ./dnstwist.py --format list averylongdomainname.com >/dev/null
real    0m4,406s
user    0m4,329s
sys 0m0,076s
$ git checkout monoidic-master 
Switched to branch 'monoidic-master'
$ time ./dnstwist.py --format list averylongdomainname.com >/dev/null
real    0m5,734s
user    0m5,693s
sys 0m0,040s

Overall, the changes do not introduce speed gains (quite the opposite), but are rather cosmetic or visual and in fact make the code look a bit more clean or "pythonic" to some extent. Now I can see the impact of my C habits when was working on the fuzzing algorithms years ago :)

That being said, I'm not going to accept this pull request. However, I do like some of your changes, especially around fuzzing functions, and would like to cherry pick some of your ideas if you don't mind. On the other hand, I'd like to give you the credit for your contribution. Any ideas?

monoidic commented 3 years ago

I admit that the code may be a bit more code-golf-y than is suitable, especially in places with generator expressions and list/set comprehensions. Though here's some justification for some of the other changes. After adding back the idna module, these are my speed results (zsh syntax for time output). Current HEAD:

$ for i in {1..10}; do time ./dnstwist.py --format list averylongdomain.com > /dev/null; done
./dnstwist.py --format list averylongdomain.com > /dev/null  4,30s user 0,03s system 99% cpu 4,323 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,14s user 0,03s system 99% cpu 4,177 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,01s user 0,03s system 99% cpu 4,044 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,47s user 0,04s system 99% cpu 4,502 total
./dnstwist.py --format list averylongdomain.com > /dev/null  3,72s user 0,03s system 99% cpu 3,756 total
./dnstwist.py --format list averylongdomain.com > /dev/null  5,02s user 0,04s system 99% cpu 5,054 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,46s user 0,03s system 99% cpu 4,493 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,58s user 0,02s system 99% cpu 4,601 total
./dnstwist.py --format list averylongdomain.com > /dev/null  3,76s user 0,03s system 99% cpu 3,795 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,96s user 0,04s system 99% cpu 4,996 total

This PR, minus the idna module removal:

$ for i in {1..10}; do time ./dnstwist.py --format list averylongdomain.com > /dev/null; done
./dnstwist.py --format list averylongdomain.com > /dev/null  3,50s user 0,04s system 99% cpu 3,534 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,08s user 0,04s system 99% cpu 4,116 total
./dnstwist.py --format list averylongdomain.com > /dev/null  3,68s user 0,03s system 99% cpu 3,707 total
./dnstwist.py --format list averylongdomain.com > /dev/null  3,90s user 0,03s system 99% cpu 3,935 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,31s user 0,02s system 99% cpu 4,334 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,08s user 0,04s system 99% cpu 4,118 total
./dnstwist.py --format list averylongdomain.com > /dev/null  4,73s user 0,03s system 99% cpu 4,763 total
./dnstwist.py --format list averylongdomain.com > /dev/null  3,43s user 0,03s system 99% cpu 3,457 total
./dnstwist.py --format list averylongdomain.com > /dev/null  3,68s user 0,03s system 99% cpu 3,715 total
./dnstwist.py --format list averylongdomain.com > /dev/null  3,92s user 0,04s system 99% cpu 3,959 total

At the very least, making the two multiline strings with the dnstwist logo in them into raw strings to avoid unintentionally treating \ as an escape character won't hurt. They currently raise DeprecationWarnings, like dnstwist.py:4: DeprecationWarning: invalid escape sequence \/ The reason for moving constants out of functions is so they are not reconstructed on successive calls to the function. Here's a simple example:

>>> import dis
>>> def f():
...     x = {'a': 'b', 'c': 'd'}
...     return x['a']
... 
>>> dis.dis(f)
  2           0 LOAD_CONST               1 ('b')
              2 LOAD_CONST               2 ('d')
              4 LOAD_CONST               3 (('a', 'c'))
              6 BUILD_CONST_KEY_MAP      2
              8 STORE_FAST               0 (x)

  3          10 LOAD_FAST                0 (x)
             12 LOAD_CONST               4 ('a')
             14 BINARY_SUBSCR
             16 RETURN_VALUE
>>> m = {'a': 'b', 'c': 'd'}
>>> def f2():
...     return m['a']
... 
>>> dis.dis(f2)
  2           0 LOAD_GLOBAL              0 (m)
              2 LOAD_CONST               1 ('a')
              4 BINARY_SUBSCR
              6 RETURN_VALUE

Here, the x variable in f is reconstructed on each call to the function, as can be seen in the disassembly, while in f2, the constant is initialized outside of the function and is only accessed as a global variable. The disassembly is, of course, larger for the __homoglyph method, considering the size of glyphs, and similarly for DomainFuzz's __init__, where large constants are reconstructed on each call prior to the useful code running. Considering the code duplication, for the two passes in __homoglyph, it should probably be extracted to a different method. Also, I added õ to the glyphs for o. The magic numbers in __bitsquatting for ASCII string values wouldn't be hurt by being replaced with something more readable. Some of the iteration over range(len(x)) could be replaced with iterating either over x itself or enumerate(x). And range(0, x) can just be shortened to range(x) in many places. Why do the __homoglyph passes use while loops instead of for loops over a range if j isn't modified in the loop body? __vowel_swap is checking a condition a bit too late:

def __vowel_swap(self):
    vowels = 'aeiou'
    result = []
    for i in range(0, len(self.domain)):
        for vowel in vowels:
            if self.domain[i] in vowels:
                result.append(self.domain[:i] + vowel + self.domain[i+1:])
    return list(set(result))

The for vowel in vowels and if self.domain[i] in vowels should be swapped, so that you check the if statement len(vowels) times fewer times. Plus collecting data in a list, then deduplicating it with list(set(x)), could be made more efficient by just collecting data in a set to begin with and then converting it into a list afterwards. Or you can easily avoid generating duplicates in the first place in some cases, such as by avoiding replacing a character with itself in __replacement. Having nested function definitions without a clear reason (like for closures) seems like bad code organization as well.

elceef commented 3 years ago

Let's focus for now just on whether to make constants global. I think you oversimplified your code snippet too much and unnecessarily reduced it only to Python bytecode. I slightly changed the code to better reflect the actual implementation while still keeping it simple. As you can see, speed gains are negligible. In practice you can't tell the difference.

>>> import timeit
>>> 
>>> def f():
...     x = {'a': 'b', 'c': 'd'}
...     for i in range(100000):
...             c = x.get('a')
... 
>>> m = {'a': 'b', 'c': 'd'}
>>> def _f2():
...     for i in range(1000):
...             c = m.get('a')
... 
>>> def f2():
...     _f2()
...     for i in range(100):
...             _f2()
... 
>>> timeit.timeit('f()', number=10, setup='from __main__ import f')
0.08223409298807383
>>> timeit.timeit('f2()', number=10, setup='from __main__ import f2')
0.08814632496796548
monoidic commented 3 years ago

Very well. But now let's try this with a larger object, such as with the one from the original code. The results are clear.

local results: 0.010299316025339067
global results: 0.00017111399210989475

My code:

#!/usr/bin/env python3

import timeit

global_glyphs = {
    '2': ['ƻ'],
    '5': ['ƽ'],
    'a': ['à', 'á', 'à', 'â', 'ã', 'ä', 'å', 'ɑ', 'ạ', 'ǎ', 'ă', 'ȧ', 'ą'],
    'b': ['d', 'lb', 'ʙ', 'ɓ', 'ḃ', 'ḅ', 'ḇ', 'ƅ'],
    'c': ['e', 'ƈ', 'ċ', 'ć', 'ç', 'č', 'ĉ', 'ᴄ'],
    'd': ['b', 'cl', 'dl', 'ɗ', 'đ', 'ď', 'ɖ', 'ḑ', 'ḋ', 'ḍ', 'ḏ', 'ḓ'],
    'e': ['c', 'é', 'è', 'ê', 'ë', 'ē', 'ĕ', 'ě', 'ė', 'ẹ', 'ę', 'ȩ', 'ɇ', 'ḛ'],
    'f': ['ƒ', 'ḟ'],
    'g': ['q', 'ɢ', 'ɡ', 'ġ', 'ğ', 'ǵ', 'ģ', 'ĝ', 'ǧ', 'ǥ'],
    'h': ['lh', 'ĥ', 'ȟ', 'ħ', 'ɦ', 'ḧ', 'ḩ', 'ⱨ', 'ḣ', 'ḥ', 'ḫ', 'ẖ'],
    'i': ['1', 'l', 'í', 'ì', 'ï', 'ı', 'ɩ', 'ǐ', 'ĭ', 'ỉ', 'ị', 'ɨ', 'ȋ', 'ī', 'ɪ'],
    'j': ['ʝ', 'ǰ', 'ɉ', 'ĵ'],
    'k': ['lk', 'ik', 'lc', 'ḳ', 'ḵ', 'ⱪ', 'ķ', 'ᴋ'],
    'l': ['1', 'i', 'ɫ', 'ł'],
    'm': ['n', 'nn', 'rn', 'rr', 'ṁ', 'ṃ', 'ᴍ', 'ɱ', 'ḿ'],
    'n': ['m', 'r', 'ń', 'ṅ', 'ṇ', 'ṉ', 'ñ', 'ņ', 'ǹ', 'ň', 'ꞑ'],
    'o': ['0', 'ȯ', 'ọ', 'ỏ', 'ơ', 'ó', 'ö', 'ᴏ', 'õ'],
    'p': ['ƿ', 'ƥ', 'ṕ', 'ṗ'],
    'q': ['g', 'ʠ'],
    'r': ['ʀ', 'ɼ', 'ɽ', 'ŕ', 'ŗ', 'ř', 'ɍ', 'ɾ', 'ȓ', 'ȑ', 'ṙ', 'ṛ', 'ṟ'],
    's': ['ʂ', 'ś', 'ṣ', 'ṡ', 'ș', 'ŝ', 'š', 'ꜱ'],
    't': ['ţ', 'ŧ', 'ṫ', 'ṭ', 'ț', 'ƫ'],
    'u': ['ᴜ', 'ǔ', 'ŭ', 'ü', 'ʉ', 'ù', 'ú', 'û', 'ũ', 'ū', 'ų', 'ư', 'ů', 'ű', 'ȕ', 'ȗ', 'ụ'],
    'v': ['ṿ', 'ⱱ', 'ᶌ', 'ṽ', 'ⱴ', 'ᴠ'],
    'w': ['vv', 'ŵ', 'ẁ', 'ẃ', 'ẅ', 'ⱳ', 'ẇ', 'ẉ', 'ẘ', 'ᴡ'],
    'x': ['ẋ', 'ẍ'],
    'y': ['ʏ', 'ý', 'ÿ', 'ŷ', 'ƴ', 'ȳ', 'ɏ', 'ỿ', 'ẏ', 'ỵ'],
    'z': ['ʐ', 'ż', 'ź', 'ᴢ', 'ƶ', 'ẓ', 'ẕ', 'ⱬ']
}

def global_access(key: str) -> list[str]:
    return global_glyphs[key]

def local_access(key: str) -> list[str]:
    local_glyphs = {
        '2': ['ƻ'],
        '5': ['ƽ'],
        'a': ['à', 'á', 'à', 'â', 'ã', 'ä', 'å', 'ɑ', 'ạ', 'ǎ', 'ă', 'ȧ', 'ą'],
        'b': ['d', 'lb', 'ʙ', 'ɓ', 'ḃ', 'ḅ', 'ḇ', 'ƅ'],
        'c': ['e', 'ƈ', 'ċ', 'ć', 'ç', 'č', 'ĉ', 'ᴄ'],
        'd': ['b', 'cl', 'dl', 'ɗ', 'đ', 'ď', 'ɖ', 'ḑ', 'ḋ', 'ḍ', 'ḏ', 'ḓ'],
        'e': ['c', 'é', 'è', 'ê', 'ë', 'ē', 'ĕ', 'ě', 'ė', 'ẹ', 'ę', 'ȩ', 'ɇ', 'ḛ'],
        'f': ['ƒ', 'ḟ'],
        'g': ['q', 'ɢ', 'ɡ', 'ġ', 'ğ', 'ǵ', 'ģ', 'ĝ', 'ǧ', 'ǥ'],
        'h': ['lh', 'ĥ', 'ȟ', 'ħ', 'ɦ', 'ḧ', 'ḩ', 'ⱨ', 'ḣ', 'ḥ', 'ḫ', 'ẖ'],
        'i': ['1', 'l', 'í', 'ì', 'ï', 'ı', 'ɩ', 'ǐ', 'ĭ', 'ỉ', 'ị', 'ɨ', 'ȋ', 'ī', 'ɪ'],
        'j': ['ʝ', 'ǰ', 'ɉ', 'ĵ'],
        'k': ['lk', 'ik', 'lc', 'ḳ', 'ḵ', 'ⱪ', 'ķ', 'ᴋ'],
        'l': ['1', 'i', 'ɫ', 'ł'],
        'm': ['n', 'nn', 'rn', 'rr', 'ṁ', 'ṃ', 'ᴍ', 'ɱ', 'ḿ'],
        'n': ['m', 'r', 'ń', 'ṅ', 'ṇ', 'ṉ', 'ñ', 'ņ', 'ǹ', 'ň', 'ꞑ'],
        'o': ['0', 'ȯ', 'ọ', 'ỏ', 'ơ', 'ó', 'ö', 'ᴏ', 'õ'],
        'p': ['ƿ', 'ƥ', 'ṕ', 'ṗ'],
        'q': ['g', 'ʠ'],
        'r': ['ʀ', 'ɼ', 'ɽ', 'ŕ', 'ŗ', 'ř', 'ɍ', 'ɾ', 'ȓ', 'ȑ', 'ṙ', 'ṛ', 'ṟ'],
        's': ['ʂ', 'ś', 'ṣ', 'ṡ', 'ș', 'ŝ', 'š', 'ꜱ'],
        't': ['ţ', 'ŧ', 'ṫ', 'ṭ', 'ț', 'ƫ'],
        'u': ['ᴜ', 'ǔ', 'ŭ', 'ü', 'ʉ', 'ù', 'ú', 'û', 'ũ', 'ū', 'ų', 'ư', 'ů', 'ű', 'ȕ', 'ȗ', 'ụ'],
        'v': ['ṿ', 'ⱱ', 'ᶌ', 'ṽ', 'ⱴ', 'ᴠ'],
        'w': ['vv', 'ŵ', 'ẁ', 'ẃ', 'ẅ', 'ⱳ', 'ẇ', 'ẉ', 'ẘ', 'ᴡ'],
        'x': ['ẋ', 'ẍ'],
        'y': ['ʏ', 'ý', 'ÿ', 'ŷ', 'ƴ', 'ȳ', 'ɏ', 'ỿ', 'ẏ', 'ỵ'],
        'z': ['ʐ', 'ż', 'ź', 'ᴢ', 'ƶ', 'ẓ', 'ẕ', 'ⱬ']
    }
    return local_glyphs[key]

if __name__ == '__main__':
    print('global results:',
          timeit.timeit('global_access("a")', number=1000, setup='from test import global_access'))

    print('local results:',
          timeit.timeit('local_access("a")', number=1000, setup='from test import local_access'))

A test more similar to yours has similar results:

local results: 0.11596585100051016
global results: 0.10844587592873722

Code (glyph definitions omitted for brevity):


#!/usr/bin/env python3

import timeit

global_glyphs = {
...
}

def f(key: str) -> None:
    local_glyphs = {
    ...
    }
    for i in range(100000):
        c = local_glyphs[key]

def _f2(key: str) -> None:
    for i in range(1000):
            c = global_glyphs[key]

def f2(key: str) -> None:
    _f2(key)
    for i in range(100):
            _f2(key)

if __name__ == '__main__':
    print('local results:',
          timeit.timeit('f("a")', number=10, setup='from test import f'))
    print('global results:',
          timeit.timeit('f2("a")', number=10, setup='from test import f2'))
elceef commented 3 years ago

The homoglyph fuzzer doesn't just lookup and return a variable but among other things refer to it many times within the loops. Your original code snipped, regardless the size of the variable, doesn't reflect the actual implementation in any way. That being said, I don't see speed gains justifying such a code refactoring, especially these constants are specific to this particular class only.

monoidic commented 3 years ago

Another reason for extracting the glyph table would be for allowing one to provide their own, and having the glyph table provided in the code here simply be a default. Various TLDs have restrictions on the character sets permitted. For instance, .EE (section 3.2.2), .РФ (section 3.1.2.4), .DE (section V). Though this would have to be introduced itself beyond just moving some constants around. And you could still have a solution where you end up constructing it inside of the function, if you are so inclined.

elceef commented 3 years ago

When it comes to IDN restrictions required by TLD authorities, the general rule of thumb is to not mix characters from different scripts (for example latin with greek). In practice the restrictions are quite subtle, plus managing and keeping it up to date is not something I would like to do at the moment.

I have reviewed your changes. I'd love to merge the following:

Please note that I've recently reworked a few methods of DomainFuzz class so pull the most recent version first.

monoidic commented 3 years ago

managing and keeping it up to date is not something I would like to do at the moment

Hence the suggestion to make it user-configurable, similarly to tld_dictionary. I will look at updating the PR soon.

elceef commented 3 years ago

Kind reminder about PR :)

elceef commented 3 years ago

Thank you.