analyse.tld does not scale linearly

aau-network-security / richkit

Domain Enrichment Toolkit

$ pip install richkit

https://pypi.org/project/richkit/

MIT License


analyse.tld does not scale linearly #66

Closed ala-csis closed 4 years ago

ala-csis commented 4 years ago

The runtime complexity of some of the functions may be prohibitive. Take richkit.analyse.tld() as an example. Is this really because TLDs are intrinsically difficult to compute (e.g., because of suffixes such as *.co.uk), or could this be streamlined?

The output of the attached code (see below) is as follows:

1 domains processed:
split(): 0.0009176731109619141 s
Richkit: 0.017614364624023438 s

10 domains processed:
split(): 0.0008223056793212891 s
Richkit: 0.1530759334564209 s

100 domains processed:
split(): 0.0008115768432617188 s
Richkit: 1.5235605239868164 s

1000 domains processed:
split(): 0.001218557357788086 s
Richkit: 15.542202234268188 s

Benchmarking code: runtime.py.txt (remove .txt extension)
Data: domains.csv.txt (remove .txt extension)
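The attached runtime.py is not reproduced in this thread. As a rough idea of the kind of comparison reported above, a minimal timing harness could look like the sketch below. The names naive_tld and time_extractor and the sample domains are made up for illustration and are not taken from the attachment; to repeat the measurement, the function under test would be passed in place of the baseline.

```python
import time
from typing import Callable, Iterable


def naive_tld(domain: str) -> str:
    """Baseline: return the last dot-separated label.

    This is cheap but wrong for multi-label suffixes such as co.uk.
    """
    return domain.rstrip(".").split(".")[-1]


def time_extractor(extract: Callable[[str], str], domains: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling extract on every domain."""
    start = time.perf_counter()
    for domain in domains:
        extract(domain)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical sample data; the issue used domains.csv instead.
    sample = ["www.example.com", "example.co.uk", "sub.example.org"]
    for n in (1, 10, 100, 1000):
        batch = [sample[i % len(sample)] for i in range(n)]
        print(f"{n} domains processed:")
        print(f"split(): {time_extractor(naive_tld, batch)} s")
```

Dividing each total by the number of domains gives a per-domain cost, which makes it easier to tell a large constant factor apart from super-linear growth.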

kidmose commented 4 years ago

This behaviour (@ala-csis's results above) seems quite wrong.

Apart from a startup cost, I'd expect tld to scale linearly, since the analysis of one domain should depend only on that domain and the (constant) Public Suffix List. This needs some digging into.

kidmose commented 4 years ago

Running on the develop branch:

$ python runtime.py 
1 domains processed:
split(): 0.0008974075317382812 s
Richkit: 0.0007765293121337891 s

10 domains processed:
split(): 0.0009582042694091797 s
Richkit: 0.0012447834014892578 s

100 domains processed:
split(): 0.0008704662322998047 s
Richkit: 0.0012927055358886719 s

1000 domains processed:
split(): 0.0012729167938232422 s
Richkit: 0.005160093307495117 s

kidmose commented 4 years ago

@ala-csis Do you wanna weigh in on whether or not this is done (for now)?

ala-csis commented 4 years ago

@kidmose Thanks for the update. I'm aware of it, but can we leave the issue open for another week or so? I haven't had the chance to look at it, as I'm going over some of the material DTU and I are working on.

ala-csis commented 4 years ago

Great, thanks. These are the numbers on my side.

1 domains processed:
split(): 0.0008490085601806641 s
Richkit: 0.000591278076171875 s

10 domains processed:
split(): 0.0004780292510986328 s
Richkit: 0.0005376338958740234 s

100 domains processed:
split(): 0.0005049705505371094 s
Richkit: 0.0007908344268798828 s

1000 domains processed:
split(): 0.0008392333984375 s
Richkit: 0.0032606124877929688 s