ala-csis closed this issue 4 years ago
This behaviour (@ala-csis's results above) seems quite wrong.
Apart from a startup cost, I'd expect tld to scale linearly, as the analysis of one domain should rely only on that one domain and the (constant) Public Suffix List.
This needs some digging into.
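To make the expectation concrete, here is a rough sketch of how the startup cost could be separated from the per-domain cost. Everything in it is an assumption for illustration: load_suffixes() and toy_tld() are hypothetical stand-ins with a tiny hard-coded suffix set, not Richkit's code and not the real Public Suffix List. If the per-domain time stays roughly flat as the batch grows, total runtime is linear.

```python
import time


def load_suffixes():
    # Hypothetical stand-in for the one-off startup cost
    # (e.g. parsing the Public Suffix List). Hard-coded toy data.
    return frozenset({"com", "org", "uk", "co.uk"})


def toy_tld(domain, suffixes):
    # Hypothetical lookup: try the longest candidate suffix first,
    # fall back to the last label if nothing matches.
    labels = domain.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return candidate
    return labels[-1]


def per_domain_seconds(n, suffixes):
    # Time n lookups and report the average cost of one lookup.
    domains = ["host%d.example.co.uk" % i for i in range(n)]
    start = time.perf_counter()
    for d in domains:
        toy_tld(d, suffixes)
    return (time.perf_counter() - start) / n


if __name__ == "__main__":
    t0 = time.perf_counter()
    suffixes = load_suffixes()
    print("startup: %.6f s" % (time.perf_counter() - t0))
    for n in (1, 10, 100, 1000):
        print("%d domains: %.9f s per domain" % (n, per_domain_seconds(n, suffixes)))
```

Timings for very small batches are noisy, so repeating each measurement a few times and taking the minimum would probably give a fairer picture.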
Running off develop:
$ python runtime.py
1 domains processed:
split(): 0.0008974075317382812 s
Richkit: 0.0007765293121337891 s
10 domains processed:
split(): 0.0009582042694091797 s
Richkit: 0.0012447834014892578 s
100 domains processed:
split(): 0.0008704662322998047 s
Richkit: 0.0012927055358886719 s
1000 domains processed:
split(): 0.0012729167938232422 s
Richkit: 0.005160093307495117 s
@ala-csis Do you wanna weigh in on whether or not this is done (For now)?
@kidmose Thanks for the update. I'm aware of it, but can we leave the issue open for another week or so? I haven't had the chance to look at it, as I'm going over some of the material DTU and I are working on.
Great, thanks. These are the numbers on my side.
1 domains processed:
split(): 0.0008490085601806641 s
Richkit: 0.000591278076171875 s
10 domains processed:
split(): 0.0004780292510986328 s
Richkit: 0.0005376338958740234 s
100 domains processed:
split(): 0.0005049705505371094 s
Richkit: 0.0007908344268798828 s
1000 domains processed:
split(): 0.0008392333984375 s
Richkit: 0.0032606124877929688 s
The runtime complexity of some of the functions may be prohibitive. Consider that of richkit.analyze.tld() as an example. Is this really due to the fact that TLDs are intrinsically difficult to compute (e.g., by accounting for examples such as *.co.uk) or could this be streamlined?
The output of the attached code (see below) is as follows:
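On the question of whether wildcard-style rules make this intrinsically expensive: a minimal sketch of Public Suffix List style matching suggests they need not. The rule set below is a small hard-coded illustration, and public_suffix() is a hypothetical helper, not richkit.analyze.tld(). With the rules held in a set, the work per domain depends only on how many labels that domain has, not on how many domains are processed or how long the rule list is.

```python
# Illustrative, hard-coded subset of PSL-style rules (an assumption,
# not Richkit's data): plain rules, a wildcard rule, an exception rule.
RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def public_suffix(domain, rules=RULES):
    """Return the public suffix of a domain using PSL-style matching."""
    labels = domain.lower().strip(".").split(".")
    best = labels[-1:]  # implicit default rule "*": the last label
    for i in range(len(labels)):
        candidate = labels[i:]
        name = ".".join(candidate)
        # Same candidate with its leftmost label replaced by "*".
        wildcard = ".".join(["*"] + candidate[1:])
        if "!" + name in rules:
            # Exception rule wins: suffix is the rule minus its leftmost label.
            return ".".join(candidate[1:])
        if (name in rules or wildcard in rules) and len(candidate) > len(best):
            best = candidate  # keep the longest matching rule
    return ".".join(best)


print(public_suffix("www.example.co.uk"))  # co.uk
print(public_suffix("foo.bar.ck"))         # bar.ck (matched by *.ck)
print(public_suffix("www.ck"))             # ck (exception rule !www.ck)
```

If a lookup like this is fast in isolation, the slowdown seen at 1000 domains is more likely to come from work repeated per call (for example re-reading the list) than from the matching itself, though that is a guess until it is profiled.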