GerbenJavado / LinkFinder

A python script that finds endpoints in JavaScript files
https://gerbenjavado.com/discovering-hidden-content-using-linkfinder
MIT License
3.64k stars, 588 forks

Improve performance of context search #45

Closed Bankde closed 5 years ago

Bankde commented 5 years ago

Issue

Searching for links with context is too slow.

Cause and solution

Result

New version

    $ time (python linkfinder.py -i https://www.nu.nl/ -d)

    real    0m37.334s
    user    0m16.256s
    sys     0m0.092s

Old version

    $ time (python linkfinder.py -i https://www.nu.nl/ -d)

    real    6m36.491s
    user    5m58.722s
    sys     0m0.083s

More

    import re

    # Each link in the HTML output looks like <a href='...' class='text'>.
    regex = re.compile("<a href='([^'\n]*)' class='text'>")
    d = {}  # link -> 1 (seen only in new_today.html) or 2 (seen in old_today.html)

    # Links found in new_today.html; report duplicates and dump them to tmp1.
    with open("new_today.html", "r") as f, open("tmp1", "w+") as w1:
        for item in re.findall(regex, f.read()):
            if item in d:
                print("Dupe %s" % (item))
            d[item] = 1
            w1.write(item + "\n")

    # Links found in old_today.html; report any that the new output lacks.
    with open("old_today.html", "r") as f, open("tmp2", "w+") as w2:
        for item in re.findall(regex, f.read()):
            if item not in d:
                print("Miss from new %s" % (item))
            d[item] = 2
            w2.write(item + "\n")

    # Anything still marked 1 never appeared in old_today.html.
    for item in d:
        if d[item] == 1:
            print("Miss from old %s" % (item))

    print("Done")
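The same check can also be written with set differences. This is only a sketch: it assumes both files are LinkFinder HTML outputs and that duplicates do not matter. `compare_outputs` is a hypothetical helper name, not part of LinkFinder.

```python
import re

# Same pattern as the script above: one captured href per result link.
LINK_RE = re.compile(r"<a href='([^'\n]*)' class='text'>")


def compare_outputs(new_html, old_html):
    """Return (missing_from_new, missing_from_old) as sorted lists of links."""
    new_links = set(LINK_RE.findall(new_html))
    old_links = set(LINK_RE.findall(old_html))
    # Links only the old output has are missing from the new one, and vice versa.
    return sorted(old_links - new_links), sorted(new_links - old_links)


if __name__ == "__main__":
    # Tiny inline samples standing in for new_today.html and old_today.html.
    new_sample = "<a href='/a' class='text'><a href='/b' class='text'>"
    old_sample = "<a href='/b' class='text'><a href='/c' class='text'>"
    print(compare_outputs(new_sample, old_sample))
```

Two empty lists mean both versions found exactly the same set of links.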



Further action

- I would love any recommendations, tests, or feedback before we merge into main. I don't want anything to crash :P

Bankde commented 5 years ago

Btw, because of the way our original regex is written, I have to use m.group(1) instead of m.group(0) with re.finditer, because group 0 also includes the surrounding double or single quotes. I already documented this in the description, but I still feel unsure. Should I go back to regex_group_name, stay like this, or is there a better option?

Neither choice affects the original LinkFinder, but it will affect anyone who decides to import our module and use their own custom regex.
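To make the trade-off concrete, here is a minimal sketch of the three options with `re.finditer`. The patterns are simplified, hypothetical stand-ins (LinkFinder's real regex is much larger), and the group name `link` and the helper `find_links` are assumptions for illustration only.

```python
import re

# Hypothetical pattern for illustration: like the original regex, it matches
# the surrounding quote characters and captures the endpoint in group 1.
QUOTED_PATH = re.compile(r"""["'](/[A-Za-z0-9_\-/.]+)["']""")

# Same pattern with a named group, in the style of the regex_group_name option.
NAMED_PATH = re.compile(r"""["'](?P<link>/[A-Za-z0-9_\-/.]+)["']""")


def find_links(pattern, text, group=1):
    """Return the requested group from every match of pattern in text."""
    return [m.group(group) for m in pattern.finditer(text)]


js = """fetch("/api/v1/users"); var u = '/static/app.js';"""

print(find_links(QUOTED_PATH, js, 0))      # whole match, quotes included
print(find_links(QUOTED_PATH, js, 1))      # first capture group, quotes stripped
print(find_links(NAMED_PATH, js, "link"))  # named group, independent of position
```

With a fixed `m.group(1)`, a custom regex must put the link in its first capture group. A named group removes that positional constraint but requires every custom regex to define the agreed group name.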

GerbenJavado commented 5 years ago

Wow! This is indeed a lot faster. Thanks for the insight into why the regex was slow. It makes a lot of sense, but I didn't realise it at the beginning. I can also confirm it works, so I will merge.