Open tpeng opened 7 years ago
Hey, I'm facing up the same issue. I have the explanation (but no solution to give sorry). It's mainly due to the tree structure used in the algo and the complexity of your URLs.
So, the number of tree-nodes created is 2^n-1
, and the number of leafs is 2^n
. Where n
is the number of parts in the path (words separated by /
) and the query (words separated by &
or =
or ?
) of the given URL.
On my computer, When n
is 20
, it takes around 10 seconds and ~2+ GB of memory to build the tree:
import time
from urlclustering.urltree import URLTreeNode
from urlclustering.parsedurl import ParsedURL
for i in range(15, 20):
t = time.time()
url = "http://domain.com/x?" + "&".join([str(x) for x in range(i)])
parsed_url = ParsedURL(url)
tree = URLTreeNode()
tree.add_url(parsed_url)
print("{} has {} parts".format(url, i + 1, time.time() - t))
print("Tree created in {:.2f} seconds, with {} leafs".format(time.time() - t, len(tree.leafs())))
print("-" * 32)
Output:
http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14 has 16 parts
Tree created in 0.53 seconds, with 65536 leafs
--------------------------------
http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14&15 has 17 parts
Tree created in 1.15 seconds, with 131072 leafs
--------------------------------
http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14&15&16 has 18 parts
Tree created in 2.52 seconds, with 262144 leafs
--------------------------------
http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14&15&16&17 has 19 parts
Tree created in 5.11 seconds, with 524288 leafs
--------------------------------
http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14&15&16&17&18 has 20 parts
Tree created in 10.68 seconds, with 1048576 leafs
--------------------------------
For my part, I'll probably ingore that kind of complex URLs.
Same issue here. With a list of around 500 URLs, both CPU and memory start climbing and in our case, the Celery process gets killed after less than a minute.
same issue here with 124 urls
hey, we found following code will make urlclustering use lots of memory and ends up got killed by system. could you take a look? thanks