daremon / urlclustering

Package to facilitate URL clustering
MIT License
68 stars 27 forks source link

memory issue for some urls #9

Open tpeng opened 7 years ago

tpeng commented 7 years ago

hey, we found following code will make urlclustering use lots of memory and ends up got killed by system. could you take a look? thanks


import urlclustering

lines="""
http://www.digitecno.it/webconference.aspx
http://www.digitecno.it/pec.aspx
http://www.digitecno.it/rss.ashx
http://www.digitecno.it/formainforma/formainforma.aspx?Id=230&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/formainforma/formainforma.aspx?Id=208&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/formainforma/formainforma.aspx?Id=231&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/
http://www.digitecno.it/mappasito.aspx
http://www.digitecno.it/login.aspx?ReturnUrl=%2friservata%2fdefault.aspx
http://www.digitecno.it/default.aspx
http://www.digitecno.it/azienda.aspx
http://www.digitecno.it/softwareservizi.aspx
http://www.digitecno.it/educational.aspx
http://www.digitecno.it/assistenza.aspx
http://www.digitecno.it/utility/default.aspx
http://www.digitecno.it/formainforma/formainforma.aspx?Id=156&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/formainforma/formainforma.aspx?Id=157&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/formainforma/formainforma.aspx?Id=158&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/formainforma/formainforma.aspx?Id=153&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/formainforma/formainforma.aspx?Id=228&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/offerte/offerta.aspx?Id=83&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/offerte/listaofferte.aspx?a=&archi=&cad=&cont=&crm=&da=&des=&edu=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=
http://www.digitecno.it/politicaprivacy/
http://www.digitecno.it/formainforma/listaformainforma.aspx?a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=0
http://www.digitecno.it/formainforma/autodesk.aspx
http://www.digitecno.it/downloadgadget.aspx
http://www.digitecno.it/formainforma/formainforma.aspx?Id=187&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/formainforma/formainforma.aspx?Id=223&a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
http://www.digitecno.it/formainforma/listaformainforma.aspx?a=&archi=&cad=&cont=&crm=&da=&des=&gest=&gis=&infra=&ing=&mec=&sw=&termo=&testo=&tipo=
"""
c = urlclustering.cluster(lines.splitlines())
benjastudio commented 5 years ago

Hey, I'm facing up the same issue. I have the explanation (but no solution to give sorry). It's mainly due to the tree structure used in the algo and the complexity of your URLs.

So, the number of tree-nodes created is 2^n-1, and the number of leafs is 2^n. Where n is the number of parts in the path (words separated by /) and the query (words separated by & or = or ?) of the given URL.

On my computer, When n is 20, it takes around 10 seconds and ~2+ GB of memory to build the tree:

import time
from urlclustering.urltree import URLTreeNode
from urlclustering.parsedurl import ParsedURL

for i in range(15, 20):
    t = time.time()
    url = "http://domain.com/x?" + "&".join([str(x) for x in range(i)])
    parsed_url = ParsedURL(url)
    tree = URLTreeNode()
    tree.add_url(parsed_url)
    print("{} has {} parts".format(url, i + 1, time.time() - t))
    print("Tree created in {:.2f} seconds, with {} leafs".format(time.time() - t, len(tree.leafs())))
    print("-" * 32)

Output:

http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14 has 16 parts
Tree created in 0.53 seconds, with 65536 leafs
--------------------------------
http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14&15 has 17 parts
Tree created in 1.15 seconds, with 131072 leafs
--------------------------------
http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14&15&16 has 18 parts
Tree created in 2.52 seconds, with 262144 leafs
--------------------------------
http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14&15&16&17 has 19 parts
Tree created in 5.11 seconds, with 524288 leafs
--------------------------------
http://domain.com/x?0&1&2&3&4&5&6&7&8&9&10&11&12&13&14&15&16&17&18 has 20 parts
Tree created in 10.68 seconds, with 1048576 leafs
--------------------------------

For my part, I'll probably ingore that kind of complex URLs.

rafa-munoz commented 3 years ago

Same issue here. With a list of around 500 URLs, both CPU and memory start climbing and in our case, the Celery process gets killed after less than a minute.

ifduyue commented 1 year ago

same issue here with 124 urls