TeamHG-Memex / page-compare

Simple heuristic for measuring web page similarity (& data set)

Feature request. TED algorithm #1

Open Sab0tag3d opened 4 years ago

Sab0tag3d commented 4 years ago

First of all, thanks for the amazing tool!

In the research of Thamme Gowda and Chris Mattmann, they use Zhang-Shasha's tree edit distance (TED) algorithm for comparing HTML DOM trees. I've found a Python library that implements this algorithm: https://zhang-shasha.readthedocs.io/en/latest/#tree-format-and-usage

I think it could be more accurate than SequenceMatcher from difflib, but I ran into the problem of building the nodes from HTML.

Do you have any thoughts or ideas about how to create a node from HTML? Do you think it could be helpful for comparison?

mehaase commented 4 years ago

Thanks for the compliment! When I first wrote this I wanted to use a tree edit distance but I was short on time and I couldn't find anything that was already implemented or would be easy to implement from scratch. I'm not familiar with Zhang-Shasha but after a quick skim it looks like an interesting choice for this project.

Unfortunately I don't have time to tackle this for at least a month or two, but the integration should be feasible. The main issue is that the current code uses the Element.iter() API, which produces a flat sequence rather than a tree. To construct a tree, you'll need to do something like this (untested code, just a sketch of the idea):

from zss import Node, simple_distance
import lxml.html

html1 = lxml.html.parse('file1.html')
html2 = lxml.html.parse('file2.html')

def tree_from_el(el):
    # Recursively convert an lxml element into a zss Node,
    # labeling each node with its tag name.
    node = Node(el.tag)
    for child in el:
        node.addkid(tree_from_el(child))
    return node

tree1 = tree_from_el(html1.getroot())
tree2 = tree_from_el(html2.getroot())
print(simple_distance(tree1, tree2))

If you get this working, let me know! I'm definitely keen to see the results.

I also skimmed the Gowda-Mattmann paper you referenced (this one, right?) and I noticed this caveat:

The tree edit distance measure behaves precisely for a vast majority of documents. However the results have shown that the tree edit distance measure alone is not sufficient to accurately determine the similarity of web pages. For instance, web pages with uneven number of repeated groups tends to possess higher tree edit distance even though have same DOM structure at the higher level. Some of the examples of repeated groups are user comments to blog posts and listing of results in search pages. Even though two blog posts are similar in DOM structure, the unequal number of comments causes higher tree edit distance and thus lower structural similarity. However all the repeated groups possess similar styles, so the next section describes our proposed stylistic similarity measure to address these situations.

This problem they describe is also a problem with the current, sequence-based implementation, so it's not a deal-breaker. But the next section describes a similarity measure based on CSS styles, and then later they combine the two measures in order to improve the overall results. So those would be interesting extensions to apply to this library.
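To make the combination idea concrete: the paper's stylistic measure is based on the CSS styles a page uses, so a crude stand-in is the Jaccard similarity of the sets of class names, blended with the structural score by a weighted average. This is only a sketch of the general idea, not the paper's exact formulation; the Jaccard choice and the weights are my assumptions, and it uses only the standard library so the HTML snippets must be well-formed XML:

```python
import xml.etree.ElementTree as ET

def class_set(html):
    """Collect every CSS class name used anywhere in the document."""
    root = ET.fromstring(html)
    classes = set()
    for el in root.iter():
        classes.update(el.get('class', '').split())
    return classes

def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def combined_similarity(structural, stylistic, w_struct=0.5):
    # Weighted average of the two measures (the weight is a guess;
    # the paper tunes its own combination).
    return w_struct * structural + (1 - w_struct) * stylistic

# Two blog-post-like pages with different numbers of comments but
# identical sets of styles:
page1 = '<div class="post"><p class="body">hi</p><p class="comment">a</p></div>'
page2 = ('<div class="post"><p class="body">yo</p>'
         '<p class="comment">b</p><p class="comment">c</p></div>')

style_sim = jaccard(class_set(page1), class_set(page2))
print(style_sim)  # identical class sets -> 1.0
```

The stylistic score stays at 1.0 here even though the comment counts differ, which is exactly the case where raw tree edit distance over-penalizes.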

I also wonder if there is some modification to Zhang-Shasha that would collapse repeated subtrees? Some researchers at Scrapinghub have published something related to this, but I can't find a citation at the moment.
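One way to approximate that without modifying Zhang-Shasha itself would be to collapse runs of structurally identical sibling subtrees before computing the distance, so that ten comments and two comments both reduce to a single representative subtree. Here is a rough standard-library sketch of that idea; the signature function and the collapsing rule are my own guesses, not taken from any published method:

```python
import xml.etree.ElementTree as ET

def signature(el):
    """Structural fingerprint: the tag plus the signatures of the children.
    Attributes and text are deliberately ignored."""
    return (el.tag, tuple(signature(c) for c in el))

def collapse_repeats(el):
    """Return a copy of el keeping only the first of each run of
    structurally identical siblings, recursively."""
    kept = []
    prev_sig = None
    for child in el:
        sig = signature(child)
        if sig != prev_sig:
            kept.append(child)
            prev_sig = sig
    out = ET.Element(el.tag, el.attrib)
    for child in kept:
        out.append(collapse_repeats(child))
    return out

# A blog-post-like fragment: one heading followed by a repeated group.
blog = ET.fromstring('<div><h1/><p class="c"/><p class="c"/><p class="c"/></div>')
collapsed = collapse_repeats(blog)
print([c.tag for c in collapsed])  # ['h1', 'p'] -- the repeated run collapsed
```

Feeding the collapsed trees into tree_from_el and simple_distance should then make pages with unequal numbers of repeated groups look more alike, at the cost of ignoring how many repeats there were.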

Good luck!

Sab0tag3d commented 4 years ago

Thanks a lot for your detailed answer! I'll try your code and will report the results.

I've read that caveat and just want to test it; maybe it could be helpful as a third form of comparison. I'm also not sure about the speed of this algorithm, so I need to run more tests.

In that research they also wrote about JavaScript similarity, but without detailed consideration. Do you think this could be helpful?

Did you mean this tool? https://github.com/scrapinghub/mdr

Sab0tag3d commented 4 years ago
from zss import Node, simple_distance
import lxml.html

html1 = lxml.html.parse('file1.html')
html2 = lxml.html.parse('file2.html')

def tree_from_el(el):
    # Recursively convert an lxml element into a zss Node,
    # labeling each node with its tag name.
    node = Node(el.tag)
    for child in el:
        node.addkid(tree_from_el(child))
    return node

tree1 = tree_from_el(html1.getroot())
tree2 = tree_from_el(html2.getroot())
print(simple_distance(tree1, tree2))

Your script definitely works! It creates the nodes and compares them. And yes, it takes a lot of time for every comparison (2-10 sec). Now I need to figure out how to interpret the numbers the program produces in order to assess the similarity of the pages. For instance, it shows me "756.0" (like in The Hitchhiker's Guide to the Galaxy).
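On interpreting that number: simple_distance returns an absolute count of edit operations, so it grows with page size and 756.0 means nothing on its own. One common way to turn it into a bounded score (this normalization is my own suggestion, not part of zss) is to divide by the combined node count, since with unit costs the distance can never exceed the total number of nodes in both trees. A standard-library sketch, using xml.etree in place of lxml just to keep it self-contained:

```python
import xml.etree.ElementTree as ET

def node_count(el):
    """Total number of elements in the subtree rooted at el."""
    return 1 + sum(node_count(c) for c in el)

def similarity(distance, tree1, tree2):
    # 1.0 means identical trees; 0.0 means every node in both trees
    # had to be deleted or inserted. Dividing by the combined size
    # keeps the score in [0, 1] for unit edit costs.
    return 1.0 - distance / (node_count(tree1) + node_count(tree2))

page = ET.fromstring('<html><body><p/><p/></body></html>')
print(node_count(page))  # 4 elements: html, body, p, p
```

So a distance of 756.0 between two pages of, say, a few thousand nodes each would indicate they are still mostly similar, while the same distance between two 400-node pages would mean they share almost nothing structurally.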