fabiofumarola / HyLiEn

Implementation of the paper: Fabio Fumarola, Tim Weninger, Rick Barber, Donato Malerba, Jiawei Han: HyLiEn: a hybrid approach to general list extraction on the web. WWW (Companion Volume) 2011: 35-36
Apache License 2.0

Second iteration on HyLien #7

Open fabiofumarola opened 8 years ago

fabiofumarola commented 8 years ago

PROPOSALS:

MDR reasoning

  1. Nodes can be aligned even when they are not children of the same node (look-ahead). [Implemented in the method LookAhead.aligned]
  2. It does not consider whether the records share some tokens. Shared tokens are typical of records for products or other deep-web databases, so this can be useful for alignment.
  3. If, for each list element, all of its children are similar, invalidate the current list and declare its children as lists. (This can be dangerous; it is probably better to adopt the DEPTA strategy, which uses styles to split false-positive lists.)
  4. If the parent is a <p> tag, the list may be a false positive. [Implemented]
  5. Filter lists that have different orientations but equal position and size, keeping the list with more elements.
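
A minimal sketch of proposals 2 and 5, not the HyLiEn implementation. The `CandidateList` type, its fields, and both function names are hypothetical, and whitespace tokenization with Jaccard overlap is just one assumed way to measure shared tokens.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CandidateList:
    """Hypothetical candidate list: bounding box, orientation, element texts."""
    x: int
    y: int
    width: int
    height: int
    orientation: str            # "horizontal" or "vertical"
    elements: Tuple[str, ...]


def token_overlap(a: str, b: str) -> float:
    """Proposal 2 (assumed metric): Jaccard overlap of lower-cased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def keep_largest_per_box(candidates: List[CandidateList]) -> List[CandidateList]:
    """Proposal 5: among candidates with equal position and size,
    keep only the one with the most elements."""
    best: Dict[Tuple[int, int, int, int], CandidateList] = {}
    for c in candidates:
        key = (c.x, c.y, c.width, c.height)
        if key not in best or len(c.elements) > len(best[key].elements):
            best[key] = c
    return list(best.values())
```

With this sketch, a horizontal and a vertical candidate covering the same box collapse to whichever has more elements, while candidates at other positions are left untouched.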

fabiofumarola commented 8 years ago

VIDE Reasoning:

  1. It uses horizontal or vertical rules, boxes, colored panels, special fonts, or background images to decide whether candidate elements are records. [implemented]
  2. The DOM tree of a page is built and enriched with visual features: position, background colour, foreground colour, font information [implemented]
  3. Applies 12 rules to detect subregions
  4. Data records are very similar in their appearances:

APPEARANCE FEATURES:

CONTENT FEATURES:

Algorithm:

  1. Data records are extracted using Layout and Appearance features.
  2. Phase 1: filter noise blocks
  3. Phase 2: cluster the remaining blocks using a similarity function based on image size, plain text font and link text font.
  4. Phase 3: regroup the data records in each cluster based on their appearance order (from left to right, from top to bottom)
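
Phases 2 and 3 could be sketched roughly as follows. This is an illustration under stated assumptions, not the published algorithm: the `Block` type, the equally weighted similarity, the 0.8 threshold, and the greedy clustering are all hypothetical choices, and the Phase 1 noise filter is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical rendered block carrying the three features named above."""
    x: int
    y: int
    image_area: int     # total image area in the block, in pixels
    text_font: str      # font of the plain text, e.g. "Arial-12"
    link_font: str      # font of the link text


def similarity(a: Block, b: Block) -> float:
    """Mean of three per-feature scores in [0, 1]; equal weights are assumed."""
    largest = max(a.image_area, b.image_area)
    image_score = 1.0 if largest == 0 else min(a.image_area, b.image_area) / largest
    text_score = 1.0 if a.text_font == b.text_font else 0.0
    link_score = 1.0 if a.link_font == b.link_font else 0.0
    return (image_score + text_score + link_score) / 3


def cluster_blocks(blocks: List[Block], threshold: float = 0.8) -> List[List[Block]]:
    """Phase 2 (sketch): put each block in the first cluster whose
    first member is similar enough, else start a new cluster."""
    clusters: List[List[Block]] = []
    for block in blocks:
        for cluster in clusters:
            if similarity(cluster[0], block) >= threshold:
                cluster.append(block)
                break
        else:
            clusters.append([block])
    return clusters


def order_records(cluster: List[Block]) -> List[Block]:
    """Phase 3 (sketch): reading order, rows top to bottom, then left to right."""
    return sorted(cluster, key=lambda b: (b.y, b.x))
```

Comparing each block only against the first member of a cluster keeps the sketch short; a real implementation would more likely compare against a cluster centroid or all members.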