Implementation of the paper: Fabio Fumarola, Tim Weninger, Rick Barber, Donato Malerba, Jiawei Han: HyLiEn: a hybrid approach to general list extraction on the web. WWW (Companion Volume) 2011: 35-36
Nodes can be visually aligned without being children of the same node (look-ahead). [Implemented in the method LookAhead.aligned]
It does not consider whether the records share some tokens. Shared tokens are common in records for products or other deep web databases, and they can be useful for alignment.
If, for each list element, all of its children are similar, it invalidates the current list and declares the children as lists. (This can be dangerous; it is probably better to adopt the DEPTA strategy, which uses styles to split false-positive lists.)
If the parent is a <p> tag, the list may be a false positive. [Implemented]
Filter lists that have different orientations but the same position and size, keeping the list with more elements.
It uses horizontal or vertical rules, boxes, colored panels, special fonts, or background images to decide whether candidate elements are records. [Implemented]
The DOM tree of a page is built and enriched with visual features: position, background colour, foreground colour, and font information. [Implemented]
Applies 12 rules to detect subregions.
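One of the notes above suggests filtering lists that have different orientations but the same position and size, keeping the one with more elements. The sketch below shows one possible way to do that. The CandidateList type, its field names, and the function filter_overlapping are hypothetical and are not taken from the HyLiEn code; exact equality of the bounding box is also an assumption, and a real implementation might allow a small tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateList:
    # Hypothetical container for a candidate list and its rendered box.
    x: int
    y: int
    width: int
    height: int
    orientation: str  # "horizontal" or "vertical"
    items: tuple


def filter_overlapping(candidates):
    """Keep one candidate per bounding box: the one with the most items."""
    best = {}
    for cand in candidates:
        # Assumption: two lists "overlap" when their boxes are identical.
        key = (cand.x, cand.y, cand.width, cand.height)
        current = best.get(key)
        if current is None or len(cand.items) > len(current.items):
            best[key] = cand
    return list(best.values())
```

On ties the first candidate seen is kept, which is an arbitrary choice made here for simplicity.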
Data records are very similar in their appearances:
APPEARANCE FEATURES:
images have a similar size,
aligned text uses the same fonts
(interesting for tiled layouts): data items of different data records have a similar presentation with respect to position, size (of images and data items), font (for text data items), and list membership
CONTENT FEATURES:
the presentation of data items in data records follows a fixed order. This is related to tags and font rules
data records often contain fixed static text that does not come from the hidden web database but is added to the presentation to label semantic information.
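The note on fixed static text can be illustrated with a small sketch that finds tokens shared by every record. This is only an approximation under stated assumptions: records are plain strings, tokens are split on whitespace, and a token counts as static only if it appears in all records. The function name static_tokens is hypothetical.

```python
def static_tokens(records):
    """Return tokens that occur in every record, in first-record order."""
    if not records:
        return []
    token_lists = [record.split() for record in records]
    # Intersect the token sets of all records.
    common = set(token_lists[0])
    for tokens in token_lists[1:]:
        common &= set(tokens)
    # Preserve the order of the first record and drop repeats.
    seen = set()
    result = []
    for token in token_lists[0]:
        if token in common and token not in seen:
            seen.add(token)
            result.append(token)
    return result
```

For example, for the records "Price: $10 Author: Smith" and "Price: $25 Author: Jones", the shared tokens are "Price:" and "Author:", which act as labels rather than database values.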
Algorithm:
Data records are extracted using Layout and Appearance features.
Phase 1: filter noise blocks
Phase 2: cluster the remaining blocks using a similarity function based on image size, plain text font, and link text font.
Phase 3: regroup the data records in each cluster based on their appearance order (from left to right, from top to bottom).
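The three phases above can be sketched as follows. This is a minimal illustration, not the method of any particular paper or of this implementation. The Block type, the noise test, the equal weighting of the three similarity components, the greedy clustering, and the 0.8 threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Block:
    # Hypothetical rendered block with the features named in Phase 2.
    x: int
    y: int
    image_size: Optional[Tuple[int, int]]  # (width, height), or None
    text_font: Optional[str]
    link_font: Optional[str]


def is_noise(block):
    # Assumption: a block with no image, text, or link carries no record data.
    return (block.image_size is None and block.text_font is None
            and block.link_font is None)


def image_similarity(a, b):
    # Ratio of the smaller image area to the larger one, in [0, 1].
    if a is None and b is None:
        return 1.0
    if a is None or b is None:
        return 0.0
    area_a, area_b = a[0] * a[1], b[0] * b[1]
    if max(area_a, area_b) == 0:
        return 1.0
    return min(area_a, area_b) / max(area_a, area_b)


def similarity(a, b):
    # Assumption: the three features are weighted equally.
    parts = [
        image_similarity(a.image_size, b.image_size),
        1.0 if a.text_font == b.text_font else 0.0,
        1.0 if a.link_font == b.link_font else 0.0,
    ]
    return sum(parts) / len(parts)


def extract_records(blocks, threshold=0.8):
    # Phase 1: filter noise blocks.
    kept = [b for b in blocks if not is_noise(b)]
    # Phase 2: greedy clustering against the first member of each cluster.
    clusters = []
    for block in kept:
        for cluster in clusters:
            if similarity(block, cluster[0]) >= threshold:
                cluster.append(block)
                break
        else:
            clusters.append([block])
    # Phase 3: order each cluster top to bottom, then left to right.
    return [sorted(c, key=lambda b: (b.y, b.x)) for c in clusters]
```

Sorting by (y, x) assumes that blocks in the same visual row share the same y coordinate; with real layouts, rows would probably need to be bucketed with some tolerance first.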
PROPOSALS:
MDR reasoning