smigad closed this 4 years ago
Well, I think I see the fundamental concern with the suggested "title assignment" approach. Look at https://www.nytimes.com/: it has a few dozen articles, each of which may be the source of a news item. In such a case, neither the title, nor og:title, nor h1 may work, so you need all possible candidates, from the HTML title down to the "text" excerpt. What is suggested in #15 is to use the "spatially closest title candidate preceding the text position", as is done with images (although images are looked up for the closest one both before and after the text). Levenshtein distance may (or may not) be usable as an alternative to spatial distance, selecting the most similar title candidate preceding the text body by spatial index instead of the spatially closest one, but: A) you should measure the difference between the "text" and the "title candidate"; B) should similarity be based on words instead of letters?
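To make the two selection strategies and the words-versus-letters question concrete, here is a minimal Python sketch. It is not the project's code: the `(position, title)` candidate tuples, the function names, and the normalization `1 - distance / longer length` are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Candidate = Tuple[int, str]  # hypothetical: (position in the page, candidate title)


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance over any sequences: characters of a str or a list of words."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (item_a != item_b)))  # substitution
        previous = current
    return previous[-1]


def similarity(a: Sequence, b: Sequence) -> float:
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 0.0


def closest_preceding(candidates: List[Candidate], text_pos: int) -> Optional[str]:
    """Spatial strategy: the candidate with the largest position before the text."""
    preceding = [c for c in candidates if c[0] < text_pos]
    return max(preceding, key=lambda c: c[0])[1] if preceding else None


def most_similar_preceding(candidates: List[Candidate], text_pos: int, text: str,
                           by_words: bool = True) -> Optional[str]:
    """Similarity strategy: the preceding candidate most similar to the text itself."""
    unit = (lambda s: s.lower().split()) if by_words else (lambda s: s.lower())
    preceding = [c for c in candidates if c[0] < text_pos]
    if not preceding:
        return None
    return max(preceding, key=lambda c: similarity(unit(c[1]), unit(text)))[1]


if __name__ == "__main__":
    candidates = [(10, "World news"), (120, "Rain floods city centre"), (140, "Subscribe now")]
    text = "Rain floods the city centre after a night of storms"
    print(closest_preceding(candidates, 150))             # nearest candidate before the text
    print(most_similar_preceding(candidates, 150, text))  # most similar candidate before the text
```

In this made-up example the spatially closest candidate is an unrelated banner, while the word-based comparison against the text picks the real headline. A single replaced word costs one edit at word level but several at letter level, so the two granularities can rank candidates differently; which one suits real pages is exactly the open question in B).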
The approach is to obtain `<title>`, all `<h1>` tags, and the `<meta>` tag with property `og:title`.

Priority is given to an `h1` as long as it has at least 75% similarity with either `<title>` or `og:title`. If not, `og:title` is used if it has at least 75% similarity with `<title>`, because respectable sites put proper metadata about the page. The string similarity metric is Levenshtein distance, implemented in `util.Str`.

If everything above fails and an empty title is obtained, the final text is obtained using the previous method.
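The priority rule above can be sketched in Python as follows. This is an illustration only, not the `util.Str` implementation: the class and function names are hypothetical, the similarity normalization (`1 - distance / longer length`) is an assumption, and an empty return value stands in for "fall back to the previous method".

```python
from html.parser import HTMLParser


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization; an empty string is never considered similar."""
    if not a or not b:
        return 0.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


class TitleCandidates(HTMLParser):
    """Collects <title>, every <h1>, and <meta property="og:title">."""

    def __init__(self):
        super().__init__()
        self.title, self.og_title, self.h1s = "", "", []
        self._open, self._buffer = None, []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = (attrs.get("content") or "").strip()
        elif tag in ("title", "h1"):
            self._open, self._buffer = tag, []

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if tag == "title":
                self.title = text
            elif text:
                self.h1s.append(text)
            self._open = None


def pick_title(html: str, threshold: float = 0.75) -> str:
    parser = TitleCandidates()
    parser.feed(html)
    parser.close()
    # 1. The first h1 that is close enough to <title> or og:title wins.
    for h1 in parser.h1s:
        if max(similarity(h1, parser.title), similarity(h1, parser.og_title)) >= threshold:
            return h1
    # 2. Otherwise og:title, if it agrees with <title>.
    if similarity(parser.og_title, parser.title) >= threshold:
        return parser.og_title
    # 3. Empty result: the caller falls back to the previous method.
    return ""
```

For example, on a page whose `<title>` is "Rain floods city centre - Example News", whose `og:title` is "Rain floods city centre", and whose first `h1` is a site banner, the banner is skipped and the second `h1`, which matches `og:title`, is returned.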
@akolonin please take a deeper look since I still do not have a good understanding of the entire codebase.
If this is acceptable, #15 will be the next task.