aigents / aigents-java

Aigents Java Core Platform
MIT License

extract title using html tags and metadata #18

Closed smigad closed 4 years ago

smigad commented 4 years ago

The approach is to collect the <title>, all <h1> tags, and the <meta> tag with property og:title. Priority is given to an <h1>, as long as it is at least 75% similar to either the <title> or og:title. If no <h1> qualifies, og:title is used, provided it is at least 75% similar to the <title>, because respectable sites put proper metadata about the page. The string similarity metric is Levenshtein distance, implemented in util.Str.

If all of the above fails and an empty title is obtained, the final text is obtained using the previous method.
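The selection order described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class `TitleSelector`, its method names, the inline Levenshtein implementation (standing in for whatever util.Str provides), and the normalization of distance into a 0..1 similarity are all assumptions made for the example.

```java
import java.util.List;

// Hypothetical helper, not part of aigents-java.
final class TitleSelector {
    static final double THRESHOLD = 0.75; // the "75% similarity" from the description

    // Plain character-based Levenshtein edit distance (two-row dynamic programming).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1 - distance / length of the longer string.
    static double similarity(String a, String b) {
        if (a == null || b == null) return 0.0;
        String x = a.trim().toLowerCase();
        String y = b.trim().toLowerCase();
        int max = Math.max(x.length(), y.length());
        if (max == 0) return 0.0; // two empty strings give no evidence
        return 1.0 - (double) levenshtein(x, y) / max;
    }

    // h1 first, then og:title, then whatever the previous method produced.
    static String selectTitle(String title, String ogTitle, List<String> h1s, String fallback) {
        for (String h1 : h1s) {
            if (h1 == null || h1.trim().isEmpty()) continue;
            if (similarity(h1, title) >= THRESHOLD || similarity(h1, ogTitle) >= THRESHOLD)
                return h1.trim();
        }
        if (ogTitle != null && !ogTitle.trim().isEmpty()
                && similarity(ogTitle, title) >= THRESHOLD)
            return ogTitle.trim();
        return fallback;
    }
}
```

Here `fallback` stands for the text produced by the previous method, so an empty or unmatched set of candidates leaves the old behaviour unchanged.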

@akolonin please take a deeper look since I still do not have a good understanding of the entire codebase.

if this is acceptable, #15 will be the next task

akolonin commented 4 years ago

Well, I think I see the fundamental concern with the suggested "title assignment" approach. Look at https://www.nytimes.com/ - it has a few tens of articles, where each article may be the source of a news item. In such a case, neither title, nor og:title, nor h1 may work, so you need all possible candidates from the html title down to the "text" excerpt. What is suggested in #15 is to use the "spatially closest title candidate preceding the text position", like it is done with images (while the images are looked up for the closest before and after). Levenshtein distance may (or may not) be used as an alternative to spatial distance, to select the most similar title candidate preceding the text body by the spatial index instead of the most spatially close one, but: A) you should measure the difference between the "text" and the "title candidate"; B) should similarity be based on words instead of letters?
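To make the two options concrete, here is a minimal sketch, assuming each title candidate is known together with its character offset in the page. The `TitleCandidates` class, the `Candidate` holder, and the word-overlap score are hypothetical and chosen only for illustration; word overlap is one possible word-based measure, not something the codebase is known to use.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of aigents-java.
final class TitleCandidates {
    // A title candidate and its (assumed) character offset in the page.
    static final class Candidate {
        final String text;
        final int position;
        Candidate(String text, int position) { this.text = text; this.position = position; }
    }

    // Option 1: spatially closest candidate that precedes the text position.
    static Candidate nearestPreceding(List<Candidate> candidates, int textPosition) {
        Candidate best = null;
        for (Candidate c : candidates)
            if (c.position < textPosition && (best == null || c.position > best.position))
                best = c;
        return best; // null when nothing precedes the text
    }

    // Lower-cased set of words, split on anything that is not a letter or digit.
    static Set<String> words(String s) {
        Set<String> out = new HashSet<>();
        for (String w : s.toLowerCase().split("[^\\p{L}\\p{N}]+"))
            if (!w.isEmpty()) out.add(w);
        return out;
    }

    // Word-based score: fraction of the candidate's words that occur in the text.
    static double wordOverlap(String candidate, String text) {
        Set<String> cw = words(candidate);
        if (cw.isEmpty()) return 0.0;
        Set<String> tw = words(text);
        int shared = 0;
        for (String w : cw) if (tw.contains(w)) shared++;
        return (double) shared / cw.size();
    }

    // Option 2: preceding candidate most similar to the text itself.
    static Candidate mostSimilarPreceding(List<Candidate> candidates, int textPosition, String text) {
        Candidate best = null;
        double bestScore = 0.0;
        for (Candidate c : candidates) {
            if (c.position >= textPosition) continue;
            double score = wordOverlap(c.text, text);
            if (score > bestScore) { best = c; bestScore = score; }
        }
        return best; // null when no preceding candidate shares a word with the text
    }
}
```

Note that both options compare against the text body (point A), and the overlap is computed on words rather than letters (point B). On a page with many articles the two can disagree, which is when the choice between them matters.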