aigents / aigents-java

Aigents Java Core Platform
MIT License

Provide images more relevant to texts and headers of the news items #35

Open akolonin opened 4 years ago

akolonin commented 4 years ago

Real Problem: Currently, the image supplied for a news item (along with its title, text, and source link) may or may not be relevant to the text and title. This is because the image is located by ContentLocator based on logic found in Matcher: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/self/Matcher.java#L170 That logic relies on the proximity of the image to the located text within the raw HTML source, not on spatial proximity as rendered by a browser, nor on semantic proximity from a human point of view.
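To illustrate how proximity in raw HTML can disagree with proximity in the text a reader actually sees, here is a minimal sketch. It is not the logic in Matcher; the class `ProximitySketch`, its methods, and the sample markup are all hypothetical, and the tag handling is deliberately naive (it assumes no `>` inside attribute values, no scripts or comments).

```java
public class ProximitySketch {
    // Hypothetical page fragment: a.png sits right above the headline visually,
    // but long attributes push it far away in the raw markup.
    static final String SAMPLE =
        "<img src=\"a.png\">"
      + "<div class=\"wrapper wrapper--wide wrapper--dark\" data-id=\"12345\">"
      + "<h2>Headline</h2></div>"
      + "<p>Some filler text here.</p>"
      + "<img src=\"b.png\">";

    /** Distance in characters between the text and the image source in raw HTML, or -1 if either is missing. */
    static int rawDistance(String html, String text, String imgSrc) {
        int t = html.indexOf(text);
        int i = html.indexOf(imgSrc);
        if (t < 0 || i < 0) return -1;
        return Math.abs(i - t);
    }

    /** Same distance, but measured in characters that lie outside of tags. */
    static int strippedDistance(String html, String text, String imgSrc) {
        int t = html.indexOf(text);
        int i = html.indexOf(imgSrc);
        if (t < 0 || i < 0) return -1;
        return Math.abs(visibleOffset(html, i) - visibleOffset(html, t));
    }

    /** Counts characters outside of tags that precede rawOffset (naive tag scan). */
    static int visibleOffset(String html, int rawOffset) {
        int visible = 0;
        boolean inTag = false;
        for (int k = 0; k < rawOffset; k++) {
            char c = html.charAt(k);
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) visible++;
        }
        return visible;
    }

    public static void main(String[] args) {
        for (String src : new String[] {"a.png", "b.png"}) {
            System.out.println(src
                + " raw=" + rawDistance(SAMPLE, "Headline", src)
                + " stripped=" + strippedDistance(SAMPLE, "Headline", src));
        }
    }
}
```

In this sample the raw-HTML distance favors `b.png`, while the stripped-text distance favors `a.png`, which is the image adjacent to the headline.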

We need to find a way to improve the current behavior, given that we can't have an HLAI exposed to pages rendered by a virtual browser, seeing the texts and images the same way humans do.

Possible Solutions:

  1. Evaluate image proximity relative to the title, if present; use the text only if the title is not present.
  2. Give precedence to larger images, so if there are two images that are close to the text (or title), use the larger one. Possibly use a combined metric of "applicability" of an image, where "applicability" = "size" / "distance", so closer and larger images appear more applicable - but this will need to load and analyze images, or at least image attributes (bearing in mind that attributes may be missing in HTML).
  3. Try to use proximity based on positions in the parsed/stripped text, instead of proximity based on positions in raw HTML.
  4. Disregard overly wide and overly tall images, i.e. ones where width > height * 2 or height > width * 2 - but this will need to load and analyze images, or at least image attributes (bearing in mind that attributes may be missing in HTML).
  5. Simulate the 2D layout computation algorithm employed by a web browser, taking the HTML and CSS specifications into account, so that every matched text and every image on a page is given 2D coordinates; then we can compute proximity based on visual distance. Make sure the distance is computed with regard to image boundaries and not image centers (otherwise smaller images may gain precedence).
  6. Consider relying on extra hints in the HTML structure, even though this is expected to be very unreliable, being obscured by CSS styling policies.
  7. TBD any other options that would come to mind...
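
Options 2 and 4 above can be sketched together as follows. This is only an illustration under stated assumptions, not existing Aigents code: the class `ImageApplicability`, the `Candidate` type, and the `DEFAULT_AREA` fallback are hypothetical, "size" is taken to mean pixel area, and "distance" is whatever text-to-image distance the locator already computes. Images with missing dimensions are kept and scored with the fallback area, since the attributes may be absent in HTML.

```java
import java.util.Arrays;
import java.util.List;

public class ImageApplicability {
    // Assumption: area used when width/height attributes are missing.
    static final double DEFAULT_AREA = 100.0 * 100.0;

    /** Hypothetical candidate image; width/height <= 0 means "unknown". */
    static final class Candidate {
        final String src;
        final int width, height, distance;
        Candidate(String src, int width, int height, int distance) {
            this.src = src;
            this.width = width;
            this.height = height;
            this.distance = distance;
        }
    }

    /** Option 4: too wide or too tall (banners, separators); unknown sizes are not rejected. */
    static boolean isBannerLike(Candidate c) {
        if (c.width <= 0 || c.height <= 0) return false;
        return c.width > c.height * 2 || c.height > c.width * 2;
    }

    /** Option 2: applicability = size / distance, with size taken as pixel area. */
    static double applicability(Candidate c) {
        double size = (c.width > 0 && c.height > 0)
            ? (double) c.width * c.height
            : DEFAULT_AREA;
        return size / Math.max(1, c.distance); // guard against zero distance
    }

    /** Returns the most applicable non-banner candidate, or null if none qualifies. */
    static Candidate best(List<Candidate> candidates) {
        Candidate best = null;
        double bestScore = -1;
        for (Candidate c : candidates) {
            if (isBannerLike(c)) continue;
            double score = applicability(c);
            if (score > bestScore) {
                best = c;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
            new Candidate("banner.png", 728, 90, 5),   // close, but banner-shaped
            new Candidate("icon.png", 50, 50, 10),     // close, but small
            new Candidate("photo.jpg", 400, 300, 40)); // farther, but much larger
        System.out.println(best(candidates).src);
    }
}
```

Whether area is the right notion of "size", and whether the ratio should be damped (for example, by using the square root of the area), would need to be tuned against real pages.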