aigents / aigents-java

Aigents Java Core Platform
MIT License

Add "title" to "text" in news items #15

Closed: akolonin closed this 4 years ago

akolonin commented 4 years ago

Wanted: have news items supplied with a "title" property, in addition to the currently existing "text", "sources", "times" and "image".

One way to solve this is to use the same trick as is done with images and links: provide another container to the HTML stripper so that it collects all the tags you have identified and keeps them with indexes to their original positions. Then, when the text is matched, it can look back for the closest title candidate.

Here is where the image indexing happens: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/cat/HtmlStripper.java#L202

Here is where it is used: https://github.com/aigents/aigents-java/blob/34a507feaf3846ec9a2d3b9e0f3afde9beaa813e/src/main/java/net/webstructor/self/Siter.java#L636

I guess one can just re-use the Imager class for the purpose. Then one just needs two hacks near the points that I have indicated:

  1. Index all "title", "h1", "h2", "h3" tags, plus maybe some others, collecting their interiors in a collector structure, the same way the image URLs are collected: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/cat/HtmlStripper.java#L202

  2. When the news item is created, look up the closest indexed title candidate occurring before it, as is done when attaching image URLs: https://github.com/aigents/aigents-java/blob/34a507feaf3846ec9a2d3b9e0f3afde9beaa813e/src/main/java/net/webstructor/self/Siter.java#L636

  3. Put the found title candidate into the "title" property of the news item.

  4. Optionally: if no title candidate is found, then instead of leaving no title or creating a blank "title", we MAY use an alternative strategy, such as placing the most salient/interesting words of the text into the title in the same order as they appear in the text.
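Steps 1 to 3 above can be sketched roughly as follows. This is only an illustration under assumptions: `TitleIndex`, `add` and `closestBefore` are hypothetical names, not the project's Imager API, and positions are assumed to be character offsets in the stripped text.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper illustrating the idea; this is not the project's Imager class. */
final class TitleIndex {
    // Maps a position in the stripped text to the inner text of a title-like tag seen there.
    private final TreeMap<Integer, String> candidates = new TreeMap<>();

    /** Step 1: record the interior of a title/h1/h2/h3 tag at the given text position. */
    void add(int position, String titleText) {
        if (titleText != null && !titleText.trim().isEmpty())
            candidates.put(position, titleText.trim());
    }

    /** Step 2: return the closest candidate at or before the position, or null if none precedes it. */
    String closestBefore(int position) {
        Map.Entry<Integer, String> entry = candidates.floorEntry(position);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        TitleIndex index = new TitleIndex();
        index.add(0, "Market news");
        index.add(120, "Bitcoin");
        index.add(480, "Dollar");
        // Step 3: a news item matched at offset 300 would get this value as its "title".
        System.out.println(index.closestBefore(300)); // prints "Bitcoin"
    }
}
```

A sorted map keeps the lookup logarithmic in the number of candidates; a `null` result is the case where the optional fallback of step 4 would apply.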

akolonin commented 4 years ago

@dagims - look at https://www.nytimes.com/ - it has a few tens of articles, each of which may be the source of a news item. In such a case, neither title, nor og:title, nor h1 may work, so you need all possible candidates located between the HTML title and the "text". What is suggested in #15 is to use the "spatially closest title candidate preceding the text position", as is done with images (although images are looked up for the closest one both before and after). Levenshtein distance may (or may not) be used as an alternative to spatial distance, selecting the most similar title candidate preceding the text body instead of the spatially closest one, but: A) you should measure the difference between the "text" and the "title candidate"; B) should similarity be based on words instead of letters?
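To make question B concrete, here is a minimal sketch of Levenshtein distance written over generic token lists, so the same routine can compare letters or words. The class and method names (`EditDistance`, `letters`, `words`) are hypothetical, and splitting words on whitespace is an assumption, not something the project prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EditDistance {
    /** Classic Levenshtein distance between two token sequences (two-row dynamic programming). */
    static <T> int distance(List<T> a, List<T> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) prev[j] = j; // distance from the empty prefix of a
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.size()];
    }

    /** Letter-level tokens. */
    static List<Character> letters(String s) {
        List<Character> out = new ArrayList<>();
        for (char c : s.toCharArray()) out.add(c);
        return out;
    }

    /** Word-level tokens: lower-cased and split on whitespace (an assumption). */
    static List<String> words(String s) {
        String trimmed = s.trim().toLowerCase();
        return trimmed.isEmpty() ? new ArrayList<String>() : Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(distance(letters("kitten"), letters("sitting")));          // prints 3
        System.out.println(distance(words("market news today"), words("market news"))); // prints 1
    }
}
```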

smigad commented 4 years ago

@akolonin respectable sites such as nytimes.com actually do have all of them. These sites set the og:title meta to the correct title of an article. The og:title metadata is the most reliable one because it is used by professional sites and CMSs. Because of CMSs, this applies to a significant number of websites out there. String similarity on letters is needed because of the specific kind of difference between the strings obtained from the tags mentioned above. That is, when looking at a certain page, say this one, the following is the title tag:

<title data-rh="true">As Job Losses Mount, Lawmakers Face a Make-or-Break Moment - The New York Times</title>

and the following is the meta tag:

<meta data-rh="true" property="og:title" content="As Job Losses Mount, Lawmakers Face a Make-or-Break Moment"/>

and finally the following is the header:

<h1 id="link-69abe703" class="css-1s4ffep e1h9rw200" itemProp="headline" data-test-id="headline">As Job Losses Mount, Lawmakers Face a Make-or-Break Moment</h1>

As you can see, all three contain the proper title, but the <title> tag contains something extra, which is the name of the website. It is the name of the site in this case and in similar ones, but sometimes it is the specific section of the site the article is in, or some generic text. Comparing string similarity on letters was needed only to get rid of that extra part. I'm sure word similarity can be used instead and would perform equally well, if not better, but I was trying to minimize complexity and use an application-specific solution. Going with word similarity will require parsing into words, which might require some sort of dictionary lookup or a similarly more complex algorithm than calculating edit distance.

On a separate note, I have a question about how and when topic matching is done. My question is which of the following two is correct? (if even one of them is a correct assumption :laughing:)

  1. Topics are matched while performing crawling. That is, while each web page is being read, it is searched for the topics the user trusts and if there is a match, the page data is stored for presenting to the user later on.
  2. First, all pages from trusted sites are crawled and the extracted text, links, images etc. are stored; then the matching follows, going through the stored extracted text and looking for the topics the user is interested in.

akolonin commented 4 years ago

@dagims 1) I am not saying "don't use og:title", but "use title candidates which may be spatially and semantically closer to the piece of content than og:title, title, or h1". 2) We work not with "respectable sites" but with "any HTML pages", which is a big difference. 3) We are matching not "articles" but "news items", where "bitcoin is rising because of..." and "dollar is sinking because of ..." are news items which may appear in the same article with the title "market news", and such a title may not be the most precise title candidate for these news items.

akolonin commented 4 years ago

The "matching" reality is much more complex than either "just 1" or "just 2", because multiple users have multiple trusted topics while the crawl process is shared between the users, and new topics may appear between the crawls. Logically, "just 1" is the right view, but: A) for each site, it considers all users trusting this site and collects topics from all of these users: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/self/Siter.java#L215, B) there is a page cache to prevent redundant re-reads of the same pages: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/self/Siter.java#L442
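Points A and B can be pictured with a small sketch. It is only an assumed model of the idea, not the Siter code: `SharedCrawl`, `topicsForSite` and `read` are hypothetical names, and users, sites and topics are plain strings here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Hypothetical model of a crawl shared between users; not the actual Siter implementation. */
final class SharedCrawl {
    private final Map<String, String> pageCache = new HashMap<>();
    int fetches = 0; // counts real page reads, to show the effect of the cache

    /** A) Union of the topics of every user who trusts the given site. */
    static Set<String> topicsForSite(String site,
                                     Map<String, Set<String>> sitesByUser,
                                     Map<String, Set<String>> topicsByUser) {
        Set<String> topics = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : sitesByUser.entrySet())
            if (e.getValue().contains(site))
                topics.addAll(topicsByUser.getOrDefault(e.getKey(), Collections.emptySet()));
        return topics;
    }

    /** B) Read a page through the cache so the same URL is fetched only once. */
    String read(String url, Function<String, String> fetcher) {
        return pageCache.computeIfAbsent(url, u -> { fetches++; return fetcher.apply(u); });
    }
}
```

With this shape, each page is read once per crawl and matched against the combined topic set, which corresponds to the "just 1" view extended to many users.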

akolonin commented 4 years ago

What still needs to be done in the PR:

A) Eliminate unused symbols: https://github.com/aigents/aigents-java/pull/18/files#diff-3b52764c5dc3c3c1b07f581c6b0ab39fR59

B) Take care of too LOOOOOOONG titles: https://github.com/aigents/aigents-java/pull/18/files#diff-c6a4981e6efdd02e8b67fb5cc85a19c9R549 So we need to strip the first sentence using a new function String Siter.shortTitle(String longtitle) { ... } which would (for the simplest implementation) use the symbols in AL.punctuation https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/al/AL.java#L142 to tokenize the longtitle and then take the first token (in other words, get the first string which neither starts nor ends with punctuation but may contain spaces). Or you may invent something smarter :-)

The other point - creation of such a "default stupidly intelligent title" should happen AFTER the attempt to create the title using your main logic based on HTML, because if we overload shortTitle in the future with AGI math, it will be too expensive to do that math in advance. In other words, here is what to do in Siter:

1) title = null;
2) try to get title from HTML titler structures
3) if (AL.empty(title)) title = siter.shortTitle(nl_text);
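A minimal sketch of that simplest variant, with the fallback order from the three steps, might look like this. The punctuation set below is an assumed stand-in for AL.punctuation, `chooseTitle` is a hypothetical wrapper, and the plain null/empty check stands in for AL.empty.

```java
import java.util.regex.Pattern;

final class ShortTitleSketch {
    // Assumed stand-in for AL.punctuation; the real symbol set lives in AL.java.
    private static final Pattern SPLITTER = Pattern.compile("[.,;:!?]+");

    /** Return the first non-empty chunk of text between punctuation marks, or null if there is none. */
    static String shortTitle(String longTitle) {
        if (longTitle == null) return null;
        for (String token : SPLITTER.split(longTitle)) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) return trimmed;
        }
        return null;
    }

    /** Prefer the HTML-derived title; fall back to shortTitle only when that lookup found nothing. */
    static String chooseTitle(String htmlTitle, String text) {
        String title = htmlTitle;                  // result of the HTML-based lookup, may be null
        if (title == null || title.trim().isEmpty())
            title = shortTitle(text);              // computed lazily, only when needed
        return title;
    }

    public static void main(String[] args) {
        System.out.println(chooseTitle(null, "Bitcoin is rising because of demand. Dollar is sinking."));
        // prints "Bitcoin is rising because of demand"
    }
}
```

Keeping the fallback inside the empty-title branch means a more expensive future shortTitle would run only for items whose HTML yielded no title.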

smigad commented 4 years ago

@akolonin I decided to modify the regex that checks for AL.punctuation rather than using ParseS because it's much more efficient.

akolonin commented 4 years ago

Completed in https://github.com/aigents/aigents-java/commit/a88d59731c67a8b21d0972d31f67414a037d7047 Many improvements may come later.