
Smart Web Page Analysis #36


akolonin commented 4 years ago

Goal

There is a need to refactor/extend the existing HTML stripper to make textual and semantic information extraction more reliable than it currently is in the legacy HtmlStripper: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/cat/HtmlStripper.java

Each of the following sub-tasks may be considered a separate issue and respective project.

Sub-tasks

  1. There is a need to extract schema.org embeddings in any possible representation (JSON-LD/microdata/RDFa)
  2. There is a need to extract structural information from the HTML markup
  3. There is a need to extract spatial HTML+CSS information from the loaded web page
  4. There is a need to extract DOM representation from web pages dynamically created by JavaScript/DHTML
  5. There is a need to extract semantic relationships from web pages, the same as would be encoded with 1 (above), but using NLP and text mining techniques supported by 2, 3 and 4 (above)

Sub-task details

1. There is a need to extract schema.org embeddings in any possible representation (JSON-LD/microdata/RDFa)

Many modern web pages contain a lot of semantic information, encoded according to the https://schema.org/ specification, that is not visible to the human eye of a web user. The parser could extract this information when loading the page and apply the monitoring/extraction policies to the explicit semantic graph data rather than to plain text.
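As a minimal sketch, assuming the jsoup library were added as a dependency (it is not referenced in this issue), the JSON-LD payloads could be pulled out of a page as below; microdata and RDFa would need additional passes over itemscope/itemprop and the RDFa attributes:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SchemaOrgExtractor {

    // Collect raw JSON-LD payloads embedded in the page markup;
    // each payload is JSON text to be parsed downstream into graph data.
    public static List<String> extractJsonLd(String html) {
        Document doc = Jsoup.parse(html);
        List<String> payloads = new ArrayList<>();
        for (Element script : doc.select("script[type=application/ld+json]")) {
            payloads.add(script.data());
        }
        return payloads;
    }
}
```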

2. There is a need to extract structural information from the HTML markup

The existing HTML stripper blindly removes HTML tags, replacing some of them with periods, which makes it possible to account for sentence and paragraph boundaries when doing the text pattern matching - in some cases. However, the use of HTML tags is site-specific and developer-specific, so this does not work reliably. For more precise identification of sentence boundaries, the hierarchical structure of an HTML document should be preserved in the stripped text, so that sentence/paragraph boundaries are detected based on the hierarchical structure of the text and not on the presence of particular tags.
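A sketch of the idea, again assuming jsoup: walk the DOM tree depth-first and emit a boundary wherever a block-level element closes, instead of replacing a fixed list of tags with periods:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class StructuralStripper {

    // Strip markup while deriving paragraph boundaries from the
    // document hierarchy rather than from particular tag names.
    public static String strip(String html) {
        StringBuilder out = new StringBuilder();
        walk(Jsoup.parse(html).body(), out);
        return out.toString().trim();
    }

    private static void walk(Node node, StringBuilder out) {
        for (Node child : node.childNodes()) {
            if (child instanceof TextNode) {
                out.append(((TextNode) child).text());
            } else if (child instanceof Element) {
                Element el = (Element) child;
                walk(el, out);
                if (el.isBlock()) {
                    out.append("\n"); // boundary from structure, not tag name
                }
            }
        }
    }
}
```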

3. There is a need to extract spatial HTML+CSS information from the loaded web page

In some cases the above may not be enough, because the relevance of particular pieces of text to images, links and even to each other may be based not on their proximity in the HTML text body, and not even on its hierarchical structure, but rather on the two-dimensional spatial proximity produced by the HTML+CSS markup as rendered by the browser (taking screen resolution and layout into account). That means the ideal Web Page Analyser would simulate a real web browser, computing pixel coordinates for every element and scraping the screen elements the same way a human eye would.
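One way to approximate this, as a sketch assuming Selenium WebDriver and a chromedriver binary are available (neither is a stated dependency here), is to render the page headlessly at a fixed window size and read the bounding box of each element of interest:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.Rectangle;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class SpatialExtractor {

    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        // Fixed layout gives reproducible pixel coordinates.
        options.addArguments("--headless", "--window-size=1280,1024");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com"); // placeholder URL
            // Rendered bounding boxes support reasoning about 2-D proximity
            // of text, links and images, as a human eye would perceive them.
            for (WebElement el : driver.findElements(By.cssSelector("p, a, img"))) {
                Rectangle r = el.getRect();
                String text = el.getText();
                System.out.printf("%s at (%d,%d) %dx%d: %s%n",
                        el.getTagName(), r.getX(), r.getY(), r.getWidth(), r.getHeight(),
                        text.length() > 40 ? text.substring(0, 40) : text);
            }
        } finally {
            driver.quit();
        }
    }
}
```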

4. There is a need to extract DOM representation from web pages dynamically created by JavaScript/DHTML

All of the above may not work in the case of web pages generated by DHTML (such as https://aigents.com/ for instance), so there is a need to simulate a browser executing the complete suite of WWW technologies, including CSS and JavaScript, like it is done by Selenium WebDriver and WebKit - the simplest example of how it could be done is provided by https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/util/WebKiter.java
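With the Selenium WebDriver route, a minimal sketch (assuming Selenium 4 and a chromedriver binary on the path) serializes the DOM only after the page's scripts have run, unlike a raw HTTP fetch:

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.time.Duration;

public class DynamicDomLoader {

    // Returns the DOM serialized after JavaScript execution,
    // so DHTML-generated content is included in the result.
    public static String loadRenderedHtml(String url) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.manage().timeouts().implicitlyWait(Duration.ofSeconds(10));
            driver.get(url);
            return driver.getPageSource(); // reflects the JavaScript-built DOM
        } finally {
            driver.quit();
        }
    }
}
```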

5. There is a need to extract semantic relationships from web pages, the same as would be encoded with 1 (above), but using NLP and text mining techniques supported by 2, 3 and 4 (above)

Once we can extract semantic relationships from the raw web page according to 1 (above), the entire process of Aigents web monitoring may be changed so that the framework expects a web page to be stripped down not to plain text (as the HtmlStripper currently does) but to a subgraph of semantic relationships (as the Matcher is expected to do), involving all of the techniques 2, 3 and 4 (above). In that case, we would end up with a design where semantic parsing is applied to every web page first, and subgraph monitoring and extraction are then applied to the resulting graph.
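Purely as a hypothetical design sketch (Triple, ANY and match are illustrative names here, not the actual Matcher API), the parsed page could be reduced to subject-predicate-object triples, with the monitoring policy expressed as pattern triples containing wildcards:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class SemanticGraph {

    public static final String ANY = "*"; // wildcard slot in pattern triples

    public static class Triple {
        final String subject, predicate, object;
        Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }

        // A page triple matches a pattern triple if every slot is
        // either equal or wildcarded in the pattern.
        boolean matches(Triple pattern) {
            return slot(pattern.subject, subject)
                && slot(pattern.predicate, predicate)
                && slot(pattern.object, object);
        }
        private static boolean slot(String pattern, String value) {
            return ANY.equals(pattern) || Objects.equals(pattern, value);
        }
        @Override public String toString() { return subject + " " + predicate + " " + object; }
    }

    // Naive subgraph monitor: return every page triple satisfying any pattern.
    public static List<Triple> match(List<Triple> page, List<Triple> patterns) {
        List<Triple> hits = new ArrayList<>();
        for (Triple t : page)
            for (Triple p : patterns)
                if (t.matches(p)) { hits.add(t); break; }
        return hits;
    }
}
```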