aigents / aigents-java

Aigents Java Core Platform
MIT License

Structured rich text stripping and matching #16

Open akolonin opened 4 years ago

akolonin commented 4 years ago

To better determine the boundaries of matched text spots, rich text from HTML and PDF (and, in the future, DOC, ODT, etc.) should not be stripped to plain text, as HtmlStripper.convert does now, but to an intermediate hierarchical representation that preserves the structure of the text as well as links, images and titles (a kind of internal unified DOM representation).

Actions:
1) Add a StructuredText class.
2) Change HtmlStripper.convert to HtmlStripper.convertToStructuredText.
3) Add PdfStripper.convertToStructuredText (instead of using PDFTextStripper).
4) Refactor HttpFileReader and net.webstructor.self.Cacher so that they receive the structured data as StructuredText instead of "String text", in a unified way.
5) Fix/extend the entire pattern matching machinery to use StructuredText instead of String.
6) Make the pattern matching machinery use the structure to determine text spot boundaries.
7) Make sure that the unit tests still pass, and fix them if needed.
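To make the first action more concrete, here is a minimal sketch of what a StructuredText node could look like. Everything in it (the Kind values, the field names, and the helpers ofText, title, link, spots and links) is hypothetical and only illustrates the idea; it is not taken from the existing Aigents code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the Aigents code base.
final class StructuredText {
    // Node kinds shared by every rich-text source (HTML, PDF, ...).
    enum Kind { DOCUMENT, SECTION, TITLE, PARAGRAPH, LINK, IMAGE, TEXT }

    final Kind kind;
    final String text;    // own text (title, text run, link label); "" for containers
    final String target;  // link href or image source; null otherwise
    final List<StructuredText> children = new ArrayList<>();

    StructuredText(Kind kind, String text, String target) {
        this.kind = kind;
        this.text = text == null ? "" : text;
        this.target = target;
    }

    static StructuredText node(Kind kind) { return new StructuredText(kind, "", null); }
    static StructuredText ofText(String s) { return new StructuredText(Kind.TEXT, s, null); }
    static StructuredText title(String s) { return new StructuredText(Kind.TITLE, s, null); }
    static StructuredText link(String label, String href) {
        return new StructuredText(Kind.LINK, label, href);
    }

    // Appends a child and returns this node so that trees can be built fluently.
    StructuredText add(StructuredText child) { children.add(child); return this; }

    // Text of this node and all descendants, in document order, joined by spaces.
    String plainText() {
        StringBuilder sb = new StringBuilder();
        collectText(this, sb);
        return sb.toString();
    }

    // Block-level spots (titles and paragraphs): units a match should not cross.
    List<String> spots() {
        List<String> out = new ArrayList<>();
        collectSpots(this, out);
        return out;
    }

    // Targets of all links in the subtree, in document order.
    List<String> links() {
        List<String> out = new ArrayList<>();
        collectLinks(this, out);
        return out;
    }

    private static void collectText(StructuredText n, StringBuilder sb) {
        if (!n.text.isEmpty()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(n.text);
        }
        for (StructuredText c : n.children) collectText(c, sb);
    }

    private static void collectSpots(StructuredText n, List<String> out) {
        if (n.kind == Kind.TITLE || n.kind == Kind.PARAGRAPH) {
            String s = n.plainText();
            if (!s.isEmpty()) out.add(s);
            return; // do not descend further: the block is the spot
        }
        for (StructuredText c : n.children) collectSpots(c, out);
    }

    private static void collectLinks(StructuredText n, List<String> out) {
        if (n.kind == Kind.LINK && n.target != null) out.add(n.target);
        for (StructuredText c : n.children) collectLinks(c, out);
    }
}
```

With a tree like this, both an HTML and a PDF stripper could produce the same node types, and the matcher could iterate over spots() instead of scanning one flat String.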

Note: The current HtmlStripper.convert inserts periods "." in place of structural HTML tags, but this is not done for PDF. Now is the time to do this consistently for any rich text source, without breaking the other working parts.
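One way to apply the period rule uniformly is to have every stripper emit block texts and then flatten them through a single helper. The sketch below assumes that a period is appended only when a block does not already end with sentence punctuation. The issue does not specify that rule, and BoundaryFlattener and flatten are hypothetical names.

```java
import java.util.List;

// Hypothetical helper: flattens block texts from any source (HTML, PDF, ...)
// into one string, marking each block boundary with a period.
final class BoundaryFlattener {
    static String flatten(List<String> blocks) {
        StringBuilder sb = new StringBuilder();
        for (String block : blocks) {
            String s = block.trim();
            if (s.isEmpty()) continue;            // skip empty blocks
            if (sb.length() > 0) sb.append(' ');  // separate from the previous block
            sb.append(s);
            char last = s.charAt(s.length() - 1);
            // Assumed rule: add a period only if the block does not end a sentence itself.
            if (last != '.' && last != '!' && last != '?') sb.append('.');
        }
        return sb.toString();
    }
}
```

Because the helper only sees block texts, the same boundary behaviour would apply whether the blocks came from HTML tags or from PDF layout.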