HTML parsing libs in Painless

seanstory commented 1 month ago

Description

Relates to https://github.com/elastic/crawler/issues/144

Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to store full HTML, and we'll likely be adding the same feature to the Open Crawler at some point in the future.

The feature request is to make it easier for a user to work with HTML content in Elasticsearch fields. Currently, if I wanted to write a script query or use the ScriptProcessor on an HTML field, I'd need to parse the content with a regex. That's not a great way to deal with HTML. Instead, it would be nice to expose some Java or Groovy library for dealing with HTML as an object. Use case examples:

- finding the text value of a specific element
- counting the number of times a class/element is present on the page
- stripping headers/footers from the page
- removing embedded JavaScript
- obfuscating PII that might be embedded in the HTML in certain elements
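
For comparison, here is a rough sketch of what the regex route looks like for three of the use cases above. It is plain Java using only `java.util.regex`; the class and method names are made up for illustration and are not an existing Painless or Elasticsearch API. The patterns are deliberately naive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class RegexHtmlSketch {
    // Naive match for the first <title> element; DOTALL lets '.' span newlines.
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Naive match for double-quoted class attribute values.
    private static final Pattern CLASS_ATTR = Pattern.compile(
        "class\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Naive match for whole <script>...</script> blocks.
    private static final Pattern SCRIPT = Pattern.compile(
        "<script\\b[^>]*>.*?</script>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Text value of a specific element (here: the page title), or null if absent.
    static String titleText(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Number of class attributes that list the given class name.
    static int countClass(String html, String className) {
        int count = 0;
        Matcher m = CLASS_ATTR.matcher(html);
        while (m.find()) {
            for (String token : m.group(1).trim().split("\\s+")) {
                if (token.equals(className)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Remove embedded JavaScript blocks.
    static String stripScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }
}
```

These patterns miss single-quoted or unquoted attributes, match attribute-like text inside comments or page content, and can be fooled by a literal `</script>` inside a script string. That fragility is what the request is trying to avoid.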

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

rjernst commented 1 month ago

The ask seems reasonable, but I want to clarify something on expectations:

"it would be nice to expose some Java or Groovy library for dealing with HTML as an object"

Painless is not versioned, so any changes to APIs must be done carefully. For this reason, in the past we have not exposed libraries directly. Instead, we introduce our own APIs that we can control, backed by whatever internal implementation we wish to use.

In this specific case, that means this is not a small ask. HTML parsing is not trivial, and APIs for it are often complex (lots of options, how to handle errors, how lenient to be, etc.).
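
To make the "own API, internal implementation" idea concrete, here is a minimal sketch of the layering in plain Java. Everything in it is hypothetical: `HtmlDoc`, `parse`, `firstText`, and `count` are invented names, not a proposed or existing Painless API. The backend is a regex stand-in used only so the example runs without dependencies; a real backend would presumably delegate to a proper HTML parser.

```java
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a narrow, controlled surface in front of a hidden backend.
final class HtmlFacadeSketch {

    /** The only type scripts would see. Small on purpose, so it can stay stable. */
    public interface HtmlDoc {
        /** Text of the first element with this tag name, or null if there is none. */
        String firstText(String tagName);

        /** Number of opening tags with this tag name. */
        int count(String tagName);
    }

    /** Single entry point. Leniency and error handling are decided here, not by callers. */
    static HtmlDoc parse(String html) {
        return new RegexBackedDoc(Objects.requireNonNull(html, "html"));
    }

    /** Stand-in backend (assumption). Swappable without changing HtmlDoc. */
    private static final class RegexBackedDoc implements HtmlDoc {
        private final String html;

        RegexBackedDoc(String html) {
            this.html = html;
        }

        @Override
        public String firstText(String tagName) {
            String tag = Pattern.quote(tagName);
            Matcher m = Pattern.compile(
                "<" + tag + "\\b[^>]*>(.*?)</" + tag + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
            if (!m.find()) {
                return null;
            }
            // Drop any nested tags so only the text remains.
            return m.group(1).replaceAll("<[^>]*>", "").trim();
        }

        @Override
        public int count(String tagName) {
            Matcher m = Pattern.compile(
                "<" + Pattern.quote(tagName) + "\\b[^>]*>",
                Pattern.CASE_INSENSITIVE).matcher(html);
            int n = 0;
            while (m.find()) {
                n++;
            }
            return n;
        }
    }
}
```

A script would then call something like `HtmlFacadeSketch.parse(html).count("p")` and never touch the backend's types, so the backing library could be replaced or upgraded without breaking existing scripts.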

elastic / elasticsearch

HTML parsing libs in Painless #113132
