Open seanstory opened 1 month ago
Pinging @elastic/es-core-infra (Team:Core/Infra)
The ask seems reasonable, but I want to clarify something on expectations:
"it would be nice to expose some Java or Groovy library for dealing with HTML as an object"
Painless is not versioned, so any changes to APIs must be done carefully. For this reason, in the past we have not exposed libraries directly. Instead, we introduce our own APIs that we can control, backed by whatever internal implementation we wish to use.
In this specific case, that means this is not a small ask. HTML parsing is not trivial, and APIs for it are often complex (lots of options, how to handle errors, how lenient to be, etc.).
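To make the "our own API, backed by an internal implementation" pattern concrete, here is a minimal sketch in plain Java. Everything in it is hypothetical: `HtmlDocument`, `parse`, and `NaiveHtmlDocument` are illustrative names, not existing Painless or Elasticsearch APIs, and the regex-based internals are only a stand-in for whatever real HTML parser would be chosen.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlFacadeSketch {

    /** Hypothetical script-facing API: deliberately small so it can stay stable. */
    public interface HtmlDocument {
        /** Returns the text of the first title element, if there is one. */
        Optional<String> title();
    }

    /** Hypothetical entry point; callers never see the concrete implementation. */
    public static HtmlDocument parse(String html) {
        return new NaiveHtmlDocument(html);
    }

    /**
     * Internal stand-in (assumption for illustration only). A real implementation
     * would delegate to a proper HTML parser and could be swapped out later
     * without changing the HtmlDocument interface.
     */
    private static final class NaiveHtmlDocument implements HtmlDocument {
        private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        private final String html;

        NaiveHtmlDocument(String html) {
            this.html = html;
        }

        @Override
        public Optional<String> title() {
            Matcher m = TITLE.matcher(html);
            return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
        }
    }

    public static void main(String[] args) {
        HtmlDocument doc = parse("<html><head><title> Hello </title></head><body></body></html>");
        System.out.println(doc.title().orElse("(no title)")); // prints "Hello"
    }
}
```

The point of the sketch is only the shape: scripts would depend on the narrow interface, so the parser behind it could change without breaking existing scripts. Deciding which methods belong on that interface, and how lenient parsing should be, is the non-trivial part.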
Description
Relates to https://github.com/elastic/crawler/issues/144
Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to store full HTML, and we'll likely be adding the same feature to the Open Crawler at some point in the future.
The feature request is to make it easier for a user to utilize HTML content in Elasticsearch fields. Currently, if I wanted to write a script query or use the ScriptProcessor on an HTML field, I'd need to parse the content with a regex. That's not a great way to deal with HTML. Instead, it would be nice to expose some Java or Groovy library for dealing with HTML as an object. Use case examples: