elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

HTML parsing libs in Painless #113132

Open seanstory opened 1 month ago

seanstory commented 1 month ago

Description

Relates to https://github.com/elastic/crawler/issues/144

Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to store full HTML, and we'll likely be adding the same feature to the Open Crawler at some point in the future.

The feature request is to make it easier for a user to work with HTML content in Elasticsearch fields. Currently, if I wanted to write a script query or use the ScriptProcessor on an HTML field, I'd need to parse the content with a regex. That's not a great way to deal with HTML. Instead, it would be nice to expose some java or groovy library for dealing with HTML as an object. Use case examples:
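To illustrate why regex handling of HTML is fragile, here is a minimal sketch in plain Java (not Painless itself, and not taken from the crawler code). The class name `RegexTitleDemo` and the pattern are hypothetical, chosen only to show a naive title extraction breaking on ordinary HTML variations.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: naive regex extraction of a page title.
final class RegexTitleDemo {
    // Assumes a lowercase, attribute-free <title> tag with its text on one line.
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Works for the simple case.
        System.out.println(extractTitle("<html><head><title>Docs</title></head></html>"));
        // Misses an uppercase tag with an attribute.
        System.out.println(extractTitle("<TITLE lang=\"en\">Docs</TITLE>"));
        // Misses a title that spans lines ('.' does not match newlines by default).
        System.out.println(extractTitle("<title>\nDocs\n</title>"));
        // Picks up a title that is commented out.
        System.out.println(extractTitle("<!-- <title>Old</title> --><title>New</title>"));
    }
}
```

Each failure can be patched with flags or a longer pattern, but every patch covers one more variation rather than the structure of the document, which is the motivation for treating HTML as an object instead.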

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

rjernst commented 1 month ago

The ask seems reasonable, but I want to clarify something on expectations:

> it would be nice to expose some java or groovy library for dealing with HTML as an object

Painless is not versioned, so any changes to APIs must be done carefully. For this reason, in the past we have not exposed libraries directly. Instead, we introduce our own APIs that we can control, backed by whatever internal implementation we wish to use.

In this specific case, that means this is not a small ask. HTML parsing is not trivial, and APIs for it are often complex (lots of options, how to handle errors, how lenient to be, etc.).
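As a rough sketch of what "our own API backed by an internal implementation" could look like, the Java below separates a small script-facing surface from the code that does the work. All names (`HtmlFacade`, `HtmlText`, `NaiveTagStripper`) are hypothetical and do not exist in Elasticsearch or Painless. The tag stripper is only a stand-in for a real parser, and "never throw on malformed input" is one assumed answer to the leniency question, not a decided design.

```java
// Hypothetical entry point; scripts would see only this class and HtmlText.
final class HtmlFacade {
    private HtmlFacade() {}

    static HtmlText parse(String html) {
        // The backing implementation is an internal detail. It could be
        // swapped for a real HTML parser without changing the API above.
        String text = NaiveTagStripper.strip(html);
        return () -> text;
    }

    public static void main(String[] args) {
        System.out.println(parse("<p>Hello <b>world</b></p>").text());
        System.out.println(parse("<p>unclosed <b").text());
    }
}

// Hypothetical script-facing type: deliberately small so it can stay stable.
interface HtmlText {
    // Text with tags removed; assumed lenient, so malformed input never throws.
    String text();
}

// Stand-in implementation, NOT a real HTML parser: it only skips characters
// between '<' and '>' and collapses runs of whitespace.
final class NaiveTagStripper {
    private NaiveTagStripper() {}

    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
                out.append(' '); // keep words on either side of a tag apart
            } else if (!inTag) {
                out.append(c);
            }
        }
        return out.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Even this toy version has to decide what happens on an unclosed tag and how whitespace is normalized, and once scripts depend on those answers they become hard to change.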