elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
976 stars 24.82k forks

HTML pipeline processor #113133

Open seanstory opened 1 month ago

seanstory commented 1 month ago

Relates to https://github.com/elastic/crawler/issues/144 and https://github.com/elastic/elasticsearch/issues/113132.

Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to store full HTML, and we'll likely be adding the same feature to the Open Crawler at some point in the future.

The feature request is to make it easier for users to work with HTML content in Elasticsearch fields without having to write code to parse HTML. Currently, to process an HTML field in an ingest pipeline, I'd need to use a ScriptProcessor on that field and parse it with regexes. https://github.com/elastic/elasticsearch/issues/113132 would make it easier to do this in a script processor, but some users would prefer not to get so far into the weeds for simpler HTML processing tasks. Common use cases might include:
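
To illustrate the regex approach described above and why it gets fiddly, here is a rough sketch. It is written in Python rather than Painless, purely as a stand-in, and the `body_html` field name, the sample document, and both helper functions are hypothetical. It compares a naive regex with the standard library's `html.parser` on the same input; it is not a proposal for how the processor should be implemented.

```python
import re
from html.parser import HTMLParser

# Hypothetical crawled document; the field name "body_html" is an assumption.
doc = {
    "body_html": (
        "<html><head><title lang='en'>Hello world</title></head>"
        "<body><a href='/a'>A</a><A HREF=\"/b\">B</A></body></html>"
    )
}


def title_via_regex(html):
    """Naive regex extraction: misses <title> tags that carry attributes."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href> value."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract(html):
    """Parse the HTML once and return the fields we care about."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return {"title": "".join(parser.title_parts).strip(), "links": parser.links}


regex_title = title_via_regex(doc["body_html"])
parsed = extract(doc["body_html"])
```

On this input the regex finds no title because the tag has an attribute, while the parser-based version returns both the title and the two links. A dedicated processor would presumably hide this kind of detail from the pipeline author.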

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-data-management (Team:Data Management)

dakrone commented 1 month ago

Random drive-by thought, but I wonder whether it'd be possible to parse HTML in such a manner as to make it work with ObjectPath.java?
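
One possible reading of this idea, sketched in Python: parse the HTML into nested maps and lists, then walk a dotted path through them. This assumes that ObjectPath-style evaluation means resolving a path such as `html.0.body.0.h1.0.text` against nested maps and lists; the tree layout, the `attrs` and `text` keys, and both helpers below are hypothetical and are not taken from ObjectPath.java. It also assumes well-formed HTML with no unclosed void elements.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds nested dicts/lists: each tag name maps to a list of child nodes."""

    def __init__(self):
        super().__init__()
        self.root = {}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"attrs": dict(attrs), "text": ""}
        # Siblings with the same tag name are appended to the same list.
        self.stack[-1].setdefault(tag, []).append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed input: every end tag closes the current node.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Text outside any element is ignored.
        if len(self.stack) > 1:
            self.stack[-1]["text"] += data


def parse_to_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def evaluate(tree, path):
    """Walk a dotted path; numeric parts index into lists, others into dicts."""
    current = tree
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]
        else:
            current = current[part]
    return current


tree = parse_to_tree(
    "<html><body><h1>Title</h1><p>one</p><p class='x'>two</p></body></html>"
)
heading = evaluate(tree, "html.0.body.0.h1.0.text")
second_class = evaluate(tree, "html.0.body.0.p.1.attrs.class")
```

Here `heading` resolves to the `<h1>` text and `second_class` to the `class` attribute of the second paragraph. Whether real-world, messy HTML maps cleanly onto such a structure is an open question.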