elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

HTML pipeline processor #113133

Open seanstory opened 1 week ago

seanstory commented 1 week ago

Relates to https://github.com/elastic/crawler/issues/144 and https://github.com/elastic/elasticsearch/issues/113132.

Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to store full HTML, and we'll likely add the same feature to the Open Crawler at some point.

The feature request is to make it easier for users to use HTML content in Elasticsearch fields without having to write code to parse HTML. Currently, if I wanted to process an HTML field in an ingest pipeline, I'd need to run a ScriptProcessor on that field and parse it with regexes. https://github.com/elastic/elasticsearch/issues/113132 would make this easier to do in a script processor, but some users would prefer not to get so far into the weeds for simpler HTML processing tasks. Common use cases might include:

elasticsearchmachine commented 1 week ago

Pinging @elastic/es-data-management (Team:Data Management)

dakrone commented 1 week ago

Random drive-by thought, but I wonder whether it'd be possible to parse HTML in such a manner as to make it work with ObjectPath.java?
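
To make the idea a bit more concrete, here is a rough sketch of parsing HTML into nested maps and lists and then reading values with a dotted path. It is in Python rather than Java purely for brevity, and it does not use ObjectPath.java or any Elasticsearch API. The names `TreeBuilder`, `parse_html`, and `evaluate`, the `_text` and `_attrs` keys, and the "a name applied to a list picks the first element" rule are all assumptions made for illustration. It also assumes well-formed HTML; real-world pages would need a more forgiving parser.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds nested dicts: each element maps child tag names to lists of child nodes."""

    def __init__(self):
        super().__init__()
        self.root = {}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"_attrs": dict(attrs), "_text": ""}
        self.stack[-1].setdefault(tag, []).append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed input: every end tag closes the innermost open element.
        if len(self.stack) > 1 and tag not in VOID_TAGS:
            self.stack.pop()

    def handle_data(self, data):
        # Text outside any element is ignored.
        if len(self.stack) > 1:
            self.stack[-1]["_text"] += data


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def evaluate(tree, path):
    """Walks a dotted path; numeric parts index lists, other parts are dict keys."""
    current = tree
    for part in path.split("."):
        if isinstance(current, list):
            if part.isdigit():
                current = current[int(part)]
                continue
            current = current[0]  # hypothetical rule: default to the first match
        current = current[part]
    return current


doc = parse_html(
    "<html><head><title>Hello</title></head>"
    "<body><p class='a'>One</p><p>Two</p></body></html>"
)
title = evaluate(doc, "html.head.title._text")
second_paragraph = evaluate(doc, "html.body.p.1._text")
first_class = evaluate(doc, "html.body.p._attrs.class")
```

With a structure like this, a processor could take a path such as `html.head.title._text` as configuration and copy the result into a target field, without the user writing any parsing code.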