HTML pipeline processor

seanstory commented 1 month ago

Relates to https://github.com/elastic/crawler/issues/144 and https://github.com/elastic/elasticsearch/issues/113132

Elastic has a few web crawlers (App Search Crawler, Elastic Web Crawler, Open Crawler). The Elastic Web Crawler has a feature to store full HTML, and we'll likely be adding the same feature to the Open Crawler at some point in the future.

The feature request is to make it easier for a user to use HTML content in Elasticsearch fields without having to write code to parse HTML. Currently, if I wanted to do some processing of an HTML field in an ingest pipeline, I'd need to use a ScriptProcessor on that field, using regexes. https://github.com/elastic/elasticsearch/issues/113132 would make it easier to do this in a script processor. But some users would prefer not to get so far into the weeds for simpler HTML processing tasks. Common use cases might include:

removing specific elements and their children (for dropping headers, footers, and ads)
pulling specific element text into other fields (like getting <h1> or <title> into a title field)
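
To make the two use cases above concrete, here is a rough sketch of the behavior in Python, using only the standard library's `html.parser`. This is not a proposed implementation or an existing Elasticsearch API: the names `HtmlFieldExtractor`, `process_html`, `remove_tags`, and `extract_tags` are hypothetical, and a real processor would probably want selector support and more careful handling of malformed markup.

```python
from html.parser import HTMLParser


class HtmlFieldExtractor(HTMLParser):
    """Hypothetical helper: drops some elements, captures text of others."""

    def __init__(self, remove_tags, extract_tags):
        super().__init__(convert_charrefs=True)
        self.remove_tags = set(remove_tags)    # elements dropped with their children
        self.extract_tags = set(extract_tags)  # elements whose text becomes a field
        self.skip_depth = 0      # > 0 while inside a removed element
        self.capturing = None    # tag currently being captured, if any
        self.fields = {}         # tag name -> first captured text
        self.body_parts = []     # text kept outside removed elements
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_tags:
            self.skip_depth += 1
        elif (self.skip_depth == 0 and self.capturing is None
              and tag in self.extract_tags and tag not in self.fields):
            self.capturing = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.remove_tags and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == self.capturing:
            # Collapse runs of whitespace in the captured text.
            self.fields[tag] = " ".join("".join(self._buffer).split())
            self.capturing = None

    def handle_data(self, data):
        if self.skip_depth > 0:
            return  # inside a removed element: ignore its text
        if self.capturing is not None:
            self._buffer.append(data)
        text = " ".join(data.split())
        if text:
            # Captured text is also kept in the body in this sketch.
            self.body_parts.append(text)


def process_html(html, remove_tags=("header", "footer", "nav"),
                 extract_tags=("title", "h1")):
    parser = HtmlFieldExtractor(remove_tags, extract_tags)
    parser.feed(html)
    parser.close()
    return {"fields": parser.fields, "body": " ".join(parser.body_parts)}


sample = """
<html><head><title>My Page</title></head>
<body>
<header><nav>Site navigation</nav></header>
<h1>Hello   World</h1>
<p>Main text.</p>
<footer>Copyright</footer>
</body></html>
"""
result = process_html(sample)
```

With the sample above, `result["fields"]` holds the `<title>` and `<h1>` text, and `result["body"]` keeps the remaining text with the header, nav, and footer content dropped. The idea is that a processor would expose roughly these two knobs as configuration instead of requiring a script.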

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-data-management (Team:Data Management)

dakrone commented 1 month ago

Random drive-by thought, but I wonder whether it'd be possible to parse HTML in such a manner as to make it work with ObjectPath.java?

elastic / elasticsearch

HTML pipeline processor #113133