HTML Content Extraction

DasUberLeo commented 2 months ago

Problem Description

Crawler has the ability to store full pages as HTML, but often only subsets of the HTML are useful. For example, many sites keep their key content under xpath(*//main), and current tooling lets us extract this as text but not as HTML. Once extracted as HTML, the content can be post-processed into JSON, Markdown, or semantically separated chunks of text.

Proposed Solution

An option to have content extraction extract content as HTML.
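
To make the ask concrete, here is a minimal sketch of the difference between extracting an element as text and as HTML. It uses only Python's standard library, not the crawler itself; `ElementExtractor` and `extract` are hypothetical names for illustration, and the sketch assumes reasonably well-formed markup (every non-void start tag has a matching end tag).

```python
import html
from html.parser import HTMLParser

# Elements that never have an end tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementExtractor(HTMLParser):
    """Collects the first element with the given tag name, as HTML and as text."""

    def __init__(self, tag):
        super().__init__(convert_charrefs=True)
        self.tag = tag
        self.depth = 0          # nesting depth inside the target element
        self.done = False       # set once the target element has been closed
        self.html_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or (self.depth == 0 and tag != self.tag):
            return
        self.html_parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if not self.done and self.depth > 0:
            self.html_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        self.html_parts.append(f"</{tag}>")
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if not self.done and self.depth > 0:
            # Data arrives unescaped, so re-escape it for the HTML output.
            self.html_parts.append(html.escape(data, quote=False))
            self.text_parts.append(data)


def extract(page, tag):
    """Returns (text, html) for the first <tag> element in the page."""
    parser = ElementExtractor(tag)
    parser.feed(page)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split())
    return text, "".join(parser.html_parts)


page = ('<html><body><nav>Menu</nav><main><h1>Title</h1>'
        '<p>Hello <a href="/x">world</a></p></main>'
        '<footer>Foot</footer></body></html>')
as_text, as_html = extract(page, "main")
print(as_text)
print(as_html)
```

The text form loses the heading and the link target, while the HTML form keeps them for later conversion. Joining text nodes with spaces is a simplification and would split words that span inline tags.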

Alternatives

Ingestion pipeline elements to handle HTML
Painless capabilities to handle HTML
Workarounds outside of platform

seanstory commented 2 months ago

@DasUberLeo thanks for filing! One of Elastic Crawler's newer features was being able to store the full HTML of a page in a field, so that you could post-process it. I don't think that feature has made it to Open Crawler yet, but it's on the roadmap. Would that accomplish what you're looking for?

Ingestion pipeline elements to handle HTML
Painless capabilities to handle HTML

These seem like the more important asks, and we probably need a separate issue for those. In my experience, having raw HTML is only semi-useful, since your only option in ingest pipelines or script queries is to use regexes. And that's an awful way to parse HTML.

seanstory commented 2 months ago

I've filed https://github.com/elastic/elasticsearch/issues/113132 and https://github.com/elastic/elasticsearch/issues/113133 to track those suggestions explicitly.

DasUberLeo commented 2 months ago

Thanks @seanstory!

I think it'd be great if I wanted to crawl this Search Labs blog post, but only wanted the key article text. If I used content extraction on xpath(//article) I would get a raw text string for the article, i.e.:

Blog / Integrations Elasticsearch open Inference API adds support for AlibabaCloud AI Search We are excited to announce our latest addition to the Elasticsearch Open Inference API: the integration of AlibabaCloud AI Search. This work enables...

It would be awesome if there was a configuration so I could extract this as:

<article class="flex flex-col space-y-16"><div><div class="eyebrow mb-3"><a href="/search-labs/blog">Blog</a> / <a href="/search-labs/blog/category/integrations">Integrations</a></div><h1 class="font-bold leading-tighter text-3xl md:text-5xl"><span>Elasticsearch open Inference API adds support for AlibabaCloud AI Search</span></h1></div><div class="lg:grid lg:grid-cols-4 lg:gap-8 items-start relative mt-12"><div class="lg:col-span-3 w-full mx-auto flex flex-col gap-16"><div class="lg:col-span-3 w-full mx-auto flex flex-col gap-8"><div class="prose prose-invert text-white article-content"><p>We are excited to announce our latest addition to the Elasticsearch Open Inference API: the integration of AlibabaCloud AI Search. This work enables...

In a perfect world I would actually be able to extract this also as markdown - this makes the text much more useful to LLMs and way less verbose than HTML.
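
As a rough illustration of why the HTML form is a useful intermediate, here is a minimal sketch of turning an extracted fragment into Markdown with Python's standard library. `MarkdownSketch` and `to_markdown` are hypothetical names, and only headings, paragraphs, and links are handled; a real converter would need lists, emphasis, code, tables, and escaping.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Very small HTML-to-Markdown converter: headings, paragraphs, links."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.href = None    # target of the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.out.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.href = href
                self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.out.append(data)


def to_markdown(fragment):
    parser = MarkdownSketch()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out).strip()


# Shortened version of the article markup shown above, with classes removed.
fragment = (
    '<article><div><a href="/search-labs/blog">Blog</a> / '
    '<a href="/search-labs/blog/category/integrations">Integrations</a></div>'
    '<h1><span>Elasticsearch open Inference API</span></h1>'
    '<p>We are excited.</p></article>'
)
markdown = to_markdown(fragment)
print(markdown)
```

The layout-only wrappers (`div`, `span`, class attributes) disappear, which is where most of the verbosity savings over raw HTML come from.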

seanstory commented 2 months ago

I'd expect you could form an xpath or css selector to get the article-content div's text, instead of the full article element. Using Chrome's dev tools, the selector for that element looks to be:

#__next > main > main > article > div.lg\:grid.lg\:grid-cols-4.lg\:gap-8.items-start.relative.mt-12 > div.lg\:col-span-3.w-full.mx-auto.flex.flex-col.gap-16 > div.lg\:col-span-3.w-full.mx-auto.flex.flex-col.gap-8 > div.prose.prose-invert.text-white.article-content

or the xpath could be:

//*[@id="__next"]/main/main/article/div[2]/div[1]/div[1]/div[1]

Neither is great, but I just wanted to suggest options until better tooling is available.
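
For post-processing outside the crawler, matching on the one stable class name is less brittle than a positional path. Below is a minimal sketch using Python's `xml.etree.ElementTree`; it assumes the fragment is well-formed enough to parse as XML, which real pages often are not, and `find_by_class` is a hypothetical helper, not a crawler feature.

```python
import xml.etree.ElementTree as ET


def find_by_class(root, tag, class_name):
    """Returns the first <tag> element whose class list contains class_name."""
    for el in root.iter(tag):
        if class_name in el.get("class", "").split():
            return el
    return None


# Shortened, well-formed version of the article markup from this thread.
doc = ET.fromstring(
    '<article class="flex flex-col space-y-16">'
    '<div><h1>Title</h1></div>'
    '<div class="lg:grid lg:grid-cols-4">'
    '<div class="prose prose-invert text-white article-content">'
    '<p>We are excited to announce...</p>'
    '</div></div></article>'
)

el = find_by_class(doc, "div", "article-content")
content_text = "".join(el.itertext())              # text only
content_html = ET.tostring(el, encoding="unicode")  # element serialized as markup
print(content_text)
print(content_html)
```

The same element yields both forms here, so the choice between text and HTML output is only a matter of how the matched node is serialized.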

In a perfect world I would actually be able to extract this also as markdown - this makes the text much more useful to LLMs

This is an interesting point. We've typically focused on extracting just the raw text and dropping formatting characters, because historically Elasticsearch's BM25 search is only hurt by formatting characters. But perhaps we need to revisit this with the growing popularity of LLMs. 🤔
