Open DasUberLeo opened 2 months ago
@DasUberLeo thanks for filing! One of Elastic Crawler's newer features was being able to store the full HTML of a page in a field, so that you could post-process it. I don't think that feature has made it to Open Crawler yet, but it's on the roadmap. Would that accomplish what you're looking for?
Ingestion pipeline elements to handle HTML
Painless capabilities to handle HTML
These seem like the more important asks, and we probably need a separate issue for those. In my experience, having raw HTML is only semi-useful, since your only option in ingest pipelines or script queries is to use regexes. And that's an awful way to parse HTML.
I've filed https://github.com/elastic/elasticsearch/issues/113132 and https://github.com/elastic/elasticsearch/issues/113133 to track those suggestions explicitly.
Thanks @seanstory!
I think it'd be great. Say I wanted to crawl this Search Labs blog post, but only wanted the key article text. If I used content extraction on xpath(//article), I would get a raw text string for the article, i.e.:
Blog / Integrations Elasticsearch open Inference API adds support for AlibabaCloud AI Search We are excited to announce our latest addition to the Elasticsearch Open Inference API: the integration of AlibabaCloud AI Search. This work enables...
It would be awesome if there was a configuration so I could extract this as:
<article class="flex flex-col space-y-16"><div><div class="eyebrow mb-3"><a href="/search-labs/blog">Blog</a> / <a href="/search-labs/blog/category/integrations">Integrations</a></div><h1 class="font-bold leading-tighter text-3xl md:text-5xl"><span>Elasticsearch open Inference API adds support for AlibabaCloud AI Search</span></h1></div><div class="lg:grid lg:grid-cols-4 lg:gap-8 items-start relative mt-12"><div class="lg:col-span-3 w-full mx-auto flex flex-col gap-16"><div class="lg:col-span-3 w-full mx-auto flex flex-col gap-8"><div class="prose prose-invert text-white article-content"><p>We are excited to announce our latest addition to the Elasticsearch Open Inference API: the integration of AlibabaCloud AI Search. This work enables...
In a perfect world I would actually be able to extract this also as markdown - this makes the text much more useful to LLMs and way less verbose than HTML.
I'd expect you could form an xpath or css selector to get the article-content div's text, instead of the full article element. Using Chrome's dev tools, the selector for that element looks to be: #__next > main > main > article > div.lg\:grid.lg\:grid-cols-4.lg\:gap-8.items-start.relative.mt-12 > div.lg\:col-span-3.w-full.mx-auto.flex.flex-col.gap-16 > div.lg\:col-span-3.w-full.mx-auto.flex.flex-col.gap-8 > div.prose.prose-invert.text-white.article-content
or the xpath could be: //*[@id="__next"]/main/main/article/div[2]/div[1]/div[1]/div[1]
Neither is great, but I just wanted to suggest options until better tooling is available.
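For illustration, here is a rough Python sketch of the same idea using only the standard library: match the content div by its article-content class instead of a long positional path. The PAGE markup is a simplified, hypothetical stand-in for the blog page, and the helpers text_of and find_by_class are made up for this example. It assumes well-formed markup; a real page would need a lenient HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified, well-formed stand-in for the blog page markup.
PAGE = """
<html><body><main><article>
  <div><h1>Elasticsearch open Inference API adds support for AlibabaCloud AI Search</h1></div>
  <div class="grid">
    <div class="prose prose-invert text-white article-content">
      <p>We are excited to announce our latest addition.</p>
    </div>
  </div>
</article></main></body></html>
"""

def text_of(element):
    # Join every text node under the element and collapse runs of whitespace.
    return " ".join("".join(element.itertext()).split())

def find_by_class(root, tag, class_name):
    # ElementTree's XPath subset has no contains(), so compare class tokens in Python.
    for el in root.iter(tag):
        if class_name in el.get("class", "").split():
            return el
    return None

root = ET.fromstring(PAGE.strip())
article = root.find(".//article")                         # the whole article, heading included
content = find_by_class(root, "div", "article-content")   # only the article body

print(text_of(article))
print(text_of(content))
```

Matching on a single class token tends to survive layout changes better than a path built from element positions, though it still breaks if the site renames the class.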
In a perfect world I would actually be able to extract this also as markdown - this makes the text much more useful to LLMs
This is an interesting point. We've typically focused on extracting just the raw text and dropping formatting characters, because historically Elasticsearch's BM25 search is only hurt by formatting characters. But perhaps we need to revisit this with the growing popularity of LLMs. 🤔
Problem Description
Crawler has the ability to store full pages as HTML, but often only subsets of the HTML are useful. For example, many sites have key content in xpath(*//main), and current tooling allows us to extract this as text, but not as HTML. Once extracted as HTML, additional work can be undertaken to convert it to JSON, Markdown, or semantically separated chunks of text.
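As a rough illustration of the kind of post-processing this would enable, the sketch below converts an extracted HTML fragment to Markdown with Python's standard html.parser. The MarkdownSketch class and html_to_markdown function are hypothetical names for this example, not part of any crawler. It only handles h1 to h3, p, and a; everything else is passed through as plain text.

```python
from html.parser import HTMLParser

class MarkdownSketch(HTMLParser):
    """Converts a tiny subset of HTML (h1-h3, p, a) to Markdown."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # Blank line, then one '#' per heading level.
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("](%s)" % (self.href or ""))
            self.href = None

    def handle_data(self, data):
        # Text inside unhandled tags (div, span, ...) is kept as-is.
        self.parts.append(data)

def html_to_markdown(html):
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts).strip()

fragment = (
    '<div><a href="/search-labs/blog">Blog</a> / '
    '<a href="/search-labs/blog/category/integrations">Integrations</a></div>'
    "<h1><span>Title</span></h1><p>We are excited.</p>"
)
print(html_to_markdown(fragment))
```

The point is that links and heading levels survive the conversion, which is not possible once the content has already been flattened to text.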
Proposed Solution
An option to have content extraction extract content as HTML.
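To make the request concrete, here is a minimal Python sketch of what such an option could look like. The extract function and its output parameter are hypothetical and are not an existing crawler setting; the sample PAGE is invented, and the code assumes well-formed markup so the standard library's ElementTree can parse it.

```python
import xml.etree.ElementTree as ET

# Invented sample page; only <main> holds the content we care about.
PAGE = """
<html><body>
<nav>Menu</nav>
<main>
<h1>Title</h1>
<p>Body <a href="/x">link</a>.</p>
</main>
</body></html>
"""

def extract(page, path, output="text"):
    """Return the first element matching `path` as collapsed text or as HTML."""
    root = ET.fromstring(page.strip())
    element = root.find(path)
    if element is None:
        return None
    if output == "html":
        # method="html" avoids XML-style self-closing tags; strip() drops tail whitespace.
        return ET.tostring(element, encoding="unicode", method="html").strip()
    # Default behaviour: text only, with whitespace collapsed.
    return " ".join("".join(element.itertext()).split())

print(extract(PAGE, ".//main"))                  # text only, markup dropped
print(extract(PAGE, ".//main", output="html"))   # markup preserved for post-processing
```

Keeping text as the default would leave existing extraction rules unchanged, with HTML output being opt-in per rule.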
Alternatives