Open Falven opened 2 months ago
I've also wanted this. The `title`, but also meta tags like the keywords and description, and the og tags. Currently I fetch the URL myself, parse these things out with beautifulsoup, then pass the response text to `partition` for the rest. But it would be nicer if `partition_html` could return these things in a more structured way. Especially for the title, it would be nice if it came back as, e.g., a `PageTitle` (or, I guess, `HTMLHeadTitle`?) element type, or something like that.
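For anyone needing this in the meantime, here is a rough sketch of the kind of pre-pass described above. It is not how the library does anything; it is a standalone workaround. To keep it dependency-free it uses the standard library's `html.parser` rather than beautifulsoup, and the names `HeadMetadataParser` and `extract_head_metadata` are made up for illustration.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the first <title> text and <meta> name/property -> content pairs.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" and self.title is None:
            # Only the first <title> counts; later ones (e.g. inside SVG) are ignored.
            self._in_title = True
        elif tag == "meta":
            # Standard meta tags use name=..., Open Graph tags use property=...
            key = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if key and content is not None:
                self.meta[key.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_head_metadata(html_text):
    """Return {'title': ..., 'meta': {...}} for an HTML string."""
    parser = HeadMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title, "meta": parser.meta}
```

The returned dict can be kept next to the elements you get from partitioning the same text, which is roughly the two-step flow described above, just without the extra dependency.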
Is your feature request related to a problem? Please describe.
I need to be able to extract additional metadata from HTML documents. Specifically, I would like to extract favicons and `head > title` elements.
Describe the solution you'd like
Some flexible way to define additional metadata to extract per document type. Text file types could be handled via regex (as seemingly already supported), HTML via selectors, etc.
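To make the idea concrete, one possible shape for this is a small registry of extractor callables keyed by document type. This is only a sketch under assumptions: `EXTRACTORS`, `regex_extractor`, `html_favicons`, and `extract_metadata` are hypothetical names, not an existing API, and the HTML side uses the standard library's `html.parser` in place of real CSS selectors.

```python
import re
from html.parser import HTMLParser


class _FaviconParser(HTMLParser):
    """Collect href values of <link> tags whose rel contains the token 'icon'."""

    def __init__(self):
        super().__init__()
        self.favicons = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        # rel is a space-separated token list, e.g. "icon" or "shortcut icon".
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "icon" in rel_tokens and attrs.get("href"):
            self.favicons.append(attrs["href"])


def html_favicons(content):
    """HTML extractor: return favicon URLs found in <link> tags."""
    parser = _FaviconParser()
    parser.feed(content)
    parser.close()
    return {"favicons": parser.favicons}


def regex_extractor(**patterns):
    """Build a text extractor from name -> regex pairs (first capture group wins)."""
    compiled = {name: re.compile(p, re.MULTILINE) for name, p in patterns.items()}

    def extract(content):
        found = {}
        for name, pattern in compiled.items():
            match = pattern.search(content)
            if match:
                found[name] = match.group(1)
        return found

    return extract


# Hypothetical registry: document type -> list of extractor callables.
EXTRACTORS = {
    "html": [html_favicons],
    "text": [regex_extractor(author=r"^Author:\s*(.+)$")],
}


def extract_metadata(content, doc_type):
    """Run every extractor registered for doc_type and merge the results."""
    metadata = {}
    for extractor in EXTRACTORS.get(doc_type, []):
        metadata.update(extractor(content))
    return metadata
```

With something like this, adding a new kind of metadata means registering one more callable for the relevant document type, and unknown types simply yield an empty dict.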
Describe alternatives you've considered
Doing it post-partitioning, before indexing, but that is neither elegant nor efficient.
Additional context
Even using LLMs to extract metadata, as orchestration frameworks support, would be great.