Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.44k stars, 580 forks

feat/partition_metadata #2933

Open Falven opened 2 months ago

Falven commented 2 months ago

Is your feature request related to a problem? Please describe.

I need to extract additional metadata from HTML documents. Specifically, I would like to extract favicons and head > title elements.

Describe the solution you'd like

Some flexible way to define additional metadata to extract per document type. Text file types could be handled via regex (which seems to be supported already), HTML via selectors, and so on.
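To make the request concrete, here is a minimal sketch of what "per document type" extraction could look like for the regex case. Everything in it is hypothetical: `EXTRACTORS`, `regex_extractor`, and `extract_extra_metadata` are illustrative names, not part of unstructured, and the `Subject:` pattern is only an example.

```python
import re
from typing import Callable, Dict

# Hypothetical type alias: an extractor takes raw document text and
# returns extra metadata as a flat dict.
Extractor = Callable[[str], Dict[str, str]]


def regex_extractor(patterns: Dict[str, str]) -> Extractor:
    """Build an extractor from {metadata_key: regex with one capture group}."""
    compiled = {
        key: re.compile(pattern, re.MULTILINE | re.IGNORECASE)
        for key, pattern in patterns.items()
    }

    def extract(text: str) -> Dict[str, str]:
        found: Dict[str, str] = {}
        for key, pattern in compiled.items():
            match = pattern.search(text)
            if match:
                found[key] = match.group(1).strip()
        return found

    return extract


# Hypothetical registry keyed by document type. An HTML entry could use
# selectors instead of regexes.
EXTRACTORS: Dict[str, Extractor] = {
    "text": regex_extractor({"subject": r"^Subject:[ \t]*(.+)$"}),
}


def extract_extra_metadata(file_type: str, content: str) -> Dict[str, str]:
    """Run the extractor registered for file_type, if any."""
    extractor = EXTRACTORS.get(file_type)
    return extractor(content) if extractor else {}


print(extract_extra_metadata("text", "From: a\nSubject: Hello\nBody"))
```

The point of the registry is that each document type can plug in its own strategy (regex, selectors, or an LLM call) behind the same function signature.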

Describe alternatives you've considered

Doing it after partitioning and before indexing, but that is neither elegant nor efficient.

Additional context

Even using LLMs to extract metadata, as orchestration frameworks support, would be great.

adieuadieu commented 2 months ago

I've also wanted this: the title, but also meta tags like keywords and description, and the og tags. Currently I fetch the URL myself, parse these out with beautifulsoup, then pass the response text to partition for the rest. It would be nicer if partition_html could return these in a more structured way. For the title especially, it would be nice if it came back as its own element type, e.g. PageTitle (or perhaps HTMLHeadTitle), or something like that.
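For reference, the workaround described above can be sketched roughly as follows. This is an illustration only: it uses the standard library's `html.parser` in place of beautifulsoup so that it runs without dependencies, and `HeadMetadataParser` and `extract_head_metadata` are hypothetical names, not anything unstructured provides. Fetching the URL and the later call to partition are left out.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class HeadMetadataParser(HTMLParser):
    """Collect the first <title>, <meta> name/property pairs, and a favicon link."""

    def __init__(self) -> None:
        super().__init__()
        self.title: Optional[str] = None
        self.favicon: Optional[str] = None
        self.meta: Dict[str, str] = {}
        self._in_title = False
        self._title_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "meta":
            # Covers <meta name="description"> as well as <meta property="og:...">.
            key = attr.get("name") or attr.get("property")
            if key and "content" in attr:
                self.meta[key.lower()] = attr["content"]
        elif tag == "link" and self.favicon is None:
            # rel may hold several tokens, e.g. "shortcut icon".
            if "icon" in attr.get("rel", "").lower().split():
                self.favicon = attr.get("href") or None

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_head_metadata(html: str) -> dict:
    """Return title, favicon href, and meta tags found in the HTML string."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "favicon": parser.favicon, "meta": parser.meta}


sample = """<html><head><title>Example page</title>
<meta name="description" content="A demo">
<meta property="og:title" content="Example OG">
<link rel="shortcut icon" href="/favicon.ico">
</head><body><p>Hello</p></body></html>"""

print(extract_head_metadata(sample))
```

The resulting dict could then be attached to the elements returned by partitioning, which is the manual step this issue asks the library to take over.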