feat/partition_metadata

Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Apache License 2.0


Is your feature request related to a problem? Please describe.
I need to be able to extract additional metadata from HTML documents. Specifically, I would like to extract favicons and head > title elements.

Describe the solution you'd like
Some flexible way to define additional metadata to extract per document type. For text file types this could be done via regex (as seems to be supported currently), for HTML via selectors, etc.
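To make the request concrete, here is a rough sketch of the kind of per-type hook I have in mind, using only the Python standard library. None of these names (`EXTRACTORS`, `extract_extra_metadata`, `extract_html_metadata`, `extract_text_metadata`) exist in unstructured; they are hypothetical, and the `Title:` regex rule for plain text is just an assumed example. A real implementation would presumably use proper CSS selectors rather than a hand-written parser.

```python
import re
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional

Metadata = Dict[str, Optional[str]]


class _HeadMetadataParser(HTMLParser):
    """Collects the <title> text and the first favicon <link> href."""

    def __init__(self) -> None:
        super().__init__()
        self.title_parts: List[str] = []
        self.favicon: Optional[str] = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "link" and self.favicon is None:
            attr = dict(attrs)
            # rel may hold several tokens, e.g. "shortcut icon"
            rel = (attr.get("rel") or "").lower().split()
            if "icon" in rel:
                self.favicon = attr.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_html_metadata(raw: str) -> Metadata:
    """HTML: pull head > title and the favicon link."""
    parser = _HeadMetadataParser()
    parser.feed(raw)
    parser.close()
    title = "".join(parser.title_parts).strip()
    return {"title": title or None, "favicon": parser.favicon}


def extract_text_metadata(raw: str) -> Metadata:
    """Plain text: assumed rule, a line of the form 'Title: ...'."""
    match = re.search(r"^Title:\s*(.+)$", raw, flags=re.MULTILINE)
    return {"title": match.group(1).strip() if match else None}


# Hypothetical registry: one extractor per document type.
EXTRACTORS: Dict[str, Callable[[str], Metadata]] = {
    "text/html": extract_html_metadata,
    "text/plain": extract_text_metadata,
}


def extract_extra_metadata(raw: str, content_type: str) -> Metadata:
    """Run the extractor registered for content_type, if any."""
    extractor = EXTRACTORS.get(content_type)
    return extractor(raw) if extractor else {}
```

Ideally users could register their own extractor per document type, and the result would be merged into the metadata of the partitioned elements.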

Describe alternatives you've considered
Doing it after partitioning and before indexing, but that is neither elegant nor efficient.

Additional context
Even using LLMs to extract metadata, as orchestration frameworks support, would be great.

feat/partition_metadata #2933