NVIDIA / nv-ingest

NVIDIA Ingest is an early access set of microservices for parsing hundreds of thousands of complex, messy unstructured PDFs and other enterprise documents into metadata and text to embed into retrieval systems.
Apache License 2.0

[FEA]: Support arbitrary python functions to determine document split points #194

Open randerzander opened 1 month ago

randerzander commented 1 month ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

Currently preventing usage

Please provide a clear description of the problem this feature solves

In an ideal world, we'd have cleanly extracted metadata for document section headers, including their locations.

We could then use those locations to split documents at the corresponding demarcations.

However, some separators are arbitrary text content that an out-of-the-box model is unlikely to ever identify. Users would therefore like to be able to supply an arbitrary Python function that returns split locations.

Describe the feature, and optionally a solution or implementation and any alternatives

See above
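
To make the request concrete, here is a rough sketch of what a user-supplied split function could look like. None of these names exist in nv-ingest: `SplitFn`, `split_on_marker`, and `apply_splits` are hypothetical, and the sketch assumes the function receives the extracted text as a single string and returns character offsets. The real interface might instead work on pages or extracted elements.

```python
import re
from typing import Callable, List

# Hypothetical contract (not an nv-ingest API): a split function takes the
# extracted document text and returns the character offsets at which a new
# chunk should begin.
SplitFn = Callable[[str], List[int]]


def split_on_marker(text: str) -> List[int]:
    """Example user function: start a new chunk at every line beginning with '### '."""
    return [m.start() for m in re.finditer(r"^### ", text, flags=re.MULTILINE)]


def apply_splits(text: str, split_fn: SplitFn) -> List[str]:
    """Cut text at the offsets returned by split_fn, ignoring out-of-range offsets."""
    # Deduplicate and sort; offsets at 0 or at/after the end would yield empty chunks.
    offsets = sorted({o for o in split_fn(text) if 0 < o < len(text)})
    bounds = [0, *offsets, len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b]]


if __name__ == "__main__":
    doc = "intro\n### A\nbody a\n### B\nbody b\n"
    for chunk in apply_splits(doc, split_on_marker):
        print(repr(chunk))
```

Returning offsets rather than the chunks themselves would keep the user function small and let the pipeline decide how to attach metadata to each chunk, but that is only one possible design.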

Additional context

No response