NVIDIA Ingest is an early access set of microservices for parsing hundreds of thousands of complex, messy unstructured PDFs and other enterprise documents into metadata and text to embed into retrieval systems.
Apache License 2.0
[FEA]: Support arbitrary python functions to determine document split points #194
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Currently preventing usage
Please provide a clear description of problem this feature solves
In an ideal world we'd have cleanly extracted document section header metadata including location.
Then we could use the location of such document demarcations to support splitting on those demarcations.
However, some separators are arbitrary text content that an out-of-the-box model is unlikely to ever identify. As a result, users would like to be able to supply an arbitrary Python function that returns the split locations.
Describe the feature, and optionally a solution or implementation and any alternatives
See above
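To make the request concrete, here is a rough sketch of what such a hook could look like. None of this is an existing NVIDIA Ingest API: `SplitFn`, `split_by_function`, and `split_on_exhibit_markers` are hypothetical names, and the sketch assumes the user function receives the extracted text and returns character offsets.

```python
import re
from typing import Callable, List

# Hypothetical contract: the user function takes the document text and
# returns the character offsets at which the text should be split.
SplitFn = Callable[[str], List[int]]


def split_by_function(text: str, find_splits: SplitFn) -> List[str]:
    """Split `text` at the offsets returned by the user-supplied function."""
    # Keep only offsets strictly inside the text, de-duplicated and sorted,
    # so a misbehaving user function cannot produce empty or overlapping chunks.
    offsets = sorted({i for i in find_splits(text) if 0 < i < len(text)})
    bounds = [0, *offsets, len(text)]
    return [text[start:end] for start, end in zip(bounds, bounds[1:])]


def split_on_exhibit_markers(text: str) -> List[int]:
    """Example user function: split before each line starting with 'EXHIBIT '."""
    pattern = re.compile(r"^EXHIBIT \w+", flags=re.MULTILINE)
    return [match.start() for match in pattern.finditer(text)]


doc = "Intro text\nEXHIBIT A\nfirst body\nEXHIBIT B\nsecond body\n"
chunks = split_by_function(doc, split_on_exhibit_markers)
```

With this shape, the separator logic stays entirely in user code (here a regex for an arbitrary "EXHIBIT" marker), and the splitter only needs to validate and apply the returned offsets.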
Additional context
No response