NVIDIA / nv-ingest

NVIDIA Ingest is an early access set of microservices for parsing hundreds of thousands of complex, messy unstructured PDFs and other enterprise documents into metadata and text to embed into retrieval systems.
Apache License 2.0

Add doughnut http endpoint #230

Open edknv opened 1 week ago

edknv commented 1 week ago

Description

Part of https://github.com/NVIDIA/nv-ingest-private/issues/52. Also closes https://github.com/NVIDIA/nv-ingest-private/issues/49.

Checklist

edknv commented 11 hours ago

321052f adds support for preserving text bounding boxes in the metadata (in the hierarchy field), which addresses https://github.com/NVIDIA/nv-ingest-private/issues/49. It also changes how text blocks are combined: they are now concatenated with \n\n, as requested by the research team.
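
For readers skimming the thread, here is a rough sketch of the behavior described in this comment. It is not the code from 321052f. The function name `join_text_blocks`, the input block layout (`text`, `bbox`), and the `text_bboxes` key inside `hierarchy` are all assumptions made for illustration; only the `\n\n` separator and the idea of keeping bounding boxes in the hierarchy field come from the comment itself.

```python
from typing import Any, Dict, List


def join_text_blocks(blocks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Join text blocks with a blank line and keep each block's bounding box.

    Hypothetical helper: each input block is assumed to look like
    {"text": str, "bbox": (x0, y0, x1, y1)}.
    """
    # Skip blocks that contain only whitespace so the joined text
    # does not accumulate empty paragraphs.
    kept = [b for b in blocks if b["text"].strip()]

    return {
        # Blocks are separated by "\n\n", as described in the comment.
        "content": "\n\n".join(b["text"].strip() for b in kept),
        # "text_bboxes" is an assumed key name; the comment only says the
        # boxes are stored somewhere in the hierarchy field.
        "hierarchy": {"text_bboxes": [b["bbox"] for b in kept]},
    }


blocks = [
    {"text": "Title", "bbox": (0, 0, 10, 2)},
    {"text": "   ", "bbox": (0, 3, 10, 4)},
    {"text": "Body text", "bbox": (0, 5, 10, 9)},
]
result = join_text_blocks(blocks)
```

Keeping the bounding boxes in the same order as the joined text means a consumer can still map each paragraph of `content` back to its position on the page.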