[FEA]: Add Python multi-file API for job submission

NVIDIA / nv-ingest

NVIDIA Ingest is a set of microservices for parsing hundreds of thousands of complex, messy unstructured PDFs and other enterprise documents into metadata and text to embed into retrieval systems.

Apache License 2.0


Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

Significant improvement

Please provide a clear description of problem this feature solves

Our current Python job submission API operates on a single file at a time (https://github.com/nvidia/nv-ingest?tab=readme-ov-file#step-3-ingesting-documents).

For users concerned with maximum throughput, submitting files one at a time may suggest suboptimal performance.

Describe the feature, and optionally a solution or implementation and any alternatives

We should provide a Python API that supports submitting multiple files at once.

It should support both a list of files and a directory.
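
A minimal sketch of what such an entry point could look like, assuming nothing about the existing client internals. The names `collect_files`, `submit_files`, and `submit_one` are hypothetical; `submit_one` stands in for whatever single-file submission call exists today. Whether a directory should be walked recursively is left open; this sketch only takes top-level files.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List, Union

PathLike = Union[str, Path]


def collect_files(source: Union[PathLike, Iterable[PathLike]]) -> List[Path]:
    """Expand a directory or an iterable of paths into a list of files."""
    if isinstance(source, (str, Path)):
        root = Path(source)
        if root.is_dir():
            # Top-level files only, in sorted order (assumption: no recursion).
            return sorted(p for p in root.iterdir() if p.is_file())
        return [root]
    # Already an iterable of paths: keep the caller's order.
    return [Path(p) for p in source]


def submit_files(
    source: Union[PathLike, Iterable[PathLike]],
    submit_one: Callable[[Path], object],  # hypothetical single-file submit call
    max_workers: int = 4,
) -> list:
    """Call submit_one for every file concurrently; results follow input order."""
    files = collect_files(source)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(submit_one, files))
```

Usage would then be either `submit_files("./data", submit_one)` or `submit_files(["a.pdf", "b.pdf"], submit_one)`. A thread pool is used here only because submission is likely I/O-bound; a batched request to the service would be another option.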

Additional context

No response

Is there a way we could add a lambda read function? What I am trying to do here is support use cases that may be outside the current target. We are currently focused heavily on supporting PDF files. It would be good if we could supply an optional lambda to this API that extracts the content as a list, so that the list could then be processed into JobSpecs. The goal here would be to allow other file types such as JSONL, dataframes, NumPy files, and other custom formats.

For instance, for JSONL files I used the following simple method to load data:

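import json

def load_json(file_path):
    # Read a JSONL file (one JSON object per line) into a list of (id, text) pairs.
    data_json = []
    with open(file_path, encoding="utf-8") as json_data:
        for line in json_data:
            if not line.strip():
                continue  # skip blank lines, e.g. a trailing newline
            j_line = json.loads(line)
            data_json.append((j_line['_id'], j_line['text']))
    return data_json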
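
To make the idea concrete, here is a rough sketch of how an optional reader could plug in. Everything in it is hypothetical: `build_job_payloads`, `default_reader`, and the dictionary keys are illustrative stand-ins, and the plain dicts are placeholders for real JobSpec objects, whose constructor is not assumed here.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

Record = Tuple[str, str]  # (document id, text content)


def load_jsonl(file_path) -> List[Record]:
    """Example custom reader: one (id, text) record per JSONL line."""
    records = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                records.append((row["_id"], row["text"]))
    return records


def default_reader(file_path) -> List[Record]:
    """Placeholder default: treat the whole file as a single text record."""
    path = Path(file_path)
    return [(path.name, path.read_text(encoding="utf-8"))]


def build_job_payloads(
    file_path, reader: Optional[Callable[[str], List[Record]]] = None
) -> List[Dict[str, str]]:
    """Turn one file into one payload per record returned by the reader."""
    read = reader or default_reader
    return [
        # Plain dicts stand in for JobSpec objects (hypothetical field names).
        {"source_id": doc_id, "source_file": str(file_path), "content": text}
        for doc_id, text in read(file_path)
    ]
```

With this shape, `build_job_payloads("docs.jsonl", reader=load_jsonl)` yields one payload per line, while omitting `reader` falls back to one payload per file.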

Feel free to push back.
