Open randerzander opened 4 days ago
Is there a way we could add a lambda read function? What I am trying to do here is support use cases that may be outside the current target. Right now we focus heavily on PDF files. It would be good if we could supply an optional lambda to this API that extracts the content as a list, so that the list could then be processed into JobSpecs. The goal would be to allow other file types such as JSONL, dataframes, NumPy files, and other custom formats.
For instance, for JSONL files I used the following simple method to load data:
import json

def load_json(file_path):
    data_json = []
    with open(file_path) as json_data:
        for line in json_data:
            j_line = json.loads(line)
            data_json.append((j_line['_id'], j_line['text']))
    return data_json
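To make the idea concrete, here is a rough sketch of how an optional reader callable could feed job specs. Everything in it is an assumption for illustration: `SketchJobSpec` is a hypothetical stand-in rather than the real JobSpec class, and `build_job_specs` and its `reader` parameter are made-up names, not part of the existing API.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple

Record = Tuple[str, str]  # (document id, text), the shape the loader above returns


@dataclass
class SketchJobSpec:
    """Hypothetical stand-in for a job spec; NOT the real JobSpec class."""
    source_id: str
    payload: str
    document_type: str = "text"


def load_jsonl(file_path: str) -> List[Record]:
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with open(file_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                row = json.loads(line)
                records.append((row["_id"], row["text"]))
    return records


def build_job_specs(
    file_path: str,
    reader: Callable[[str], List[Record]],
) -> List[SketchJobSpec]:
    """Run the caller-supplied reader and turn each record into a job spec."""
    return [
        SketchJobSpec(source_id=str(doc_id), payload=text)
        for doc_id, text in reader(file_path)
    ]
```

A caller would pass `load_jsonl`, or any lambda that returns `(id, text)` pairs, so new formats would not need changes to the submission code itself.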
Feel free to push back.
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request?
Significant improvement
Please provide a clear description of the problem this feature solves
Our current Python job submission API operates on a single file at a time (https://github.com/nvidia/nv-ingest?tab=readme-ov-file#step-3-ingesting-documents)
For users concerned with maximum throughput, this one-file-at-a-time interface might suggest suboptimal performance.
Describe the feature, and optionally a solution or implementation and any alternatives
We should provide a Python API that supports submitting multiple files at once.
It should accept either a list of files or a directory.
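One possible shape for this, as a minimal sketch only: normalize the input (a single path, a directory, or a list of paths) into a list of files, then hand each file to the existing single-file submission. The names `resolve_input_files`, `submit_many`, `submit_one`, `pattern`, and `max_workers` are all hypothetical, and `submit_one` stands in for whatever callable performs the current one-file submission. The thread pool is just one way to overlap submissions, not a claim about how the client should do it.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List, TypeVar, Union

PathLike = Union[str, Path]
T = TypeVar("T")


def resolve_input_files(
    source: Union[PathLike, Iterable[PathLike]],
    pattern: str = "*",
) -> List[Path]:
    """Expand a directory, a single file path, or a list of paths into files."""
    if isinstance(source, (str, Path)):
        root = Path(source)
        if root.is_dir():
            # Non-recursive match; sorted so the order is deterministic.
            return sorted(p for p in root.glob(pattern) if p.is_file())
        return [root]
    return [Path(p) for p in source]


def submit_many(
    source: Union[PathLike, Iterable[PathLike]],
    submit_one: Callable[[Path], T],
    max_workers: int = 4,
) -> List[T]:
    """Call the hypothetical single-file submitter on every resolved file."""
    files = resolve_input_files(source)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input files.
        return list(pool.map(submit_one, files))
```

With this shape, `submit_many("docs/", submit_one)` and `submit_many(["a.pdf", "b.pdf"], submit_one)` would go through the same code path.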
Additional context
No response