[Feature Request] build index on a sequence of json/jsonl files

MarshtompCS commented 11 months ago

When a corpus contains a very large number of documents, we usually split it into multiple files. Would it be possible to support passing a sequence of json/jsonl files to build an index?

MarshtompCS commented 11 months ago

I have implemented this by passing a generator that iterates over multiple files when building the index.

AmenRa commented 11 months ago

Would you mind posting your solution for other people? Thank you.

MarshtompCS commented 11 months ago

Sure! The solution is like:

Define an iterator over multiple files:

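import json

def many_files_line_iterator(files_list, callback=None):
    # Yield one parsed JSON object per line, across all files in order.
    for file in files_list:
        with open(file, "r") as fn:
            # Iterate over the file lazily so large files are not read into memory at once.
            for line in fn:
                line = json.loads(line)
                if callback:
                    yield callback(line)
                else:
                    yield line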

Pass this iterator as the collection:

files_list = ["path_to_jsonl_0", "path_to_jsonl_1"]
search_engine.index(many_files_line_iterator(files_list, callback=None))
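
To check the approach without a real corpus, here is a minimal sketch that writes two small temporary JSONL files and streams them through the same kind of generator. The field names (`doc_id`, `body`, `id`, `text`) and the `to_id_text` and `write_jsonl` helpers are hypothetical, chosen only for illustration; adapt them to whatever schema your index expects.

```python
import json
import os
import tempfile


def many_files_line_iterator(files_list, callback=None):
    """Yield one parsed JSON object per line across all files, in order."""
    for path in files_list:
        with open(path, "r", encoding="utf-8") as fn:
            for line in fn:
                doc = json.loads(line)
                yield callback(doc) if callback else doc


def to_id_text(doc):
    # Hypothetical callback: rename illustrative fields to "id" and "text".
    return {"id": doc["doc_id"], "text": doc["body"]}


def write_jsonl(path, docs):
    # Helper for this demo only: write one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for d in docs:
            f.write(json.dumps(d) + "\n")


# Build two tiny JSONL files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
files_list = [os.path.join(tmp_dir, f"part_{i}.jsonl") for i in range(2)]
write_jsonl(files_list[0], [{"doc_id": "a", "body": "first"},
                            {"doc_id": "b", "body": "second"}])
write_jsonl(files_list[1], [{"doc_id": "c", "body": "third"}])

# Materialize the stream here only to inspect it; an indexer would consume it lazily.
docs = list(many_files_line_iterator(files_list, callback=to_id_text))
print(docs)
```

Note that a generator can only be consumed once, so create a fresh one for each indexing pass.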

AmenRa / retriv

[Feature Request] build index on a sequence of json/jsonl files #27