AmenRa / retriv

A Python Search Engine for Humans 🥸
MIT License

[Feature Request] build index on a sequence of json/jsonl files #27

Closed MarshtompCS closed 11 months ago

MarshtompCS commented 11 months ago

When a corpus contains a very large number of documents, we usually split it into multiple files. Would it be possible to support passing a sequence of json/jsonl files to build the index?

MarshtompCS commented 11 months ago

I have implemented this by passing a generator that iterates over multiple files when building the index.

AmenRa commented 11 months ago

Would you mind posting your solution for other people? Thank you.

MarshtompCS commented 11 months ago

Sure! The solution is like:

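  1. Define an iterator over multiple files
import json

def many_files_line_iterator(files_list, callback=None):
    # Lazily yield one parsed JSON document per line, across all files.
    for file in files_list:
        with open(file, "r") as fn:
            # Iterate the file object directly so large files are streamed
            # instead of being loaded into memory by readlines().
            for line in fn:
                line = json.loads(line)
                if callback:
                    yield callback(line)
                else:
                    yield line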
  2. Pass this iterator as the collection
files_list = ["path_to_jsonl_0", "path_to_jsonl_1"]
search_engine.index(many_files_line_iterator(files_list, callback=None))
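
For anyone who wants to try the idea without a real corpus, here is a minimal, self-contained sketch of the same approach. It is only an illustration under some assumptions: the helper names (`iter_jsonl_files`, `to_id_text`, `demo`) and the field names `doc_id` and `body` are hypothetical, and the callback assumes the engine wants each document as a dict with `id` and `text` keys, so adapt it to your own data. The demo writes two tiny shards to a temporary directory and collects the results in a list just to make the output easy to inspect.

```python
import json
import tempfile
from pathlib import Path


def iter_jsonl_files(paths, callback=None):
    """Yield one parsed document per non-empty line across all files."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # tolerate blank lines between records
                doc = json.loads(line)
                yield callback(doc) if callback else doc


def to_id_text(doc):
    # Hypothetical field names: replace "doc_id" and "body" with your own.
    return {"id": doc["doc_id"], "text": doc["body"]}


def demo():
    # Write two small JSONL shards to a temporary directory for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        shards = []
        for i in range(2):
            shard = Path(tmp) / f"shard_{i}.jsonl"
            rows = [{"doc_id": f"d{i}{j}", "body": f"text {i}{j}"} for j in range(2)]
            shard.write_text(
                "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
            )
            shards.append(shard)
        # sorted() keeps the shard order deterministic.
        return list(iter_jsonl_files(sorted(shards), callback=to_id_text))
```

In real use you would probably skip the `list(...)` call and hand the generator straight to `search_engine.index(...)` as in step 2, so that the documents are never all held in memory at once.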