PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.

Support sub directories in ingestion #179

Open · bee-san opened this issue 1 year ago

bee-san commented 1 year ago

Hey!

I'd really like to be able to run the ingestion script recursively on all sub-directories. I don't think it does this currently.

teleprint-me commented 1 year ago

Hi @bee-san,

Your idea to support recursive ingestion from all sub-directories can enhance the flexibility of our tool.

To implement this, we could use the os.scandir() function from Python's standard os library. It lets us traverse a directory tree efficiently, checking each entry to see whether it's a directory or a file: directories are processed recursively, and files are handed to the ingestion logic. Here's a rough pseudocode outline for the modification:

import os

def process_directory(path):
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            # Recurse into sub-directories; symlinks are skipped to avoid cycles
            process_directory(entry.path)
        elif entry.is_file(follow_symlinks=False):
            process_file(entry.path)  # existing per-file ingestion logic
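
For comparison, here's an equivalent sketch built on os.walk, which descends into every sub-directory for us and avoids explicit recursion; process_file stands in for whatever per-file ingestion hook the script ends up with:

import os

def process_tree(root: str) -> None:
    # os.walk yields (directory, sub-directories, files) for every
    # directory beneath root, so no manual recursion is needed
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            process_file(os.path.join(dirpath, filename))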

If you have additional insights or a different approach in mind, please share. We're always open to new ideas and contributions. Alternatively, we can start to design a detailed plan for implementing this feature based on the provided pseudocode. We would love to hear your thoughts!

bee-san commented 1 year ago

Thanks ChatGPT 🤣

teleprint-me commented 1 year ago

I love how it always exaggerates and embellishes things. It's usually a dead giveaway. 😇

teleprint-me commented 1 year ago

@bee-san @PromtEngineer @LeafmanZ

import logging
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

# CPU_COUNT, Document, loader_registry, and load_document_batch are
# defined elsewhere in the ingestion module.

def load_documents(source_dir: str) -> list[Document]:
    """
    Loads all documents from the specified source documents directory,
    recursing into sub-directories.

    Args:
        source_dir (str): The path to the source documents directory.

    Returns:
        list[Document]: A list of loaded documents. Files whose type has
        no registered loader are skipped.
    """
    paths = []

    logging.info(f"Loading documents: {source_dir}")

    # os.walk visits every sub-directory beneath source_dir
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for file_name in filenames:
            source_file_path = os.path.join(dirpath, file_name)
            mime_type = loader_registry.get_mime_type(source_file_path)
            loader_class = loader_registry.get_loader(mime_type)

            logging.info(f"Detected {mime_type} for {file_name}")

            if loader_class:
                logging.info(f"Loading {source_file_path}")
                paths.append(source_file_path)

    # Have at least one worker and at most CPU_COUNT workers
    n_workers = min(CPU_COUNT, max(len(paths), 1))
    # Keep the chunk size at least 1 so range() below never gets a zero step
    chunk_size = max(round(len(paths) / n_workers), 1)
    docs = []

    with ProcessPoolExecutor(n_workers) as executor:
        futures = []
        # split the load operations into chunks
        for i in range(0, len(paths), chunk_size):
            # select a chunk of file paths
            filepaths = paths[i : (i + chunk_size)]
            # submit the batch to a worker process
            future = executor.submit(load_document_batch, filepaths)
            futures.append(future)
        # collect results as the workers finish
        for future in as_completed(futures):
            contents, _ = future.result()
            docs.extend(contents)

    return docs

This source code recursively searches through all sub-directories for source files and ingests them.
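
For anyone reading along, here's a minimal sketch of what the load_document_batch helper it submits to each worker might look like; the loader_registry calls mirror the snippet above, and the loader_class(path).load() convention is an assumption (LangChain-style loaders), not necessarily the repo's actual implementation:

def load_document_batch(filepaths: list[str]) -> tuple[list[Document], list[str]]:
    """Load one batch of files inside a worker process."""
    docs: list[Document] = []
    for file_path in filepaths:
        mime_type = loader_registry.get_mime_type(file_path)
        loader_class = loader_registry.get_loader(mime_type)
        if loader_class:
            # Assumption: loaders follow the LangChain convention of
            # loader(path).load() returning a list of Document objects
            docs.extend(loader_class(file_path).load())
    # Return the paths too so callers can correlate results with inputs
    return docs, filepaths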

bannsec commented 12 months ago

For anyone looking for a temporary workaround, here's a Python script that recursively flattens the directory, moving every file to the top level and encoding its original location in the file name.

#!/usr/bin/env python3

import os
import argparse
import shutil

def rename_files(path):
    for root, dirs, files in os.walk(path):
        print(f"Processing directory: {root}, {dirs}, {files}")
        # Build the prefix from the path relative to the top level, so that
        # same-named directories in different branches can't collide
        rel = os.path.relpath(root, path)
        if rel == os.curdir:
            continue  # files already at the top level keep their names
        root_replace = rel.replace(os.sep, "_")
        for file in files:
            original = os.path.join(root, file)
            new = os.path.join(path, root_replace + "_" + file)
            print(f"Renaming {original} to {new}")
            os.rename(original, new)

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('path', help='Path to the directory to flatten')
    return parser.parse_args()

def remove_directories(path):
    # Walk bottom-up so sub-directories are handled before their parents;
    # each directory is removed exactly once, via its parent's listing
    for root, dirs, files in os.walk(path, topdown=False):
        for dir in dirs:
            dir_path = os.path.join(root, dir)
            print(f"Removing directory: {dir_path}")
            shutil.rmtree(dir_path)

def main():
    args = parse_args()
    rename_files(args.path)
    remove_directories(args.path)

if __name__ == '__main__':
    main()
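
Assuming you save this as flatten.py, usage is just:

python3 flatten.py SOURCE_DOCUMENTS

where SOURCE_DOCUMENTS is your documents directory (the default folder name in a stock localGPT checkout). The renames and deletions are destructive, so run it on a copy first.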