Open bee-san opened 1 year ago
Hi @bee-san,
Your idea to support recursive ingestion from all sub-directories can enhance the flexibility of our tool.
To implement this, we could utilize the os.scandir()
function in the Python's standard os
library. This function allows us to efficiently traverse through a directory tree, checking each entry to see if it's a directory or a file. If it's a directory, we can recursively process the directory, and if it's a file, we can process the file accordingly. Here's a rough pseudocode outline for the modification:
def process_directory(path):
for entry in os.scandir(path):
if entry.is_dir(follow_symlinks=False):
process_directory(entry.path)
elif entry.is_file(follow_symlinks=False):
process_file(entry.path)
If you have additional insights or a different approach in mind, please share. We're always open to new ideas and contributions. Alternatively, we can start to design a detailed plan for implementing this feature based on the provided pseudocode. We would love to hear your thoughts!
Hi @bee-san,
Your idea to support recursive ingestion from all sub-directories can enhance the flexibility of our tool.
To implement this, we could utilize the
os.scandir()
function in the Python's standardos
library. This function allows us to efficiently traverse through a directory tree, checking each entry to see if it's a directory or a file. If it's a directory, we can recursively process the directory, and if it's a file, we can process the file accordingly. Here's a rough pseudocode outline for the modification:def process_directory(path): for entry in os.scandir(path): if entry.is_dir(follow_symlinks=False): process_directory(entry.path) elif entry.is_file(follow_symlinks=False): process_file(entry.path)
If you have additional insights or a different approach in mind, please share. We're always open to new ideas and contributions. Alternatively, we can start to design a detailed plan for implementing this feature based on the provided pseudocode. We would love to hear your thoughts!
Thanks ChatGPT 🤣
I love how it always exaggerates and embellishes things. It's usually a dead give away. 😇
@bee-san @PromtEngineer @LeafmanZ
def load_documents(source_dir: str) -> list[Document]:
"""
Loads all documents from the specified source documents directory.
Args:
source_dir (str): The path to the source documents directory.
Returns:
List[Document]: A list of loaded documents.
Raises:
ValueError: If the document type is undefined.
"""
paths = []
logging.info(f"Loading documents: {source_dir}")
for dirpath, dirnames, filenames in os.walk(source_dir):
for file_path in filenames:
source_file_path = os.path.join(dirpath, file_path)
mime_type = loader_registry.get_mime_type(source_file_path)
loader_class = loader_registry.get_loader(mime_type)
logging.info(f"Detected {mime_type} for {file_path}")
if loader_class:
logging.info(f"Loading {source_file_path}")
paths.append(source_file_path)
# Have at least one worker and at most CPU_COUNT workers
n_workers = min(CPU_COUNT, max(len(paths), 1))
chunk_size = round(len(paths) / n_workers)
docs = []
with ProcessPoolExecutor(n_workers) as executor:
futures = []
# split the load operations into chunks
for i in range(0, len(paths), chunk_size):
# select a chunk of filenames
filepaths = paths[i : (i + chunk_size)]
# submit the task
future = executor.submit(load_document_batch, filepaths)
futures.append(future)
# process all results
for future in as_completed(futures):
# open the file and load the data
contents, _ = future.result()
docs.extend(contents)
return docs
This source code recursively searches through all sub-directories for source files and ingests them.
For anyone looking for a temporary solution for this, here's a python script that will recursively flatten the directory.
#!/usr/bin/env python3
import os
import argparse
import shutil
def rename_files(path):
for root, dirs, files in os.walk(path):
print(f"Processing directory: {root}, {dirs}, {files}")
root_strip = os.path.split(root)[1:]
root_replace = "_".join(root_strip)
for file in files:
original = os.path.join(root, file)
new = os.path.join(path, root_replace + "_" + file)
print(f"Renaming {original} to {new}")
os.rename(original, new)
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('path', help='Path to the directory to flatten')
return parser.parse_args()
def remove_directories(path):
for root, dirs, files in os.walk(path, topdown=False):
for dir in dirs:
dir_path = os.path.join(root, dir)
print(f"Removing directory: {dir_path}")
shutil.rmtree(dir_path)
def main():
args = parse_args()
rename_files(args.path)
remove_directories(args.path)
if __name__ == '__main__':
main()
Hey!
I'd really like to be able to the ingestion script recursively on all sub-directories. I don't think it does this currently.