devstein / langchain

⚡ Building applications with LLMs through composability ⚡
MIT License

Make DirectoryLoader read files in parallel to reduce file reading time #9

Open devstein opened 1 year ago

devstein commented 1 year ago

How can I read the files in parallel to speed up the process?

langchain/langchain/document_loaders/directory.py

dosubot[bot] commented 1 year ago

File: langchain/document_loaders/directory.py

To make DirectoryLoader read files in parallel, you can use the standard-library concurrent.futures module. Here's how you can update the load method in the DirectoryLoader class:

  1. Import the necessary libraries at the beginning of the file:

    from concurrent.futures import ThreadPoolExecutor, as_completed

  2. Update the load method to use a ThreadPoolExecutor for parallel file reading:

def load(self) -> List[Document]:
    """Load documents."""
    p = Path(self.path)
    docs = []
    items = list(p.rglob(self.glob) if self.recursive else p.glob(self.glob))

    pbar = None
    if self.show_progress:
        try:
            from tqdm import tqdm

            pbar = tqdm(total=len(items))
        except ImportError as e:
            logger.warning(
                "To log the progress of DirectoryLoader you need to install tqdm, "
                "`pip install tqdm`"
            )
            if self.silent_errors:
                logger.warning(e)
            else:
                raise e

    def load_file(path: Path) -> List[Document]:
        # Build the loader inside the worker thread so that constructor errors
        # are handled by the same try/except as load errors
        return self.loader_cls(str(path), **self.loader_kwargs).load()

    # Create a ThreadPoolExecutor for parallel file reading
    with ThreadPoolExecutor() as executor:
        # Map each future to the file path it is loading
        future_to_path = {
            executor.submit(load_file, i): i
            for i in items
            if i.is_file() and (_is_visible(i.relative_to(p)) or self.load_hidden)
        }

        # Iterate through the completed futures and process their results
        for future in as_completed(future_to_path):
            i = future_to_path[future]
            try:
                sub_docs = future.result()
                docs.extend(sub_docs)
            except Exception as e:
                if self.silent_errors:
                    logger.warning(f"Error loading {i}: {e}")
                else:
                    raise e
            finally:
                if pbar:
                    pbar.update(1)

    if pbar:
        pbar.close()

    return docs

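This update lets DirectoryLoader load files concurrently, which should reduce the total loading time when the work is I/O-bound (reading from disk or network). CPU-heavy parsing gains less from threads in CPython because of the global interpreter lock. Note that results are collected in completion order, so the order of the returned documents can differ from the directory order and between runs.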
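To try the pattern outside of langchain, here is a minimal, self-contained sketch. It is not DirectoryLoader code: `read_text`, `load_unordered`, `load_ordered`, `make_sample_dir`, and the `max_workers` parameter are hypothetical names chosen for illustration, and `read_text` merely stands in for a loader's `load()` call. It contrasts `as_completed` (results arrive in completion order) with `executor.map` (results come back in input order), which may be useful if a stable document order matters.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, List


def read_text(path: Path) -> str:
    """Hypothetical stand-in for a loader's load() call."""
    return path.read_text(encoding="utf-8")


def load_unordered(paths: List[Path], max_workers: int = 4) -> Dict[str, str]:
    """Read files concurrently, collecting results as each one finishes."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it is reading
        future_to_path = {executor.submit(read_text, p): p for p in paths}
        for future in as_completed(future_to_path):
            # future.result() re-raises any exception from the worker thread
            results[future_to_path[future].name] = future.result()
    return results


def load_ordered(paths: List[Path], max_workers: int = 4) -> List[str]:
    """Read files concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `paths`
        return list(executor.map(read_text, paths))


def make_sample_dir(n: int = 5) -> List[Path]:
    """Create a temporary directory with n small text files (demo only)."""
    root = Path(tempfile.mkdtemp())
    paths = []
    for k in range(n):
        f = root / f"doc_{k}.txt"
        f.write_text(f"contents {k}", encoding="utf-8")
        paths.append(f)
    return paths
```

Calling `load_ordered(make_sample_dir())` returns the file contents in the same order as the input paths, while `load_unordered` keys results by file name because arrival order is not guaranteed. One trade-off to keep in mind: iterating `executor.map` raises the first worker exception it reaches, so per-file error handling like `silent_errors` is easier to express with `as_completed`.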