Open devstein opened 1 year ago
File: langchain/document_loaders/directory.py
To make DirectoryLoader read files in parallel, you can use the concurrent.futures library. Here's how you can update the load method in the DirectoryLoader class:
Import the necessary libraries at the beginning of the file:
from concurrent.futures import ThreadPoolExecutor, as_completed
Update the load method to use a ThreadPoolExecutor for parallel file reading:
def load(self) -> List[Document]:
    """Load documents."""
    p = Path(self.path)
    docs = []
    items = list(p.rglob(self.glob) if self.recursive else p.glob(self.glob))

    pbar = None
    if self.show_progress:
        try:
            from tqdm import tqdm

            pbar = tqdm(total=len(items))
        except ImportError as e:
            logger.warning(
                "To log the progress of DirectoryLoader you need to install tqdm, "
                "`pip install tqdm`"
            )
            if self.silent_errors:
                logger.warning(e)
            else:
                raise e

    # Create a ThreadPoolExecutor for parallel file reading
    with ThreadPoolExecutor() as executor:
        # Map each future to the file path it was submitted for
        future_to_path = {
            executor.submit(self.loader_cls(str(i), **self.loader_kwargs).load): i
            for i in items
            if i.is_file() and (_is_visible(i.relative_to(p)) or self.load_hidden)
        }

        # Iterate through the completed futures and process their results
        for future in as_completed(future_to_path):
            i = future_to_path[future]
            try:
                sub_docs = future.result()
                docs.extend(sub_docs)
            except Exception as e:
                if self.silent_errors:
                    logger.warning(e)
                else:
                    raise e
            finally:
                if pbar:
                    pbar.update(1)

    if pbar:
        pbar.close()

    return docs
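This update lets DirectoryLoader read files in parallel, which should reduce total load time when loading is I/O-bound. Because the work runs in threads, CPU-heavy parsing is still limited by Python's global interpreter lock. Note also that as_completed yields results in completion order, so the order of the returned documents can differ between runs.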
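To try the same ThreadPoolExecutor / as_completed pattern outside of LangChain, here is a minimal standalone sketch. It is not the library's implementation: read_text_file is a hypothetical stand-in for loader_cls(str(path)).load(), and load_parallel, its parameters, and demo are names invented for this illustration. Unlike the method above, it collects results per path and reassembles them in sorted path order, so the output does not depend on which thread finishes first.

```python
import logging
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, List, Optional

logger = logging.getLogger(__name__)


def read_text_file(path: Path) -> List[str]:
    """Hypothetical stand-in for loader_cls(str(path)).load()."""
    return [path.read_text(encoding="utf-8")]


def load_parallel(
    root: str,
    glob: str = "*.txt",
    max_workers: Optional[int] = None,
    silent_errors: bool = False,
) -> List[str]:
    """Read every matching file under root using a thread pool."""
    paths = sorted(p for p in Path(root).rglob(glob) if p.is_file())
    results: Dict[Path, List[str]] = {}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its path so failures can be reported
        future_to_path = {executor.submit(read_text_file, p): p for p in paths}
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                if not silent_errors:
                    raise
                logger.warning("Skipping %s: %s", path, exc)

    # Reassemble in sorted path order so the output is deterministic
    docs: List[str] = []
    for p in paths:
        docs.extend(results.get(p, []))
    return docs


def demo() -> List[str]:
    """Write three small files to a temporary directory and load them."""
    with tempfile.TemporaryDirectory() as tmp:
        for n in range(3):
            (Path(tmp) / f"doc{n}.txt").write_text(f"content {n}", encoding="utf-8")
        return load_parallel(tmp, max_workers=2)
```

Calling demo() loads the three temporary files with two worker threads. Passing max_workers explicitly is optional; it is shown here only because capping the pool size can be useful when a directory contains many large files.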
How can I read the files in parallel to speed up the process?
langchain/langchain/document_loaders/directory.py