PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.
Apache License 2.0
19.54k stars 2.19k forks

Handling ingestion file types #171

Open teleprint-me opened 1 year ago

teleprint-me commented 1 year ago

Issue: Handling ingestion file types

The issue is handling the variety of file types.

e.g.

DOCUMENT_MAP = {
    ".txt": TextLoader,
    ".py": TextLoader,
    ".pdf": PDFMinerLoader,
    ".csv": CSVLoader,
    ".xls": UnstructuredExcelLoader,
    ".xlxs": UnstructuredExcelLoader,
}

This can quickly get out of hand for obvious reasons.

Potential Solution 1: The Decorator Pattern

Use a Decorator pattern to integrate MIME types for files.

import os

# ... other imports ...

# Define the folder for storing database
ROOT_DIRECTORY = os.path.dirname(os.path.realpath(__file__))
SOURCE_DIRECTORY = f"{ROOT_DIRECTORY}/SOURCE_DOCUMENTS"
PERSIST_DIRECTORY = f"{ROOT_DIRECTORY}/DB"
INGEST_THREADS = os.cpu_count() or 8

# Define the Chroma settings
CHROMA_SETTINGS = Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=PERSIST_DIRECTORY,
    anonymized_telemetry=False,
)

# ... other settings ...

# Default Instructor Model
EMBEDDING_MODEL_NAME = "hkunlp/instructor-large"

DOCUMENT_MAP = {}  # populated automatically by the decorator below

def loader_for(*extensions):
    def decorator(cls):
        for ext in extensions:
            DOCUMENT_MAP[ext] = cls
        return cls
    return decorator

@loader_for(".txt", ".py")
class TextLoader:
    pass

@loader_for(".pdf")
class PDFMinerLoader:
    pass

@loader_for(".csv")
class CSVLoader:
    pass

@loader_for(".xls", ".xlxs")
class UnstructuredExcelLoader:
    pass
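
As a self-contained sanity check (stub classes stand in for the real langchain loaders so the snippet runs on its own), the decorator populates the map at class-definition time, with no manual map edits:

```python
DOCUMENT_MAP = {}  # the decorator fills this in as classes are defined

def loader_for(*extensions):
    def decorator(cls):
        for ext in extensions:
            DOCUMENT_MAP[ext] = cls
        return cls
    return decorator

@loader_for(".txt", ".py")
class TextLoader:  # stub; the real class comes from langchain
    pass

@loader_for(".pdf")
class PDFMinerLoader:  # stub
    pass

# Both extensions now resolve to TextLoader without touching the map directly.
assert DOCUMENT_MAP[".txt"] is TextLoader
assert DOCUMENT_MAP[".py"] is TextLoader
assert DOCUMENT_MAP[".pdf"] is PDFMinerLoader
```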

Potential Solution 2: Use a Registry

Use a Registry that can automate mappings between MIME types for files.

e.g.

from collections import defaultdict

class LoaderRegistry:
    def __init__(self):
        self.loader_map = defaultdict(list)

    def register_loader(self, extension, loader_class):
        self.loader_map[extension].append(loader_class)

    def get_loader(self, extension):
        loader_classes = self.loader_map.get(extension)
        if loader_classes:
            return loader_classes[0]  # Return the first matching loader class
        else:
            return None

# Create an instance of the LoaderRegistry
loader_registry = LoaderRegistry()

# Register the Loader classes with the LoaderRegistry

# Assuming the following import statements for the Loader classes
from langchain.document_loaders import (
    CSVLoader,
    PDFMinerLoader,
    TextLoader,
    UnstructuredExcelLoader,
)

# Register the Loader classes with their respective file extensions
loader_registry.register_loader(".txt", TextLoader)
loader_registry.register_loader(".py", TextLoader)

Solutions 1 and 2 aren't ideal because they still require manual updates and maintenance, which doesn't address the root problem.

Potential Solution 3: Use both a Registry and MIME types

My preferred solution would be to use MIME types to follow standardization and allow for portability and flexibility.

We can use MIME types instead of relying on file extensions. MIME types already have a standard we can follow; all we would need to do is map each Loader class to the appropriate MIME type.

from collections import defaultdict

class LoaderRegistry:
    def __init__(self):
        self.loader_map = defaultdict(list)

    def register_loader(self, mime_type, loader_class):
        self.loader_map[mime_type].append(loader_class)

    def get_loader(self, mime_type):
        loader_classes = self.loader_map.get(mime_type)
        if loader_classes:
            return loader_classes[0]  # Return the first matching loader class
        else:
            return None

# Create an instance of the LoaderRegistry
loader_registry = LoaderRegistry()

# Register the Loader classes with the LoaderRegistry using MIME types
loader_registry.register_loader("text/plain", TextLoader)
loader_registry.register_loader("application/pdf", PDFMinerLoader)
loader_registry.register_loader("text/csv", CSVLoader)
loader_registry.register_loader("application/vnd.ms-excel", UnstructuredExcelLoader)

# Determine the MIME type of the file
file_path = "path/to/file.pdf"  # Example file path
mime_type = get_mime_type(file_path)

# Retrieve the appropriate Loader class based on the MIME type
loader_class = loader_registry.get_loader(mime_type)
if loader_class:
    # Create an instance of the Loader class and use it
    loader = loader_class()
    # ... use the loader ...
else:
    print("No Loader class found for the given MIME type.")

The main benefit of this approach is reduced maintenance: we no longer need to register a file extension for each Loader class.

chfix commented 1 year ago

Could you please elaborate on exactly which snippets of code need to be changed in order to update the file ingestion?

teleprint-me commented 1 year ago

You can check out my dev branch where I'm experimenting with the full source.

https://github.com/teleprint-me/localGPT/tree/dev/localGPT

I'm doing my best to clean it up and streamline it.

We'll also need to handle text splitting for certain sources.

We can apply a similar pattern, but we'll need to adapt it to that context to keep it clean, maintainable, and updatable.

This would be related to issues #147, #151, #157, #165, etcetera.

There are more issues obviously... like handling CLI options and getting the UI to behave accordingly. My focus is mostly on ingestion in this case.

My dev branch is just to explore all possible solutions to current open issues and pull requests. I'm not expecting anything out of it and plan on reusing it elsewhere.

I'm just posting my results to contribute back.

teleprint-me commented 1 year ago

The following are the languages supported by langchain for document splitting:

# langchain/text_splitter.py

class Language(str, Enum):
    CPP = "cpp"
    GO = "go"
    JAVA = "java"
    JS = "js"
    PHP = "php"
    PROTO = "proto"
    PYTHON = "python"
    RST = "rst"
    RUBY = "ruby"
    RUST = "rust"
    SCALA = "scala"
    SWIFT = "swift"
    MARKDOWN = "markdown"
    LATEX = "latex"
    HTML = "html"
    SOL = "sol"

This gives us a finite set of languages (I'm sure it will be extended over time).

So, constants might look something like this:

# localGPT/constants.py
from typing import Tuple, Type

from langchain.document_loaders.base import BaseLoader

# A tuple of tuples associating MIME types with loader classes.
# Each inner tuple consists of a MIME type string and a loader class.
# NOTE: Tuple[Tuple[str, Type[BaseLoader]], ...] means a tuple containing
# any number of (MIME type string, loader class) pairs.
MIME_TYPES: Tuple[Tuple[str, Type[BaseLoader]], ...] = (
    ("text/plain", TextLoader),
    ("application/pdf", PDFMinerLoader),
    ("text/csv", CSVLoader),
    ("application/vnd.ms-excel", UnstructuredExcelLoader),
)

# `str` is the file extension and Language is the Enum mapped to it
LANGUAGE_TYPES: Tuple[Tuple[str, str], ...] = (
    ("cpp", Language.CPP),  # C++ source files
    ("go", Language.GO),  # Go source files
    ("java", Language.JAVA),  # Java source files
    ("js", Language.JS),  # JavaScript source files
    ("php", Language.PHP),  # PHP source files
    ("proto", Language.PROTO),  # Protocol Buffers files
    ("py", Language.PYTHON),  # Python source files
    ("rst", Language.RST),  # reStructuredText files
    ("rb", Language.RUBY),  # Ruby source files
    ("rs", Language.RUST),  # Rust source files
    ("scala", Language.SCALA),  # Scala source files
    ("swift", Language.SWIFT),  # Swift source files
    ("md", Language.MARKDOWN),  # Markdown files
    ("tex", Language.LATEX),  # LaTeX files
    ("html", Language.HTML),  # HTML files
    ("sol", Language.SOL),  # Solidity files
)
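
Since both constants are tuples of pairs, the registries can build their lookup tables with a single `dict()` call; a self-contained sketch (plain strings stand in for the `Language` enum members):

```python
# Trimmed-down stand-in for the LANGUAGE_TYPES constant above
LANGUAGE_TYPES = (
    ("py", "python"),
    ("md", "markdown"),
    ("rs", "rust"),
)

# dict() accepts any iterable of (key, value) pairs directly
language_by_extension = dict(LANGUAGE_TYPES)

assert language_by_extension["py"] == "python"
assert language_by_extension.get("xyz") is None  # unknown extension
```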

So, we could extend the registry to handle this as well.

# localGPT/registry.py
class TextSplitterRegistry:
    """
    A registry for languages based on file extensions.
    """

    def __init__(self):
        """
        Initializes the TextSplitterRegistry.
        """
        self.language_map = {}  # a plain dict; no default factory is needed

        # Register languages for file extensions
        for file_extension, language in LANGUAGE_TYPES:
            self.register_language(file_extension, language)

    def register_language(
        self,
        file_extension: str,
        language: str,
    ) -> None:
        """
        Registers a language for a specific file extension.

        Args:
            file_extension (str): The file extension to register the language for.
            language (str): The language to register.
        """
        self.language_map[file_extension] = language

    def get_language(
        self,
        file_extension: str,
    ) -> Optional[str]:
        """
        Returns the language for a specific file extension.

        Args:
            file_extension (str): The file extension to retrieve the language for.

        Returns:
            Optional[str]: The language if found, None otherwise.
        """
        return self.language_map.get(file_extension)

And then just modify the text splitter in ingest.py:

def split_documents(documents: List[Document]) -> List[Document]:
    """
    Splits the given documents based on their type for the correct Text Splitter.

    Args:
        documents (List[Document]): The list of documents to split.

    Returns:
        List[Document]: A list of split documents.
    """
    logging.info(f"Splitting: {[doc.metadata['source'] for doc in documents]}")

    text_docs, python_docs = [], []
    loader_registry = LoaderRegistry()

    for doc in documents:
        logging.info(f"Splitting: {doc.metadata['source']}")
        mime_type = loader_registry.get_mime_type(doc.metadata["source"])
        logging.info(f"Splitting: {mime_type}")
        loader_class = loader_registry.get_loader(mime_type)

        # get_loader returns a class (or None), so test with issubclass
        if loader_class is not None and issubclass(loader_class, TextLoader):
            if loader_registry.has_extension(doc, "py"):
                python_docs.append(doc)
            else:
                text_docs.append(doc)
        else:
            text_docs.append(doc)

    # NOTE: Splitters should be abstracted to allow plug n' play
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200
    )
    python_splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=1000, chunk_overlap=200
    )
    text_documents = text_splitter.split_documents(text_docs)
    python_documents = python_splitter.split_documents(python_docs)

    return text_documents + python_documents
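
The `get_mime_type` and `has_extension` methods used above are not shown anywhere in this thread; here is a hypothetical, standard-library sketch of what they might look like (`has_extension` takes a plain path here rather than a `Document`, purely for illustration):

```python
import mimetypes
import os

class LoaderRegistryHelpers:
    """Hypothetical helpers assumed on LoaderRegistry in the snippet above."""

    def get_mime_type(self, file_path):
        mime_type, _encoding = mimetypes.guess_type(file_path)
        # Fall back to a generic binary type when the extension is unknown
        return mime_type or "application/octet-stream"

    def has_extension(self, file_path, extension):
        # Compare against the extension without its leading dot
        return os.path.splitext(file_path)[1].lstrip(".") == extension

helpers = LoaderRegistryHelpers()
assert helpers.get_mime_type("doc.pdf") == "application/pdf"
assert helpers.has_extension("script.py", "py")
assert not helpers.has_extension("notes.txt", "py")
```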

This is just a rough sketch, but it could be cleaned up:

def split_documents(documents: List[Document]) -> List[Document]:
    """
    Splits the given documents based on their type for the correct Text Splitter.

    Args:
        documents (List[Document]): The list of documents to split.

    Returns:
        List[Document]: A list of split documents.
    """
    logging.info(f"Splitting: {[doc.metadata['source'] for doc in documents]}")

    loader_registry = LoaderRegistry()
    splitter_registry = TextSplitterRegistry()

    split_docs = []  # local name; avoids shadowing this function
    for doc in documents:
        logging.info(f"Splitting: {doc.metadata['source']}")
        # Get the file extension without the leading dot
        file_extension = os.path.splitext(doc.metadata["source"])[1][1:]
        language = splitter_registry.get_language(file_extension)

        # If we have a language for this file extension, use a language-specific splitter
        if language is not None:
            splitter = RecursiveCharacterTextSplitter.from_language(
                language=language, chunk_size=1000, chunk_overlap=200
            )
        # Otherwise, use a default text splitter
        else:
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=1000, chunk_overlap=200
            )

        split_docs.extend(splitter.split_documents([doc]))

    return split_docs
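
The extension-extraction step above can be checked in isolation (standard library only):

```python
import os

def extension_without_dot(path: str) -> str:
    # os.path.splitext returns e.g. ("src/ingest", ".py"); [1:] strips the dot
    return os.path.splitext(path)[1][1:]

assert extension_without_dot("src/ingest.py") == "py"
assert extension_without_dot("README") == ""  # no extension at all
```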

teleprint-me commented 1 year ago

So, I got ingest fully refactored and operational. It functions as expected and now allows users to have more control and granularity via the command line options.

I'm working on the run script now. I'm hoping I'll have that working as expected by tonight.

Then I'll take a look at the API and how my changes affected the UI, if at all.

The refactoring was so deep and consequential that I had to fix the ripple effects as a result.

Everything is up on my dev branch except for the run script which, as I stated before, I'm still working on.

PromtEngineer commented 1 year ago

@teleprint-me can you look at this #173. This is the idea we have for moving forward. There will be a base localGPT class and we can build cli and api applications on top of it. I had a quick look at your dev branch and like the implementation as well. See if we can integrate both of them.

Thanks,

teleprint-me commented 1 year ago

@PromtEngineer, I've reviewed issue #173 and appreciate the direction it's heading in. I've been developing some related ideas on the dev branch as well.

I'd recommend you dive a bit deeper into my dev branch. I've been focusing on making the code more flexible and adaptable. From my experience, too much encapsulation can sometimes box us in.

My code isn't fully polished yet, but if you want to proceed with the changes in issue #173 for now, that's totally fine. Small, consistent changes often work better than one big overhaul.

I created the dev branch as a space to experiment and explore. I've read the docs, watched your YouTube video, and I'm not trying to steer your project off course.

Let me know what aspects you like and what you're aiming for, and I'll do my best to contribute in a way that aligns with the project's goals.

PromtEngineer commented 1 year ago

@teleprint-me I looked at your code and I like your approach as well. I think it will be great to combine both of these to get the best of both approaches.

One thing I wanted to do was to set all parameters in a single place; that's why we put them in config.py. That way the user doesn't have to set them multiple times across ingest.py & run_localGPT.py.

I really liked the way you implemented a separate ModelLoader class. I think if we can integrate it within #173 that will be really great. Looking at your code, I am also inclined toward separating document processing and the registry.

My ultimate goal is not only to make it a tool that enables you to chat with your documents but to be able to build tools on top of it. That's why I liked @LeafmanZ's idea to have a single localGPT class and then build the CLI & API on it. Maybe we can figure out a way to keep #173 more flexible and adaptable.

I am open to suggestions and would love to figure out the architecture that we can build on. We can use this space to discuss. What do you think, @teleprint-me @LeafmanZ?

teleprint-me commented 1 year ago

Ingest script is complete and operational.

__init__.py supersedes constants.py now.

I'm going to finish up implementation of the run.py script and test it today.

Then I'm going to look into supporting GGML, which should address quantized models. Implementing llama.cpp is a priority in that context.

It would address issues #92 and #111 and add support for alternative quantized models.

Not sure if I should use issue #92 to address this or open a new issue for the dev branch, seeing as I've resolved most issues with ingest.py.

Plus, I'm getting ready to look into GPU support via OpenCL for the GGML format, since I'm participating in the GPU thread.

It also helps with research for my genesis outline.

I can open a PR for my dev branch since you seem interested in merging my results. It would save me some time as well instead of starting from scratch again. If not, I can open a new branch, apply the relevant changes, and then make a PR for the branch instead.

LMK.

LeafmanZ commented 1 year ago

Wow, yeah, this is awesome. I think this is all really darn good! Since ingest is also separate from the localGPT class, most of these changes won't be too big of a lift to integrate.

PromtEngineer commented 1 year ago

@teleprint-me @LeafmanZ Here is what I am thinking. To make it easier to maintain and update, rather than combining both codes, we go with the dev branch. We will remove all the 'click' options to make it cleaner and just put them either in config.py/constants.py or __init__.py.

I like the direction in which it's going. Support for the GGML format will be really great.

@teleprint-me please open a PR for the dev branch. Let me know when it's ready to test. I will run it across different platforms.

@LeafmanZ can you please look into bringing in your API and UI changes when we merge this PR.

Thank you to all of you for making this better with every update.

LeafmanZ commented 1 year ago

I agree that we go with the route @teleprint-me has designed.

teleprint-me commented 1 year ago

@PromtEngineer @LeafmanZ

Thank you for the constructive feedback and the openness to consider my contributions to the project.

I've read your thoughts on the CLI options. I truly believe in the value they provide, offering users a greater degree of control and flexibility. I've put considerable work into these, aiming to make them as beneficial as possible for all users.

In response to the desire for centralization, I've actually moved all the options, including the CLI ones, into the __init__.py file. I believe this setup combines extensibility, maintainability, and reusability, reflecting the project's goals.

Since you last viewed the code, there have been additional cleanups and refactoring. The CLI options are now more intuitive and less intrusive.

I'm almost ready to open a PR. Just a few final tweaks are in the pipeline to ensure smooth functionality.

Of course, this is all just my perspective. If there are other aspects you would like me to consider or any further suggestions, I'm more than happy to discuss.

PromtEngineer commented 1 year ago

@teleprint-me I just saw your PR, let me have a look at it.

In terms of the CLI options, I can see them being useful later on, especially if we were to convert this into a standalone package. For most users, it will be more user-friendly to set these parameters in a single place. I agree __init__.py addresses that need.

teleprint-me commented 1 year ago

I'm currently working out some issues I'm running into with AutoGPTQ.

Unfortunately, I haven't been able to get the run.py script to execute. Maybe you guys could take a look.

Keep in mind I'm new to ML and AI and I'm still learning. I literally just learned about safetensors and haven't had the opportunity to look under the hood yet.

I'm learning about all of this stuff because I'm ultimately interested in making local models that can run on low to mid range consumer hardware from scratch. I've still got a lot to learn, but that's my end goal.

I'm talking about making a 10 to 120 million parameter model useful (which seems absurd and far-fetched at the moment, but I believe it can be done).

Anyways, just wanted to let you know it's not ready yet and I'll update you guys as I progress. If you find anything, lmk and I'll see what I can do.

Mathwhizdude commented 1 year ago

Looks good. I'm new as well.

PromtEngineer commented 1 year ago

@teleprint-me I think there is a bug in run.py and __init__.py.

Even if you are not using a quantized model, possible values of model_safetensors are limited to the ones provided in CHOICE_MODEL_SAFETENSORS. If you provide None or don't provide a value, it will use the default and result in an error.

Inside __init__.py, you have the following defaults:

DEFAULT_MODEL_REPOSITORY: str = "TheBloke/WizardLM-7B-V1.0-Uncensored-GGML"
DEFAULT_MODEL_SAFETENSORS: str = (
    "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"
)

But in the readme you highlighted that safetensors should not be used with GGML.

I was testing this on M2 and am running into some other issues as well, but they are probably Apple-specific. For some reason this version runs a lot slower on my machine compared to the previous ones, but the memory usage is very similar. I am still trying to figure out what's causing that. Have you noticed anything like that?

I will spend some more time on it tomorrow to see if I notice anything else.

We are all in the same boat :-) Happy learning!

teleprint-me commented 1 year ago

Yeah, I got ahead of myself because I was preparing to include llama.cpp. I still needed to iron out the ModelLoader class and the constants as you found out.

There's a setting you need to flip for the M2; you're experiencing a common issue. Unfortunately, I can't remember where I read it amidst the swath of docs and code I've found myself in. I made a mental note, but I was doing something else at the time.

That and I'm in the middle of trying to figure out the issues related to bitsandbytes because I just ran into it (again).

I just documented some of the stuff in issue #167 because I'm pretty sure that bitsandbytes is what's causing it.