Filimoa / open-parse

Improved file parsing for LLMs
https://filimoa.github.io/open-parse/
MIT License

add langchain document support #56

priamai opened this issue 1 month ago (status: Open)

priamai commented 1 month ago

Description

Love the project! We need a LangChain Document interface, which I am more than happy to add, but I have a few questions first:

- What is the embedding field for? Will it eventually be filled with an OpenAI embedding vector?
- What are tokens, and how are they calculated? Based on what model, and are you using tiktoken?
- Within each node there is something called Lines; is that basically the text split into detected lines?

Cheers.

Filimoa commented 1 month ago

Great!

The embedding field is used for semantic processing (combining chunks by similarity). Yes, it's a vector from OpenAI; long term it may become provider-agnostic.

Tiktoken is our current method for calculating tokens, since (unfortunately) semantic processing is OpenAI-centric at the moment.

I wouldn't worry about lines: they're used internally to assemble nodes, and once a node is created they're no longer needed.
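
For reference, this is roughly how the OpenAI-based semantic pipeline and tiktoken counting fit together (a sketch based on the project README; the embedding model, token limits, and the cl100k_base encoding are illustrative assumptions, not guaranteed defaults):

import os

import tiktoken
from openparse import DocumentParser, processing

# Semantic processing: nodes are embedded with OpenAI and merged by similarity.
semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model="text-embedding-3-large",  # illustrative choice
    min_tokens=64,
    max_tokens=1024,
)
parser = DocumentParser(processing_pipeline=semantic_pipeline)
parsed_doc = parser.parse("./sample_docs/companies-list.pdf")

# Count tokens per node with tiktoken (cl100k_base assumed here).
enc = tiktoken.get_encoding("cl100k_base")
for node in parsed_doc.nodes:
    print(node.tokens, len(enc.encode(node.text)))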

Feel free to ask anything else!

priamai commented 1 month ago

@Filimoa here's a simple class that is compatible, enjoy:

from typing import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

import openparse

class OpenParseDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """

        parser = openparse.DocumentParser()
        parsed_basic_doc = parser.parse(self.file_path)

        for node in parsed_basic_doc.nodes:
            yield Document(
                page_content=node.text,
                metadata={
                    "tokens": node.tokens,
                    "num_pages": node.num_pages,
                    "node_id": node.node_id,
                    "start_page": node.start_page,
                    "end_page": node.end_page,
                    "source": self.file_path,
                },
            )

Usage:


from OpenTextLoader import OpenParseDocumentLoader  # assuming the class above is saved as OpenTextLoader.py

loader = OpenParseDocumentLoader("./sample_docs/companies-list.pdf")

## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)
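
Since BaseLoader builds load() on top of lazy_load(), the eager interface also works with no extra code (a quick check, nothing project-specific assumed):

docs = loader.load()  # collects lazy_load() output into a list
print(len(docs))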

Feel free to add it to the codebase.

ITHealer commented 4 weeks ago

How do I extract tables and images from a PDF?