aadarsh-ram / my-gpt-backend

Backend for My-GPT - a local LLM-based application used for Summarization, Chat QA and Grammar Checks along with RabbitMQ

Upload different file types #3

Closed aadarsh-ram closed 10 months ago

aadarsh-ram commented 11 months ago

Currently, only parsing PDFs is supported. Use Langchain (maybe) for loading different file types.

aerial-ace1 commented 11 months ago

What exactly do you mean by loading different file types? Do you want us to work on parsing different file types to text? Also, LangChain is kind of focused on SQL and the like. Do you want that level of input availability, or do we stick to files?

aadarsh-ram commented 11 months ago


We'll stick to files (such as PDF, DOC and TXT), and the backend must be able to parse these different file types.

aerial-ace1 commented 10 months ago

@aadarsh-ram would something like this work, picking the loader based on the file type extracted from the file path?

import os

from langchain.document_loaders import Docx2txtLoader, PyPDFLoader, TextLoader

documents = []
for file in os.listdir("./docs"):
    path = os.path.join("./docs", file)
    # Pick a loader based on the file extension
    if file.endswith(".pdf"):
        loader = PyPDFLoader(path)
    elif file.endswith((".docx", ".doc")):
        # Caveat: Docx2txtLoader relies on docx2txt, which reads .docx;
        # legacy .doc files may need converting first
        loader = Docx2txtLoader(path)
    elif file.endswith(".txt"):
        loader = TextLoader(path)
    else:
        continue  # skip unsupported file types
    documents.extend(loader.load())

aadarsh-ram commented 10 months ago

Yep, that's what I had in mind. But after loading, we need to pass the content to our parser, which cleans some of the data out.
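
A rough sketch of that hand-off, assuming each loaded document exposes its text as `page_content` (as LangChain documents do). The names `clean_text` and `documents_to_text` are hypothetical stand-ins, not the repo's actual parser, and the whitespace cleanup is only an example of what the cleaning step might do:

```python
import re
from types import SimpleNamespace


def clean_text(text: str) -> str:
    """Hypothetical cleaner: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def documents_to_text(documents) -> str:
    """Join the cleaned page_content of each document, skipping empty ones."""
    cleaned = (clean_text(doc.page_content) for doc in documents)
    return "\n".join(part for part in cleaned if part)


# Stand-ins for loader output; real loaders return objects with page_content.
docs = [
    SimpleNamespace(page_content="  Hello \n\n world  "),
    SimpleNamespace(page_content="\t\n"),
    SimpleNamespace(page_content="Second   page"),
]
print(documents_to_text(docs))
```

Keeping the cleaning step separate from the loaders means every file type goes through the same parser once its text has been extracted.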

aerial-ace1 commented 10 months ago

L