PDF formatting help - GitHub issues

TaariqN commented 11 months ago

Hello, I'm not sure if my query should be placed here, so apologies in advance. I am quite new to LLMs and just found out about AnythingLLM (impressive tool, btw). I would like to know whether a PDF needs to be formatted in a specific or recommended way so that AnythingLLM, or the language models themselves, can read and fetch information from it, for example for a Q&A chatbot.

I noticed that with plain text the answers are correct most of the time, but with an uploaded PDF the models might answer something beyond the scope of the PDF, or not find the answer in the PDF, even when I am in query mode. I am unable to figure out whether this is due to the PDF itself or to the fact that I am using open-source models (mistral-7b).

Thanks in advance. Taariq

timothycarambat commented 11 months ago

You can assess whether the PDF content is being parsed at all by opening storage/documents and finding the PDF that was uploaded. We use PyMuPDF, and some PDF formats just will not parse, or fail to parse some page contents, for reasons that are not well documented currently.

What do you find there?

TaariqN commented 11 months ago

Hello, I do see the document in storage/documents/custom-documents as a JSON file, and also in lancedb/<doc-name>.lance. The PDF was uploaded and was embedded too.

For instance, the document had these lines:

How to do an online search?
Go to website name
Scroll down click on search online free of charge (written in red)

In query mode, I was expecting that exact answer when asking the same question. Instead, it gave me a general answer about how to do a normal Google search (not needed in my context).

Kind Regards, Taariq

TaariqN commented 11 months ago

OK, so going through the LocalAI logs, it seems that sometimes the context is not even loaded (in query mode). I think that's why it answers outside the PDF.


```
2023-12-12 13:37:44 You are a PDF / text QnA chatbot. Return only your response to the question given the above information following the users instructions as needed. Load the document before answering. Only answer if the answer is found in the document provided, else reply with "sorry i cant answer that please contact 5XXXXXXXX"
2023-12-12 13:37:44 Context:
2023-12-12 13:37:44     
2023-12-12 13:37:44 <|im_end|>
2023-12-12 13:37:44 
2023-12-12 13:37:44 <|im_start|>user
2023-12-12 13:37:44 how to do an online search
2023-12-12 13:37:44 <|im_end|>
2023-12-12 13:37:44 
2023-12-12 13:37:44 <|im_start|>assistant
2023-12-12 13:37:44
```
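For anyone who wants to check many logged prompts for this symptom, here is a minimal sketch. It assumes the log layout shown above (a timestamp prefix, a `Context:` line, then a closing `<|im_end|>`); the function name `context_is_empty` is hypothetical and not part of LocalAI or AnythingLLM.

```python
import re

# Assumed log layout: "YYYY-MM-DD HH:MM:SS <text>", as in the excerpt above.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ?")


def context_is_empty(log_lines):
    """Return True if only whitespace sits between 'Context:' and '<|im_end|>'.

    Also returns True when no 'Context:' line is present at all.
    """
    in_context = False
    for raw in log_lines:
        # Drop the timestamp prefix, then surrounding whitespace.
        text = TIMESTAMP.sub("", raw.rstrip("\n")).strip()
        if not in_context:
            if text.startswith("Context:"):
                in_context = True
                # Text on the same line as "Context:" counts as context.
                if text[len("Context:"):].strip():
                    return False
            continue
        if text == "<|im_end|>":
            return True
        if text:
            return False
    return True
```

Running this over the lines of the excerpt above should report an empty context, which matches what the log shows.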

timothycarambat commented 11 months ago

Does the JSON file have a pageContent key with readable text in it? By default the workspace has a pretty low matching filter, so it should find similar snippets in the vector DB quite generously. What is yours set at?
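A quick way to run that check without reading the raw file by eye is sketched below. It assumes the parsed document is a single JSON object with the `pageContent` and `token_count_estimate` keys mentioned in this thread; the helper name `summarize_document` and the example path are hypothetical.

```python
import json
from pathlib import Path


def summarize_document(json_path):
    """Load one parsed-document JSON file and report whether it holds text."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    text = data.get("pageContent") or ""
    return {
        "has_text": bool(text.strip()),          # False means nothing was parsed
        "characters": len(text),
        "token_count_estimate": data.get("token_count_estimate"),
        "preview": text[:200],                   # first 200 characters for a sanity check
    }


# Hypothetical usage; substitute the real file name from custom-documents:
# print(summarize_document("storage/documents/custom-documents/my-file.json"))
```

If `has_text` is false or the preview looks garbled, the problem is in parsing rather than in retrieval or the model.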

timothycarambat commented 11 months ago

The fact that the context is empty means some filter is dropping the results, or the search is returning nothing, before the prompt is sent to LocalAI.
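To illustrate how a similarity filter can leave the context empty, here is a toy sketch. It is not AnythingLLM's actual retrieval code: the two-dimensional vectors, the function names, and the threshold values are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_context(query_vec, chunks, threshold):
    """chunks is a list of (text, vector) pairs.

    Keep the chunks scoring at or above the threshold, best match first,
    and join them into the context string that would go into the prompt.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    kept = sorted((item for item in scored if item[0] >= threshold), reverse=True)
    return "\n".join(text for _, text in kept)


# Made-up embeddings for two snippets and one query.
chunks = [("How to do an online search? ...", [1.0, 0.0]),
          ("How to change currency? ...", [0.0, 1.0])]
query = [1.0, 0.1]

print(repr(build_context(query, chunks, threshold=0.5)))     # one snippet survives
print(repr(build_context(query, chunks, threshold=0.9999)))  # nothing survives: empty context
```

With a strict enough threshold every snippet is filtered out, and the prompt ends up with a blank `Context:` section like the one in the log.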

TaariqN commented 11 months ago

Yes, the page content is readable, and I can see that the question about the online search was parsed. In my case the similarity score is the default one.

timothycarambat commented 11 months ago

Just to humor me, can you copy/paste some exact text from the content as a chat? I'm wondering how your question is getting no match from the similarity search; it could be related to vector density.

Roughly how many words or tokens is the page content? It should be in the JSON file. Are you using the default embedding model or a LocalAI-hosted one?

TaariqN commented 11 months ago

Yes, the document is a public one, available at https://companies.govmu.org/Documents/FAQ/FAQs.pdf

In the JSON file, the document looks like this:

\n17. Which documents are required when a company changes its name? \nThe form F2-Change of Name should be submitted along with a copy of a special resolution \npassed by company, duly certified. \n \n18. How to do an online search? \nGo to companies.govmu.org \nScroll down click on search online free of charge (written in red) \n \n \n19. Is it possible to print information on a search online? \nYes, it is possible to print information. \n \n20. Who can apply for removal of company with the Registrar? \nDirector, Secretary, shareholder \n \n21. When must a company file an annual return or a no change return? \nF24-Annual Return must be filed when there is any change in the company otherwise F26-No \nchange Return may be filed. \nIf there has been no change in the company and that the company's turnover is 20 million or \nabove, Form F26-No change Return must be filed together with the financial summary. \n \n22. What are the qualifications required for a secretary to be appointed in a Private or \nPublic company? \nThe qualifications are as mentioned in Section 165 and 198 of the Companies Act 2001. \n \n23. Is it necessary to appoint a secretary? \nNo it is not mandatory to appoint a secretary unless it is a one-person company, i.e., where the \nDirector or the Shareholder is the same person. \n \nThe Company has to appoint a secretary within a period of 6 months. \n \n \n24. How to change currency? \nTo seek approval from ROC by making an application in writing \n \n25. Which documents are required to be filed when company adopts/ alters or revokes a \nconstitution? \nForm F12-Notice of Adoption/Alteration/Revocation of Constitution. \nConstitution or amended Constitution, as the case may be and a legal certificate \na special Resolution (shareholders resolution). \n \n26. Can domestic companies have a corporate director? 
No \n \n \n \n \nFrequently Asked Questions \n \n \n4 \n

"token_count_estimate": 3064. Yes i am using the default embedding models because the localAI hosted one would sometimes give me the "error document could not be embedded" in anythingLLM.

Maybe it's because I am using free LLM models, but these are my only option, as I will be using it for confidential documents if my tests are successful. Maybe you could try it on your end? Thanks :D
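Since the whole FAQ page sits in one block of roughly 3,000 tokens, one thing worth experimenting with is splitting it so that each numbered question and its answer form a separate snippet. The sketch below is only an illustration of that idea, not something AnythingLLM does for you: `split_faq` is a hypothetical helper, and it assumes each entry starts a line with a number, a period, and a space, as in the excerpt above (after `json.loads` has turned the `\n` escapes into real newlines).

```python
import re

# An entry starts a line with a number, a period, and a space, e.g. "18. How to ...".
QUESTION_START = re.compile(r"^\s*(\d+)\.\s+", re.MULTILINE)


def split_faq(page_content):
    """Return a list of (number, text) pairs, one per numbered FAQ entry."""
    matches = list(QUESTION_START.finditer(page_content))
    entries = []
    for i, match in enumerate(matches):
        # An entry runs until the next numbered line, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page_content)
        body = page_content[match.end():end]
        # Collapse the hard line breaks left over from the PDF layout.
        entries.append((int(match.group(1)), " ".join(body.split())))
    return entries


sample = ("\n18. How to do an online search? \nGo to companies.govmu.org \n"
          "Scroll down click on search online free of charge (written in red) \n \n \n"
          "19. Is it possible to print information on a search online? \n"
          "Yes, it is possible to print information. \n")
for number, text in split_faq(sample):
    print(number, text)
```

Note that page footers such as the trailing "Frequently Asked Questions" and page number would be glued onto the last entry of each page, so they would need to be stripped separately.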

TaariqN commented 11 months ago

I just tried running queries on a simple text file (Frequently Asked Questions.txt). Prompt used:

Given the following conversation, relevant context, and a follow up question, reply with an answer to the current question the user is asking. Return only your response to the question given the above information following the users instructions as needed. Only give answer found in the documents, if no answer is available reply with "i dont know"

It does hallucinate sometimes, though. In this example, I did not mention opening hours at all in the text file, so it seems that the model and/or query mode may be drawing not only on my documents but also on its general knowledge. :(

timothycarambat commented 11 months ago

Downloaded the FAQ.pdf.
Uploaded it to a fresh workspace (using GPT-3.5, the default AnythingLLM embedder, and LanceDB as the vector DB). Did not change any other settings.
Sent the same query in both chat and query mode and got the same answer.

TaariqN commented 11 months ago

It seems to be an issue with using free LLMs, I guess. I can't use OpenAI, however, so I guess I'm currently stuck. Thanks for your time 😄

Mintplex-Labs / anything-llm

PDF formatting help #424