arc53 / DocsGPT

Chatbot for documentation that allows you to chat with your data. Privately deployable; provides AI knowledge sharing and integrates knowledge into your AI workflow.
https://app.docsgpt.cloud/
MIT License
15.03k stars, 1.6k forks

🐛 Bug Report: Large zip breaking stream endpoint #859

Open pabik opened 9 months ago

pabik commented 9 months ago

📜 Description

The stream endpoint doesn't provide an answer when a file embedded in the zip archive is long.

👟 Reproduction steps

  1. Upload a zip file (attachment: docs_tester.zip)
  2. Try chatting with it

👍 Expected behavior

DocsGPT should provide an answer.

👎 Actual Behavior with Screenshots

No answer; the stream endpoint breaks. (Screenshot attached.)

💻 Operating system

MacOS

What browsers are you seeing the problem on?

Chrome

🤖 What development environment are you experiencing this bug on?

Docker

🔒 Did you set the correct environment variables in the right path? List the environment variable names (not values please!)

No response

📃 Provide any additional context for the Bug.

No response

📖 Relevant log output

No response

👀 Have you spent some time to check if this bug has been raised before?

🔗 Are you willing to submit PR?

None

🧑‍⚖️ Code of Conduct

dartpain commented 5 months ago

Looks like this happens because the file is not being chunked properly (or at all) when answering, which overloads the context with too many tokens.
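To illustrate the overload, here is a minimal sketch of a context budget check. It assumes tokens can be approximated by whitespace-separated words; `estimate_tokens`, `fits_in_context`, and the 4000-token budget are hypothetical names and values for illustration, not DocsGPT code.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough token estimate: count whitespace-separated words.

    This is an assumption for illustration; a real tokenizer would
    usually report a different (often higher) count.
    """
    return len(text.split())


def fits_in_context(docs: List[str], max_context_tokens: int) -> bool:
    """Return True if all retrieved documents fit in the token budget."""
    return sum(estimate_tokens(d) for d in docs) <= max_context_tokens


# Three properly chunked documents of 200 words each.
small_chunks = ["word " * 200] * 3
# One file that was never chunked: 50,000 words in a single document.
one_huge = ["word " * 50_000]

print(fits_in_context(small_chunks, 4000))  # True
print(fits_in_context(one_huge, 4000))      # False
```

A single unchunked document can exceed the budget on its own, no matter how few documents are retrieved.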

nayelimdejesus commented 5 months ago

Hi, I would like to work on this issue.

nayelimdejesus commented 5 months ago

When you upload a big zip file, what answer should it provide?

dartpain commented 5 months ago

Just shouldn't break. Basically make sure that it doesn't error out. Try running it with the file attached.

jayantp2003 commented 1 month ago

I am interested in working on this issue.

jayantp2003 commented 1 month ago

I was playing around with the zip file and a couple of different files. I found that it's not an issue related to chunking of code; there is some issue with the RstParser class. When I changed the file extensions to text files, it worked fine.

(Two screenshots attached.)

Currently checking the RstParser class to figure out the changes required.

jayantp2003 commented 1 month ago

The issue is with the implementation of the RST parser. In each file, it looks for a header and the text below it, but the zip file we are testing with contains just a single file with no header, so it is not being chunked. This header-and-text breakdown also seems to be an issue for the Markdown parser. The file should be chunked based on tokens or bytes, and the tuple implementation also needs to be updated.
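The failure mode can be reproduced with a simplified header-based splitter. This is a sketch under assumptions, not the actual RstParser code: `split_by_headers` and `UNDERLINE` are hypothetical names, and only underlined RST titles are recognised.

```python
import re
from typing import List, Optional, Tuple

# An RST-style underline: one punctuation character repeated 3+ times.
UNDERLINE = re.compile(r"^([=\-~^#*+])\1{2,}\s*$")


def split_by_headers(text: str) -> List[Tuple[Optional[str], str]]:
    """Split text into (header, body) tuples at underlined titles."""
    lines = text.splitlines()
    sections: List[Tuple[Optional[str], str]] = []
    header: Optional[str] = None
    body: List[str] = []

    def flush() -> None:
        # Emit the current section unless it is completely empty.
        if header is not None or any(line.strip() for line in body):
            sections.append((header, "\n".join(body).strip()))

    i = 0
    while i < len(lines):
        is_header = (
            i + 1 < len(lines)
            and bool(lines[i].strip())
            and UNDERLINE.match(lines[i + 1]) is not None
        )
        if is_header:
            flush()
            header, body = lines[i].strip(), []
            i += 2  # skip the title and its underline
        else:
            body.append(lines[i])
            i += 1
    flush()
    return sections


structured = "Intro\n=====\nhello world\n\nUsage\n-----\nrun it\n"
print(split_by_headers(structured))
# [('Intro', 'hello world'), ('Usage', 'run it')]

# A long file with no headers comes back as one oversized section.
flat = "word " * 10_000
flat_sections = split_by_headers(flat)
print(len(flat_sections), len(flat_sections[0][1].split()))  # 1 10000
```

With no header to split on, the whole file ends up in a single tuple, which matches the behaviour described above.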

dartpain commented 1 month ago

Yeah, seems like that's the issue. Let's add another token size handler to it, maybe?
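One possible shape for such a handler, as a rough sketch: after the header split, any section whose body is over a token limit gets cut into smaller pieces. The name `cap_section_size`, the `Section` alias, and the whitespace-based token count are all assumptions for illustration; a real handler would likely use the model's tokenizer.

```python
from typing import List, Optional, Tuple

# Hypothetical (header, body) shape produced by a header-based parser.
Section = Tuple[Optional[str], str]


def cap_section_size(sections: List[Section], max_tokens: int) -> List[Section]:
    """Split any section whose body exceeds max_tokens into consecutive pieces.

    Tokens are approximated by whitespace-separated words (an assumption).
    Oversized bodies are re-joined with single spaces, so their original
    line breaks are not preserved; small sections pass through unchanged.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    capped: List[Section] = []
    for header, body in sections:
        words = body.split()
        if len(words) <= max_tokens:
            capped.append((header, body))
            continue
        # Each piece keeps the header of the section it came from.
        for start in range(0, len(words), max_tokens):
            piece = " ".join(words[start:start + max_tokens])
            capped.append((header, piece))
    return capped


sections = [(None, "word " * 2500), ("Intro", "short text")]
capped = cap_section_size(sections, max_tokens=1000)
print([len(body.split()) for _, body in capped])  # [1000, 1000, 500, 2]
```

Running this as a post-processing step keeps the header-based split for well-structured files and only changes behaviour for sections that would otherwise be too large.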

jayantp2003 commented 1 month ago

Hey, I have updated the code and created a PR. Can you review it and approve it? I am new to open source contributions and do not know how things work after making a PR. Open to feedback.

jayantp2003 commented 1 month ago

@dartpain Can you review my changes, provide feedback, and approve them if the implementation seems correct?