Thank you so much for creating this issue! I'm hoping to fix this within a week at most. But first, please allow me to explain what exactly is happening and my plan to fix it. I'm open to any contributions.
Problems: There are mainly two, but we can only fix one of them, since we're bound by OpenAI's technical limit of 4,097 tokens when using their models (although by modifying the source code it's possible to use other models):
1) The function CharacterTextSplitter doesn't work as it's supposed to, and this is more of a bug than a design issue, so I'll sort it out with another method. No matter how big a file you feed into this app, it splits all the text into smaller chunks; then a FAISS index takes the question you enter, finds the 3 most relevant chunks, and sends those 3 chunks along with your question to OpenAI's text-davinci-003 model by default. See the process overview and the individual .py files in the readme for more detail. The maximum chunk size is 1,000 characters per chunk, so even combined with your question it is unlikely to reach the limit. However, it seems that CharacterTextSplitter created chunks far larger than 1,000 in your case, even though that size is clearly passed as a parameter. I just saw others facing similar problems with this function here. One strong hypothesis is that the function only splits at the nearest line break or other separator character, so a long stretch of text without one comes back as a single oversized chunk (see the sketch below).
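Here's a minimal sketch of that behaviour, assuming the app uses LangChain's CharacterTextSplitter with a newline separator (the parameter values are illustrative, not the app's exact code):

```python
from langchain.text_splitter import CharacterTextSplitter

# A long run of text with no line breaks at all.
text = "word " * 2000  # roughly 10,000 characters, no "\n" anywhere

splitter = CharacterTextSplitter(
    separator="\n",   # split only at line breaks
    chunk_size=1000,  # the intended maximum chunk size
    chunk_overlap=0,
)
chunks = splitter.split_text(text)

# With no "\n" to split at, the whole text comes back as a single
# oversized chunk; chunk_size is not actually enforced.
print(len(chunks), max(len(c) for c in chunks))  # 1 chunk, ~10,000 chars
```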
2) OpenAI's maximum token limit. Unfortunately, we can't do much about this, because it's their model and their limitation. If we switched to another model with a larger context window, we could handle the ~290k tokens in your example, but currently the highest-quality LLMs sit behind the OpenAI API. I also plan to make it easy to switch models from the GUI.
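For reference, here's one way to check whether a prompt will fit under that limit before sending it, using OpenAI's tiktoken library (the fits_in_context helper and the 256-token answer reserve are illustrative assumptions, not part of this app):

```python
import tiktoken

# Encoding used by text-davinci-003.
enc = tiktoken.encoding_for_model("text-davinci-003")

def fits_in_context(question: str, chunks: list[str],
                    max_tokens: int = 4097,
                    reserve_for_answer: int = 256) -> bool:
    """Return True if question + chunks leave room for the completion."""
    prompt = "\n\n".join(chunks) + "\n\n" + question
    return len(enc.encode(prompt)) + reserve_for_answer <= max_tokens

# Three ~1,000-character chunks plus a short question easily fit.
print(fits_in_context("What is this document about?", ["x" * 1000] * 3))
```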
Solution plan:
The first step will be done in a few minutes, and I'll start working on it today. Thanks again @brinrbc. I hope this will help you and others.
Thank you very much for your reply! I tried different documents in different formats; it worked in different languages, but it gave an error on this file.
By the way, even though PyPDF2==3.0.1 is installed automatically from requirements, it did not install on my Mac in a virtual environment;
pip install PyPDF2
did not help. What worked for me was
conda install -c conda-forge pypdf2
Thanks for sharing. I'll test it and share an update here.
About the packages: I've only used a Mac for a short time, so I can't share many details, but the standard way to use pip on macOS is pip3 install PyPDF2,
not pip install ...
The conda approach is great! I strongly suggest creating a fresh conda environment for each project, because packages can conflict with different versions already installed on your system.
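For example, a fresh environment for this project could look like this (the environment name is just an example):

```bash
# Create and use an isolated environment so package versions
# don't clash with whatever is already on the system.
conda create -n pdf-chat python=3.10
conda activate pdf-chat
conda install -c conda-forge pypdf2
```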
Thank you, you made the right point. I tried pip3 install PyPDF2 too. Both options worked and the installation went through, but an error still appeared until I installed it with conda.
Hello @brinrbc! I just fixed the issue; the technical details can be found in this pull request. I've changed the behaviour that splits text into chunks, so it now actually respects the chunk-size limit.
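For anyone hitting the same problem before updating: a common fix for this class of bug (the pull request's exact change may differ) is LangChain's RecursiveCharacterTextSplitter, which keeps falling back to finer separators until every chunk fits the limit. A minimal sketch:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Falls back through separators ("\n\n", "\n", " ", "") until every
# chunk is at or below chunk_size, even if the text has no line breaks.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_text("word " * 2000)  # same input as the sketch above

print(max(len(c) for c in chunks))  # every chunk is <= 1000 characters
```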
Following the solution plan explained above, I reproduced the error using the file you shared and tried again with the same prompt.
Before: (screenshot of the original error)
After: (screenshot of the successful response)
I hope this helps. Please let me know if you face any other issues, or re-open this issue. Thanks again for contributing to this repository!
Hi! I understand what is written here, and I assume the uploaded file gets sent as part of the request, but files can obviously be larger than what fits into a single request. Is there something you can do about that, or will you have to limit incoming files to the size of the input window?