InsightEdge01 / Question-AnswerPairGenerator


[Question] 100k PDFs?! #1

Open MatteoRiva95 opened 8 months ago

MatteoRiva95 commented 8 months ago

Hello,

First of all, thank you so much @InsightEdge01 for your work and your YT channel. Your projects are so interesting and look promising :) I have a question: can I use "Question-AnswerPairGenerator" with 100k PDFs? If yes, how can I upload them without clicking the "Browse files" button every time?

Any help would be really appreciated. Thank you in advance!

InsightEdge01 commented 8 months ago

Thank you for your feedback; I genuinely appreciate it! By default, Streamlit limits file uploads to 200 MB per file, although that limit can be raised in the Streamlit server configuration. If you're using Streamlit and want to load files without going through the "Browse files" uploader on the frontend, I have a related video tutorial that demonstrates this process. You can watch Build Your Own Customer Response Generator with Business Knowledge Using Llama2|ALL OPENSOURCE - YouTube https://www.youtube.com/watch?v=aZ44PmGRLkg&t=1273s on YouTube. I hope this helps!
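
For reading a large folder of PDFs directly from disk instead of the frontend uploader, a minimal sketch along these lines could work. This assumes the documents sit in a local pdfs/ folder and that the pypdf package is installed; the folder path and the load_pdf_texts name are placeholders for illustration, not part of the repository:

    # Minimal sketch: read every PDF from a local folder instead of the
    # Streamlit "Browse files" uploader. Assumes `pypdf` is installed and
    # that ./pdfs/ holds the documents (hypothetical path).
    from pathlib import Path
    from pypdf import PdfReader

    def load_pdf_texts(folder: str = "pdfs") -> dict[str, str]:
        texts = {}
        for pdf_path in sorted(Path(folder).glob("*.pdf")):
            reader = PdfReader(pdf_path)
            # Concatenate the text of every page; pages with no extractable
            # text yield an empty string instead.
            texts[pdf_path.name] = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
        return texts

    if __name__ == "__main__":
        docs = load_pdf_texts()
        print(f"Loaded {len(docs)} PDFs")

The resulting texts can then be fed to the question/answer generation step in a loop, so no manual upload is needed for each file.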

MatteoRiva95 commented 8 months ago

@InsightEdge01 Thank you for your kind reply! Yes, I tried to use RAG in order to have a chatbot capable of replying to my questions about the 100k PDFs. Unfortunately, it was really slow: answering a single question took almost 10 minutes :( So I thought about fine-tuning instead, but for that I need a question/answer pair dataset. I used your "Question-AnswerPairGenerator" script (without the Streamlit lines) with only one PDF out of the 100k, but it gave me back this error:


HTTPError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
    285     try:
--> 286         response.raise_for_status()
    287     except HTTPError as e:

24 frames

HTTPError: 429 Client Error: Too Many Requests for url: https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1

The above exception was the direct cause of the following exception:

HfHubHTTPError                            Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
    331         # Convert HTTPError into a HfHubHTTPError to display request information
    332         # as well (request id and/or server error message)
--> 333         raise HfHubHTTPError(str(e), response=response) from e
    334
    335

HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1

I do not know how to continue :(
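
Since the 429 response means the Hugging Face Inference API is rate-limiting the requests, one common workaround is to retry with exponential backoff between calls. Below is a minimal sketch under that assumption; the HF_TOKEN environment variable, the query_with_backoff name, and the generation parameters are placeholders rather than part of the original script, and the model URL is the one shown in the traceback above:

    # Minimal sketch: call the Hugging Face Inference API with retries and
    # exponential backoff so 429 "Too Many Requests" responses do not abort
    # the run. HF_TOKEN and the generation parameters are placeholders.
    import os
    import time
    import requests

    API_URL = "https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1"
    HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

    def query_with_backoff(prompt: str, max_retries: int = 5) -> dict:
        delay = 2.0
        for attempt in range(max_retries):
            response = requests.post(
                API_URL,
                headers=HEADERS,
                json={"inputs": prompt, "parameters": {"max_new_tokens": 512}},
                timeout=120,
            )
            # 429 means the rate limit was hit; wait and try again.
            if response.status_code == 429:
                time.sleep(delay)
                delay *= 2  # exponential backoff
                continue
            response.raise_for_status()
            return response.json()
        raise RuntimeError("Rate limit still hit after retries")

Even with backoff, the free Inference API tier is unlikely to keep up with question/answer generation over 100k PDFs; running the model locally or on a dedicated endpoint and processing the documents in smaller batches would be more realistic at that scale.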