Embedding and score calculation for PDF summarization

ayoubelmhamdi commented 1 year ago

I hope this message finds you well. I wanted to bring to your attention a potential issue I noticed while exploring your repository.

Specifically, you are using embeddings to get a vector for each chunk of the PDF, then embedding the question and keeping the top_k=3 chunks ranked by the dot product between the question vector and each chunk vector.

This is an interesting approach, but for a summarization request it only retrieves the chunks that talk about summarizing, so what gets sent is the prompt plus 3 chunks from the PDF rather than the whole document. As a result, garbage ends up being sent to ChatGPT.
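To make the concern concrete, here is a minimal sketch of top-k retrieval by dot product. It is not the repository's code: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words from a small, made-up vocabulary), and the chunks and question are invented for illustration.

```python
from collections import Counter


def toy_embed(text, vocab):
    """Hypothetical embedding: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_chunks(question, chunks, vocab, k=3):
    """Return the k chunks whose vectors score highest against the question."""
    q = toy_embed(question, vocab)
    scored = [(dot(q, toy_embed(c, vocab)), i) for i, c in enumerate(chunks)]
    # Highest score first; ties are broken by original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for _, i in scored[:k]]


vocab = ["summarize", "summary", "results", "method", "data"]
chunks = [
    "the method uses gradient descent",
    "this summary table lists the data",
    "results show a large improvement",
    "we summarize related work here",
]
selected = top_k_chunks("summarize the paper and give a summary", chunks, vocab, k=2)
print(selected)
```

In this toy setup the two chunks that mention "summary" or "summarize" are selected, while the chunks describing the method and the results are left out, even though a real summary would need them.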

I just wanted to bring this to your attention in case it was unintentional or if there is a better way to approach this. Thank you for your time and for sharing your work on this repository.

Best regards,

beverm2391 commented 1 year ago

Ayoub, thanks for taking a look at my code! Can you provide the name of the file you're referring to? I'll go take a look.

ayoubelmhamdi commented 1 year ago

Here you set top_k=5: https://github.com/beverm2391/ai-summarizer/blob/600dd69cbd6c3fb83a2397d23fdc610a2efb96ed/package/mongodoc.py#L24

In the end, a prompt consisting of the question and five chunks is sent to ChatGPT. The currently recommended approach for summarizing is the LangChain method: split the PDF into sections that are each as large as the token limit allows, generate a summary of each section, and combine them. This results in a comprehensive summary that covers all parts of the document.
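A rough sketch of that split-then-summarize idea follows. It does not use LangChain's API: all function names are hypothetical, whitespace-separated words stand in for tokens, and `first_words` is a placeholder for the model call that would summarize each section.

```python
def split_by_token_budget(text, max_tokens):
    """Split text into consecutive sections of at most max_tokens words.

    Assumption: words approximate tokens; a real tokenizer would differ.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def summarize_document(text, summarize, max_tokens=2000):
    """Summarize every section, then join the partial summaries in order."""
    sections = split_by_token_budget(text, max_tokens)
    partial_summaries = [summarize(section) for section in sections]
    return "\n".join(partial_summaries)


def first_words(section, n=3):
    """Placeholder summarizer: keeps the first n words of a section."""
    return " ".join(section.split()[:n])


doc = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
summary = summarize_document(doc, first_words, max_tokens=4)
print(summary)
```

Unlike top-k retrieval, every section contributes to the output here, so no part of the document is dropped before summarization.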

beverm2391 / ai-summarizer

Embedding and score calculation for PDF summarization #1