Elliott-Chong / chatpdf-yt

https://chatpdf-elliott.vercel.app

Langchain Issue? #24

Open Scott-Zeta opened 11 months ago

Scott-Zeta commented 11 months ago

I occasionally find that a file is read incorrectly in the LangChain reading step, while the other PDF files I have work well. So I am not sure whether it is a problem with my configuration or whether the file is too unusual for LangChain to read. I tried some online tools to extract the text from the same file, and they all handled it fine.

The extracted text looks like this (raw string first, then as rendered):

Bupa A\nustralia\nNothing is mor\ne important t\no us than pr\noviding our\nmembers with quality c\norpor\nate health insur\nanc\ne.\n
Bupa A
ustralia
Nothing is mor
e important t
o us than pr
oviding our
members with quality c
orpor
ate health insur
anc
e.

Many words have been split by newlines without any obvious pattern or reason. The newline characters are then converted to spaces in the next step, which could bias the embeddings. If you are interested, here is the link to the original file: https://github.com/Scott-Zeta/chatpdf/tree/API-call-method/attachment
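
To make the concern concrete, here is a minimal Python sketch, independent of the project's own code, that runs the sample string above through two simple cleanups. The function names `newlines_to_spaces` and `drop_newlines` are hypothetical helpers for illustration only. Replacing newlines with spaces leaves word fragments such as "mor e", while deleting the newlines outright glues separate words together, so neither simple cleanup recovers the original text.

```python
# Sample copied from the extraction output quoted in this issue.
raw = (
    "Bupa A\nustralia\nNothing is mor\ne important t\no us than pr\noviding our\n"
    "members with quality c\norpor\nate health insur\nanc\ne.\n"
)


def newlines_to_spaces(text: str) -> str:
    """Replace every newline with a space (the cleanup described above)."""
    return text.replace("\n", " ").strip()


def drop_newlines(text: str) -> str:
    """Delete every newline outright, shown only for comparison."""
    return text.replace("\n", "")


spaced = newlines_to_spaces(raw)
joined = drop_newlines(raw)

print(spaced)  # words stay fragmented, e.g. "mor e important"
print(joined)  # fragments rejoin, but real word breaks vanish, e.g. "ourmembers"
```

Since both variants damage the text, the cleaner fix is probably at the extraction step itself rather than in post-processing.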