I occasionally find that a PDF file is read incorrectly in the LangChain reading step, while the other PDF files I have work well. So I am not sure whether this is a problem with my configuration or whether the file is too unusual for LangChain to read. I tried some online tools to extract the text from the same file, and they all worked fine.
The extracted text looks like this:
Bupa A\nustralia\nNothing is mor\ne important t\no us than pr\noviding our\nmembers with quality c\norpor\nate health insur\nanc\ne.\n
Bupa A
ustralia
Nothing is mor
e important t
o us than pr
oviding our
members with quality c
orpor
ate health insur
anc
e.
Many words have been divided by newlines with no obvious pattern or reason. The newline characters are then converted to spaces in the next step, so the broken word fragments could bias the embeddings.
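To make the problem concrete, here is a minimal sketch in plain Python that uses only the extracted string shown above. It does not call LangChain; the helper names `newlines_to_spaces` and `drop_newlines` are hypothetical and only stand in for the kind of cleanup a later step might do. It suggests that neither simple cleanup recovers the original wording, because some newlines fall inside words and others fall between words.

```python
# Sample copied from the extracted output shown above.
raw = (
    "Bupa A\nustralia\nNothing is mor\ne important t\no us than pr\noviding our\n"
    "members with quality c\norpor\nate health insur\nanc\ne.\n"
)


def newlines_to_spaces(text: str) -> str:
    """Hypothetical stand-in for the later step that turns newlines into spaces."""
    return text.replace("\n", " ").strip()


def drop_newlines(text: str) -> str:
    """Hypothetical alternative: delete every newline instead."""
    return text.replace("\n", "")


spaced = newlines_to_spaces(raw)
joined = drop_newlines(raw)

# Replacing newlines with spaces leaves word fragments such as "insur anc e."
print(spaced)

# Deleting newlines repairs the split words but glues real word
# boundaries together, for example "ourmembers".
print(joined)

# Number of whitespace-separated tokens that would reach the embedding step.
print(len(spaced.split()))
```

Since the newline positions carry no reliable signal, a fix probably has to happen at extraction time rather than in post-processing.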
If you are interested, here is the link to the original file:
https://github.com/Scott-Zeta/chatpdf/tree/API-call-method/attachment