Closed by choccccy 1 year ago
I put together this notebook (pdf_encoding_error_investigation.ipynb.zip) to try to encapsulate the issue for repro. Works for me, tho (takes over 3 mins).
Capturing a few notes, just in case some sort of mitigation is required.
How we might eliminate control chars, if need be:
import re
import sys
import unicodedata

# All control characters (Unicode category 'Cc'). Note this includes
# newline and tab. Could use walrus op, but meh!
control_chars = ''.join(
    chr(i) for i in range(sys.maxunicode + 1) if unicodedata.category(chr(i)) == 'Cc'
)
# Build a regex character class matching any of the above characters
control_char_re = re.compile('[%s]' % re.escape(control_chars))

# Replace every control character with the empty string
def remove_control_chars(s):
    return control_char_re.sub('', s)

print(remove_control_chars('\x00\x01String'))
PyPDF2 is no longer relevant, as development has returned to the original pypdf. Need to update demos. Alternatives include pdfplumber, pikepdf, pdfminer.six & PyMuPDF (commercial licensing restrictions with this one).
fixed! embedding_helper.py will now check that it is being passed a valid SentenceTransformer class object, instead of cryptically throwing an encoding error.
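For anyone hitting something similar, here is a rough sketch of what such a guard could look like. It is not the actual embedding_helper.py code: the function name is hypothetical, and it uses a duck-typed check so it runs without sentence-transformers installed (with the library available, `isinstance(model, SentenceTransformer)` would be the stricter test). It also shows one plausible way a cryptic LookupError can arise, which is an assumption and not confirmed in this thread: if a model name string is passed where a model object is expected, `.encode(text)` resolves to `str.encode`, which treats the text as a codec name.

```python
def check_embedding_model(model):
    """Hypothetical guard, not the actual embedding_helper.py code."""
    # A plain string (e.g. a model name) also has .encode(), but that is
    # str.encode, which interprets its argument as a codec name.
    if isinstance(model, str):
        raise TypeError(
            f'Expected an embedding model object, got the string {model!r}'
        )
    # Duck-typed check so this sketch runs without sentence-transformers.
    if not callable(getattr(model, 'encode', None)):
        raise TypeError(
            f'Expected an object with an encode() method, got {type(model).__name__}'
        )
    return model


# Without the guard, passing a model name by mistake (example name only):
try:
    'all-MiniLM-L6-v2'.encode('some chunk of PDF text')
except LookupError as e:
    print('LookupError:', e)

# With the guard, the mistake is reported up front:
try:
    check_embedding_model('all-MiniLM-L6-v2')
except TypeError as e:
    print('TypeError:', e)
```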
I borrowed some code from the chat_pdf_streamlit_ui.py demo and kept running into this really annoying error that we cannot seem to get around:
The code in question:
Seems to be some sort of issue with embedding_helper.py or the HF sentence-transformers library. Even after trying to scrub the contents of the chunks for non-ASCII characters (not shown), we were still running into the LookupError.
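The scrubbing code itself isn't included here, so the following is only a guess at the kind of thing meant: a minimal, stdlib-only sketch with a hypothetical function name and made-up sample chunks. Worth noting that Python raises LookupError for an unknown codec name, not for characters that cannot be encoded (those raise UnicodeEncodeError), which would be consistent with scrubbing the text not making the error go away.

```python
def scrub_non_ascii(text):
    # Encode to ASCII, silently dropping anything that cannot be represented,
    # then decode back to str.
    return text.encode('ascii', errors='ignore').decode('ascii')


# Made-up sample chunks, for illustration only.
chunks = ['caf\u00e9 menu', 'plain text', 'r\u00e9sum\u00e9']
clean_chunks = [scrub_non_ascii(c) for c in chunks]
print(clean_chunks)
```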
Notably, the Streamlit demo still seems fine; I was able to run it on the exact same .pdf files without issue.