OoriData / OgbujiPT

Client-side toolkit for using large language models, including where self-hosted
Apache License 2.0

Strange encoding bug #25

Closed · choccccy closed this issue 1 year ago

choccccy commented 1 year ago

I borrowed some code from the chat_pdf_streamlit_ui.py demo and kept running into this really annoying error that we cannot seem to get around:

Traceback (most recent call last):
  File "/Users/choccy/dev/OoriChat/dindy/dindoid.py", line 214, in <module>
    main()
  File "/Users/choccy/dev/OoriChat/dindy/dindoid.py", line 191, in main
    vectorize_pdfs()
  File "/Users/choccy/dev/OoriChat/dindy/dindoid.py", line 177, in vectorize_pdfs
    kb.update(texts=page_chunks)
  File "/Users/choccy/.local/venv/main/lib/python3.11/site-packages/ogbujipt/embedding_helper.py", line 142, in update
    self._first_update_prep(texts[0])
  File "/Users/choccy/.local/venv/main/lib/python3.11/site-packages/ogbujipt/embedding_helper.py", line 99, in _first_update_prep
    partial_embeddings = self._embedding_model.encode(text)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LookupError: unknown encoding: C O N T E N T S
Preface 4
Introduction 5
Worlds of Adventure...................................................................5 Using This Book.........................................................................6 How to Play...................................................................................6 A d v e n t u r e s .................................................................................... 7
Part1 9

the code in question:

# Prepare a vector knowledgebase based on the pdf contents
# Use st.session_state to avoid unnecessary reprocessing/reloading
pdf_reader = PdfReader(pdf)
text = ''.join((page.extract_text() for page in pdf_reader.pages))
chunks = text_splitter(
    text, 
    chunk_size=EMBED_CHUNK_SIZE,
    chunk_overlap=EMBED_CHUNK_OVERLAP,
    separator='\n')

# Update vector DB collection, insert the text chunks & update app state
kb.update(texts=chunks)

Seems to be some sort of issue with embedding_helper.py or the HF sentence-transformers library. Even after trying to scrub the chunk contents of non-ASCII characters (not shown), we were still running into the LookupError.

Notably, the Streamlit demo still seems fine; I was able to run it on the exact same .pdf files without issue.

uogbuji commented 1 year ago

I put together this notebook to try to encapsulate the issue for repro. Works for me, though (takes over 3 minutes): pdf_encoding_error_investigation.ipynb.zip

uogbuji commented 1 year ago

Capturing a few notes, just in case some sort of mitigation is required.

How we might eliminate control chars, if need be:

import sys, unicodedata, re

# All non-printable (control) characters. Could use walrus op, but meh!
control_chars = ''.join(
    chr(i) for i in range(sys.maxunicode) if unicodedata.category(chr(i)) == 'Cc'
)
# Create regex character class of the above characters
control_char_re = re.compile('[%s]' % re.escape(control_chars))

# Empty string substitution
def remove_control_chars(s):
    return control_char_re.sub('', s)

print(remove_control_chars('\x00\x01String'))
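
If the full-range regex build ever turns out to be a bottleneck on big extractions, here is a minimal alternative sketch: same Cc-category check as above, just via a translation table (strip_control_chars is a name made up for this sketch):

import sys, unicodedata

# Sketch: map every Cc (control) code point to None once, then drop them
# with str.translate instead of a regex substitution
CONTROL_CHAR_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode) if unicodedata.category(chr(i)) == 'Cc'
)

def strip_control_chars(s):
    return s.translate(CONTROL_CHAR_TABLE)

print(strip_control_chars('\x00\x01String'))  # -> String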

PyPDF2 is no longer relevant, as development has returned to the original pypdf. Need to update the demos. Alternatives include pdfplumber, pikepdf, pdfminer.six & PyMuPDF (commercial licensing restrictions with this one).
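
For the demo updates, the extraction part should carry over nearly unchanged; a rough sketch with pypdf (assuming its PdfReader/extract_text API, with 'example.pdf' as a placeholder path):

from pypdf import PdfReader  # pypdf, where development has continued

# Pull all page text from a PDF, mirroring the snippet in the report above
pdf_reader = PdfReader('example.pdf')
text = ''.join(page.extract_text() for page in pdf_reader.pages)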

choccccy commented 1 year ago

Fixed! embedding_helper.py will now check that it is being passed a valid SentenceTransformer object, instead of cryptically throwing an encoding error.
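
For anyone landing here from the same traceback: the LookupError is consistent with a plain model-name string being passed where a SentenceTransformer instance is expected, so self._embedding_model.encode(text) resolves to str.encode, which treats the chunk text as a codec name. A rough sketch of that kind of guard follows (the actual check in embedding_helper.py may differ; the model name below is only an example):

from sentence_transformers import SentenceTransformer

# If _embedding_model is just the model-name string, then
# self._embedding_model.encode(text) becomes str.encode(text), and
# str.encode() treats its argument as a codec name:
#     'all-MiniLM-L6-v2'.encode('C O N T E N T S ...')
#     LookupError: unknown encoding: C O N T E N T S ...

def check_embedding_model(embedding_model):
    # Hypothetical guard: reject anything that isn't a SentenceTransformer,
    # so a model-name string fails loudly up front
    if not isinstance(embedding_model, SentenceTransformer):
        raise ValueError(
            'embedding_model must be a SentenceTransformer instance, '
            f'got {type(embedding_model).__name__}')
    return embedding_model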