fabiomatricardi / GPT4All_Medium

Repo of the code from the Medium article
https://artificialcorner.com/gpt4all-is-the-local-chatgpt-for-your-documents-and-it-is-free-df1016bc335
Creative Commons Zero v1.0 Universal

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data #6

Closed DivyaKasala closed 1 year ago

DivyaKasala commented 1 year ago

While executing the db_loading.py code, I pass a prompt (question) as input, but instead of fetching the answer to the prompt I get "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data". I have already tried manually changing the encoding of the file to UTF-8, and I have also tried text = text.decode(), but neither works. Please help me solve the problem.
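For context, here is a minimal sketch of how this exact error arises and one lenient workaround. The variable names are illustrative only, not taken from db_loading.py:

```python
# The byte 0xe2 begins a 3-byte UTF-8 sequence (the en dash "–" is
# 0xe2 0x80 0x93). "unexpected end of data" means the byte stream was
# cut off mid-character, e.g. by truncating or splitting the raw bytes.
raw = "answer \u2013".encode("utf-8")   # b'answer \xe2\x80\x93'
truncated = raw[:-2]                    # chop the multi-byte dash in half

try:
    truncated.decode("utf-8")           # strict decode raises
except UnicodeDecodeError as err:
    print(err.reason)                   # unexpected end of data

# A lenient decode keeps the readable part instead of raising:
text = truncated.decode("utf-8", errors="replace")
print(text)                             # answer �
```

If a lenient decode (`errors="replace"` or `errors="ignore"`) makes the error disappear, the offending document contains truncated or non-UTF-8 bytes rather than a loader bug.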

fabiomatricardi commented 1 year ago

Ciao, did you try with a different document? Do you get the same error? Please specify your operating system, Python version, and the exact output of pip freeze.

DivyaKasala commented 1 year ago

I am using many documents. If I test it with one document, it works. But all my documents are of the same encoding type.

My operating system: Windows 11
Python: 3.8
pip freeze extract:

aiohttp==3.8.4 aiosignal==1.3.1 altair==4.2.2 anyio==3.6.2 argilla==1.7.0 async-timeout==4.0.2 attrs==23.1.0 backoff==2.2.1 backports.zoneinfo==0.2.1 blinker==1.6.2 cachetools==5.3.0 certifi==2023.5.7 cffi==1.15.1 charset-normalizer==3.1.0 click==8.1.3 colorama==0.4.6 commonmark==0.9.1 cryptography==40.0.2 dataclasses-json==0.5.7 decorator==5.1.1 Deprecated==1.2.13 entrypoints==0.4 et-xmlfile==1.1.0 faiss-cpu==1.7.4 filelock==3.12.0 frozenlist==1.3.3 gitdb==4.0.10 GitPython==3.1.31 greenlet==2.0.2 h11==0.14.0 httpcore==0.16.3 httpx==0.23.3 idna==3.4 importlib-metadata==6.6.0 importlib-resources==5.12.0 Jinja2==3.1.2 joblib==1.2.0 jsonschema==4.17.3 langchain==0.0.149 llama-cpp-python @ file:///C:/Users/kasal/Documents/GPT4ALL_Fabio/llama_cpp_python-0.1.49-cp38-cp38-win_amd64.whl lxml==4.9.2 Markdown==3.4.3 markdown-it-py==2.2.0 MarkupSafe==2.1.2 marshmallow==3.19.0 marshmallow-enum==1.5.1 mdurl==0.1.2 monotonic==1.6 mpmath==1.3.0 msg-parser==1.2.0 multidict==6.0.4 mypy-extensions==1.0.0 networkx==3.1 nltk==3.8.1 numexpr==2.8.4 numpy==1.23.5 olefile==0.46 openai==0.27.7 openapi-schema-pydantic==1.2.4 openpyxl==3.1.2 packaging==23.1 pandas==1.5.3 pdf2image==1.16.3 pdfminer.six==20221105 Pillow==9.5.0 pkgutil_resolve_name==1.3.10 protobuf==3.20.3 pyarrow==12.0.0 pycparser==2.21 pydantic==1.10.8 pydeck==0.8.1b0 Pygments==2.15.1 pygpt4all==1.0.1 pygptj==2.0.3 pyllamacpp==1.0.6 Pympler==1.0.1 pypandoc==1.11 pypdf==3.8.1 PyPDF2==3.0.1 pyrsistent==0.19.3 pytesseract==0.3.10 python-dateutil==2.8.2 python-docx==0.8.11 python-magic==0.4.27 python-pptx==0.6.21 pytz==2023.3 PyYAML==6.0 regex==2023.5.5 requests==2.31.0 rfc3986==1.5.0 rich==13.0.1 sentencepiece==0.1.99 six==1.16.0 smmap==5.0.0 sniffio==1.3.0 SQLAlchemy==2.0.15 streamlit==1.22.0 streamlit-ace==0.1.1 sympy==1.12 tenacity==8.2.2 tiktoken==0.4.0 toml==0.10.2 toolz==0.12.0 torch==2.0.1 tornado==6.3.2 tqdm==4.65.0 typer==0.9.0 typing-inspect==0.9.0 typing_extensions==4.6.1 tzdata==2023.3 tzlocal==5.0.1 unstructured==0.6.5 urllib3==2.0.2 validators==0.20.0 watchdog==3.0.0 wrapt==1.14.1 XlsxWriter==3.1.1 yarl==1.9.2 zipp==3.15.0

fabiomatricardi commented 1 year ago

Ciao, I was hoping that even one document would not work... in that case the problem would have been with the embeddings. But it seems it is with the LangChain loader. Anyway, try the official Sentence Transformers from Hugging Face; do not use the Alpaca embeddings.

from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings 
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

This will download a small set of files for the new embeddings. Let me know how it goes.

fabiomatricardi commented 1 year ago

I also found this: pypdf + PyCryptodome. pypdf is a free and open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. We will be using this library to parse our PDF files. PyCryptodome is another library that helps prevent errors while parsing (encrypted) PDF files.

Source: this Medium article

fabiomatricardi commented 1 year ago


any news?

DivyaKasala commented 1 year ago

There is no issue with the packages used for loading. The issue is after loading: while printing the answer, it could not print a few special characters (UTF-8) to the console. So I manually wrote code to decode it. Thank you.
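For anyone hitting the same console issue, here is one defensive sketch of the kind of fix described. This is a guess at the approach, not DivyaKasala's actual code, and the helper name is made up:

```python
import sys
from typing import Optional

def to_console_safe(text: str, encoding: Optional[str] = None) -> str:
    """Re-encode text with a replacement policy so characters the
    console encoding cannot represent become '?' instead of raising."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)

# On a Windows console whose code page is not UTF-8, printing an en dash
# directly can raise an encoding error; the sanitized string always prints.
print(to_console_safe("answer \u2013 done", encoding="ascii"))  # answer ? done
```

On Python 3.7+ an alternative is `sys.stdout.reconfigure(encoding="utf-8", errors="replace")` once at startup, which makes every subsequent `print` lenient.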

fabiomatricardi commented 1 year ago

Amazing, DivyaKasala! If you want to share the code, I will add it to the troubleshooting section and credit you. Meanwhile, I am closing this issue as resolved.