Closed: DivyaKasala closed this issue 1 year ago
Hi, did you try with a different document? Are you getting the same errors? Please specify your operating system, Python version, and a `pip freeze` extract.
I am using many documents. If I test it with a single document, it works, but all my documents are of the same encoding type.
My operating system: Windows 11
Python version: 3.8
pip freeze extract:
aiohttp==3.8.4 aiosignal==1.3.1 altair==4.2.2 anyio==3.6.2 argilla==1.7.0 async-timeout==4.0.2 attrs==23.1.0 backoff==2.2.1 backports.zoneinfo==0.2.1 blinker==1.6.2 cachetools==5.3.0 certifi==2023.5.7 cffi==1.15.1 charset-normalizer==3.1.0 click==8.1.3 colorama==0.4.6 commonmark==0.9.1 cryptography==40.0.2 dataclasses-json==0.5.7 decorator==5.1.1 Deprecated==1.2.13 entrypoints==0.4 et-xmlfile==1.1.0 faiss-cpu==1.7.4 filelock==3.12.0 frozenlist==1.3.3 gitdb==4.0.10 GitPython==3.1.31 greenlet==2.0.2 h11==0.14.0 httpcore==0.16.3 httpx==0.23.3 idna==3.4 importlib-metadata==6.6.0 importlib-resources==5.12.0 Jinja2==3.1.2 joblib==1.2.0 jsonschema==4.17.3 langchain==0.0.149 llama-cpp-python @ file:///C:/Users/kasal/Documents/GPT4ALL_Fabio/llama_cpp_python-0.1.49-cp38-cp38-win_amd64.whl lxml==4.9.2 Markdown==3.4.3 markdown-it-py==2.2.0 MarkupSafe==2.1.2 marshmallow==3.19.0 marshmallow-enum==1.5.1 mdurl==0.1.2 monotonic==1.6 mpmath==1.3.0 msg-parser==1.2.0 multidict==6.0.4 mypy-extensions==1.0.0 networkx==3.1 nltk==3.8.1 numexpr==2.8.4 numpy==1.23.5 olefile==0.46 openai==0.27.7 openapi-schema-pydantic==1.2.4 openpyxl==3.1.2 packaging==23.1 pandas==1.5.3 pdf2image==1.16.3 pdfminer.six==20221105 Pillow==9.5.0 pkgutil_resolve_name==1.3.10 protobuf==3.20.3 pyarrow==12.0.0 pycparser==2.21 pydantic==1.10.8 pydeck==0.8.1b0 Pygments==2.15.1 pygpt4all==1.0.1 pygptj==2.0.3 pyllamacpp==1.0.6 Pympler==1.0.1 pypandoc==1.11 pypdf==3.8.1 PyPDF2==3.0.1 pyrsistent==0.19.3 pytesseract==0.3.10 python-dateutil==2.8.2 python-docx==0.8.11 python-magic==0.4.27 python-pptx==0.6.21 pytz==2023.3 PyYAML==6.0 regex==2023.5.5 requests==2.31.0 rfc3986==1.5.0 rich==13.0.1 sentencepiece==0.1.99 six==1.16.0 smmap==5.0.0 sniffio==1.3.0 SQLAlchemy==2.0.15 streamlit==1.22.0 streamlit-ace==0.1.1 sympy==1.12 tenacity==8.2.2 tiktoken==0.4.0 toml==0.10.2 toolz==0.12.0 torch==2.0.1 tornado==6.3.2 tqdm==4.65.0 typer==0.9.0 typing-inspect==0.9.0 typing_extensions==4.6.1 tzdata==2023.3 tzlocal==5.0.1 unstructured==0.6.5 urllib3==2.0.2 validators==0.20.0 watchdog==3.0.0 wrapt==1.14.1 XlsxWriter==3.1.1 yarl==1.9.2 zipp==3.15.0
Hi, I was hoping that even a single document would not work... in that case the problem would have been with the embeddings. But it seems it is with the LangChain loader. Anyway, try the official Sentence Transformers from Hugging Face; do not use the Alpaca embeddings.
from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
This will download a small set of files for the new embeddings. Let me know how it goes.
I also found this: pypdf + PyCryptodome. pypdf is a free and open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. We will be using this library to parse our PDF files. PyCryptodome is another library that helps prevent errors while parsing PDF files.
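Assuming a standard pip setup, both libraries can be installed in one step (these are their published PyPI package names):

```shell
pip install pypdf pycryptodome
```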
Source: this Medium article
Any news?
There is no issue with the packages used for loading. The issue is that, after loading, while printing the answer, a few special UTF-8 characters could not be printed to the console. So I manually wrote code to decode them. Thank you.
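DivyaKasala's exact code was not shared, but a minimal sketch of that kind of fix might look like the following, using only the standard library. The helper name `to_printable` is hypothetical; it decodes raw bytes if needed and replaces any character the console's encoding cannot represent, so `print()` does not raise:

```python
import sys

def to_printable(data, encoding=None):
    """Hypothetical helper: return a string that is safe to print
    on a console with the given (or detected) encoding."""
    # If the answer arrives as raw bytes, decode it explicitly,
    # replacing any byte sequences UTF-8 cannot handle.
    if isinstance(data, bytes):
        data = data.decode("utf-8", errors="replace")
    # Round-trip through the console's encoding so characters the
    # terminal cannot display are replaced instead of raising.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return data.encode(encoding, errors="replace").decode(encoding)

print(to_printable(b"caf\xc3\xa9"))            # bytes in, text out
print(to_printable("\u2014 em dash \u2014"))   # already-decoded text
```

On a UTF-8 console this is a no-op; on a legacy Windows code page it swaps unsupported characters for `?` instead of crashing.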
Amazing, DivyaKasala! If you want to share the code, I will add it to the troubleshooting section, crediting you. Meanwhile, I am closing this issue as resolved.
While executing the db_loading.py code, I pass a prompt as input (a question). Instead of getting the answer to the prompt, I get "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data". I have already tried manually changing the file's encoding to utf-8, and I have also tried text = text.decode(), but neither works. Please help solve the problem.
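For context, that exact error usually means a multi-byte UTF-8 sequence was cut off: 0xe2 is the first byte of a three-byte sequence (the em dash "—" is `b"\xe2\x80\x94"`, for example), and "unexpected end of data" means the stream ended before the remaining bytes arrived. A small sketch reproducing it, and the tolerant decode modes that work around it:

```python
# Encode a string containing an em dash, then cut the bytes
# in the middle of its 3-byte UTF-8 sequence.
truncated = "caf\u00e9 \u2014".encode("utf-8")[:-2]

try:
    truncated.decode("utf-8")              # strict mode raises
except UnicodeDecodeError as exc:
    print(exc.reason)                      # "unexpected end of data"

# Tolerant alternatives: substitute U+FFFD, or drop the broken tail.
print(truncated.decode("utf-8", errors="replace"))  # caf\u00e9 \ufffd
print(truncated.decode("utf-8", errors="ignore"))   # caf\u00e9
```

Note that `errors="replace"` / `errors="ignore"` only mask the symptom; if the bytes are being read in fixed-size chunks, the real fix is to decode after the full read (or use an incremental decoder) so sequences are never split.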