liuyang77886 opened this issue 1 year ago (status: Open)
@liuyang77886 It seems like it's not able to read your file; it may be an empty file. Can you check that the file is in the right path?
Just FYI, I just got this same error, also on Python 3.10, using device_type cpu. I ran ingest.py seemingly OK. Investigating...
I faced this too. Upon inspection, I found that some of the .pdf files in my SOURCE_DOCUMENTS folder were image-based PDFs rather than text-based PDFs. Ingesting text-based PDFs doesn't hit this issue.
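One quick way to catch this before ingesting is to check how much text a loader actually pulled out of each PDF. This is my own hypothetical helper (the name `looks_image_based` and the 20-character threshold are assumptions, not localGPT code); in ingest.py terms you would feed it the `page_content` that PDFMinerLoader returns per document:

```python
# Hypothetical helper (name and threshold are my own, not localGPT's):
# flag a PDF as likely image-based when the text a loader extracted
# from it is essentially empty.
def looks_image_based(extracted_text: str, min_chars: int = 20) -> bool:
    """True when there is too little extracted text to be a real text layer."""
    return len(extracted_text.strip()) < min_chars

print(looks_image_based(""))   # True: nothing was extracted, likely a scanned PDF
print(looks_image_based("Orca: Progressive Learning from Complex Explanation Traces"))  # False
```

Image-based PDFs need OCR (e.g. via an OCR-capable loader) before they can be chunked and embedded.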
Run `apt-get install ffmpeg libsm6 libxext6 -y` to solve this problem.
When I'm on the CPU, everything works fine; when I switch to the GPU, all sorts of errors. Currently `nvcc -V` reports version 12.1, matching torch 2.1.2+cu121, torchaudio 2.1.2+cu121 and torchvision 0.16.2+cu121, but I still get errors like this:

```
python ingest.py --device_type cuda
/root/anaconda3/envs/localGPT/lib/python3.10/site-packages/langchain/_api/module_import.py:120: LangChainDeprecationWarning: Importing Chroma from langchain.vectorstores is deprecated. Please replace deprecated imports:

>> from langchain.vectorstores import Chroma

with new imports of:

>> from langchain_community.vectorstores import Chroma
  warn_deprecated(
[the same LangChainDeprecationWarning repeats for CSVLoader, PDFMinerLoader, TextLoader, UnstructuredExcelLoader, Docx2txtLoader, UnstructuredFileLoader, UnstructuredMarkdownLoader and UnstructuredHTMLLoader (langchain.document_loaders -> langchain_community.document_loaders), and for HuggingFaceInstructEmbeddings, HuggingFaceBgeEmbeddings and HuggingFaceEmbeddings (langchain.embeddings -> langchain_community.embeddings)]
2024-05-15 03:38:18,606 - INFO - ingest.py:148 - Loading documents from /workspace/localGPT/SOURCE_DOCUMENTS
Importing: Orca_paper.pdf
2024-05-15 03:38:18,614 - INFO - ingest.py:48 - Loading document batch
/workspace/localGPT/SOURCE_DOCUMENTS/Orca_paper.pdf loaded.
2024-05-15 03:38:20,733 - INFO - utils.py:148 - Note: NumExpr detected 20 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-05-15 03:38:20,733 - INFO - utils.py:161 - NumExpr defaulting to 8 threads.
2024-05-15 03:38:20,860 - INFO -
2024-05-15 03:38:20,931 - INFO - ingest.py:157 - Loaded 1 documents from /workspace/localGPT/SOURCE_DOCUMENTS
2024-05-15 03:38:20,932 - INFO - ingest.py:158 - Split into 0 chunks of text
2024-05-15 03:38:21,510 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
/root/anaconda3/envs/localGPT/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
max_seq_length 512
2024-05-15 03:38:22,279 - INFO - ingest.py:169 - Loaded embeddings from hkunlp/instructor-large
Traceback (most recent call last):
  File "/workspace/localGPT/ingest.py", line 183, in
```
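Note the `Split into 0 chunks of text` line in the log above: the splitter produced no chunks, so the later embedding step ends up indexing into an empty list. A fail-fast guard makes the real cause obvious; this is my own sketch (`require_chunks` is not part of ingest.py):

```python
# Hypothetical guard: refuse to build the vector store from zero chunks
# instead of letting the embedder crash on an empty list later.
def require_chunks(texts: list) -> list:
    if not texts:
        raise ValueError(
            "0 chunks produced - the source documents probably contain no "
            "extractable text (e.g. image-only PDFs)."
        )
    return texts

# Usage, roughly where ingest.py builds the DB:
#   texts = text_splitter.split_documents(documents)
#   db = Chroma.from_documents(require_chunks(texts), embeddings, ...)
```

With a guard like this, the error message points at the empty source documents instead of at a line deep inside the embedding library.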
Reinstalling transformers solved that problem, but now I get this error:

```
python ingest.py --device_type cuda
2024-05-15 04:34:22,580 - INFO - ingest.py:148 - Loading documents from /workspace/localGPT/SOURCE_DOCUMENTS
Importing: Orca_paper.pdf
2024-05-15 04:34:22,587 - INFO - ingest.py:48 - Loading document batch
/workspace/localGPT/SOURCE_DOCUMENTS/Orca_paper.pdf loaded.
2024-05-15 04:34:24,691 - INFO -
2024-05-15 04:34:24,756 - INFO - ingest.py:157 - Loaded 1 documents from /workspace/localGPT/SOURCE_DOCUMENTS
2024-05-15 04:34:24,756 - INFO - ingest.py:158 - Split into 0 chunks of text
2024-05-15 04:34:25,314 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
/root/anaconda3/envs/localGPT/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
max_seq_length 512
2024-05-15 04:34:26,132 - INFO - ingest.py:169 - Loaded embeddings from hkunlp/instructor-large
Traceback (most recent call last):
  File "/workspace/localGPT/ingest.py", line 183, in
```
I resolved that with `apt-get install ffmpeg libsm6 libxext6 -y`, and the next error is:

```
python ingest.py --device_type cuda
2024-05-15 04:44:01,619 - INFO - ingest.py:148 - Loading documents from /workspace/localGPT/SOURCE_DOCUMENTS
Importing: Orca_paper.pdf
2024-05-15 04:44:01,626 - INFO - ingest.py:48 - Loading document batch
/workspace/localGPT/SOURCE_DOCUMENTS/Orca_paper.pdf loaded.
2024-05-15 04:44:03,628 - INFO -
LookupError:
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - ''
LookupError:
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
```
But luckily a workaround was found, and now it's working properly! Download nltk_data from https://gitee.com/julyjohn/nltk_data/repository/archive/gh-pages.zip, then:

```
mkdir -p /root/nltk_data/
unzip gh-pages.zip
unzip ./nltk_data-gh-pages/packages/taggers/averaged_perceptron_tagger.zip
cp -r ../nltk_data-gh-pages/packages/taggers /root/nltk_data/
```
If you cannot download from https://huggingface.co/, you can edit the .py file to add:

```python
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
```
`python run_localGPT.py` errored; solved with:

```
pip install bitsandbytes-cuda113
pip uninstall bitsandbytes-cuda113
pip install bitsandbytes
```
Unfortunately, I have the same issue again. Last time `sudo apt-get install ffmpeg libsm6 libxext6 -y` solved it for me, but now it doesn't, and reinstalling bitsandbytes doesn't either.
Is Chinese not supported? I get an error when I ingest a Chinese-language PDF.
Environment info:

```
Python 3.10.11
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
```
Error output:

```
(py310) root@autodl-container-a4da118bfa-64e0dc0a:~/autodl-tmp/localGPT# python ingest.py
2023-06-17 23:03:33,532 - INFO - ingest.py:107 - Loading documents from /root/autodl-tmp/localGPT/SOURCE_DOCUMENTS
2023-06-17 23:03:33,540 - INFO - ingest.py:33 - Loading document batch
2023-06-17 23:03:33,590 - INFO - ingest.py:111 - Loaded 1 documents from /root/autodl-tmp/localGPT/SOURCE_DOCUMENTS
2023-06-17 23:03:33,590 - INFO - ingest.py:112 - Split into 0 chunks of text
2023-06-17 23:03:35,605 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length 512
2023-06-17 23:03:39,222 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-06-17 23:03:39,435 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /root/autodl-tmp/localGPT/DB
2023-06-17 23:03:39,440 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-06-17 23:03:39,444 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-06-17 23:03:39,480 - INFO - duckdb.py:454 - No existing DB found in /root/autodl-tmp/localGPT/DB, skipping load
2023-06-17 23:03:39,480 - INFO - duckdb.py:466 - No existing DB found in /root/autodl-tmp/localGPT/DB, skipping load
Traceback (most recent call last):
  File "/root/autodl-tmp/localGPT/ingest.py", line 140, in <module>
    main()
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/root/autodl-tmp/localGPT/ingest.py", line 126, in main
    db = Chroma.from_documents(
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 446, in from_documents
    return cls.from_texts(
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 414, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 159, in add_texts
    embeddings = self._embedding_function.embed_documents(list(texts))
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/langchain/embeddings/huggingface.py", line 158, in embed_documents
    embeddings = self.client.encode(instruction_pairs, **self.encode_kwargs)
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/InstructorEmbedding/instructor.py", line 524, in encode
    if isinstance(sentences[0], list):
IndexError: list index out of range
2023-06-17 23:03:41,161 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/autodl-tmp/localGPT/DB
(py310) root@autodl-container-a4da118bfa-64e0dc0a:~/autodl-tmp/localGPT#
```
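The `Split into 0 chunks of text` line is the key: an empty list of chunks is passed all the way down to `instructor.py:524`, where `sentences[0]` fails, so the crash is not about the PDF being Chinese but about no text being extracted from it at all. A plain-Python illustration of that failure mode:

```python
# Minimal reproduction: indexing into an empty list is exactly what
# instructor.py:524 does when zero chunks were produced upstream.
sentences = []  # what encode() effectively receives for 0 chunks
try:
    isinstance(sentences[0], list)  # same check as instructor.py:524
except IndexError as exc:
    print(exc)  # prints: list index out of range
```

So the fix is to make sure the PDF has an extractable text layer (OCR it if it is scanned) so the splitter produces more than 0 chunks.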