chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: collection.query RuntimeError: Cannot open header file #872

Open huanggefan opened 1 year ago

huanggefan commented 1 year ago

What happened?

1. init data

  1. chroma_client = chromadb.PersistentClient(path=str(collection_dir))
  2. chroma_collection = chroma_client.get_or_create_collection(name=collection_name)
  3. add some data: chroma_collection.upsert(ids=[id], embeddings=[embedding], documents=[document], metadatas=[metadata])
  4. stop the program

2. query raises a RuntimeError

  1. collection.query(query_embeddings=query_embeddings, n_results=100)

3. remove index_metadata.pickle

  1. rm 64597cda-24ba-4d7a-8fc4-b96f1fc098d9/index_metadata.pickle

4. query again: no RuntimeError

5. restart the program and query: RuntimeError again

6. remove index_metadata.pickle, restart the program, and query: no RuntimeError

So, whenever index_metadata.pickle exists, query raises a RuntimeError.
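Putting the steps above into one minimal script (a rough sketch; collection_dir, the embedding values, and the document text are just placeholders):

import chromadb

collection_dir = "chroma_data"      # placeholder path
collection_name = "my_collection"   # placeholder name
embedding = [0.1] * 1536            # placeholder embedding

# first run: init data, then stop the program
chroma_client = chromadb.PersistentClient(path=str(collection_dir))
chroma_collection = chroma_client.get_or_create_collection(name=collection_name)
chroma_collection.upsert(
    ids=["doc-1"],
    embeddings=[embedding],
    documents=["some document"],
    metadatas=[{"source": "test"}],
)

# second run (new process): this query raises RuntimeError while index_metadata.pickle exists
query_result = chroma_collection.query(query_embeddings=[embedding], n_results=100)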

Versions

chromadb 0.4.2, Python 3.11.4, Windows 11

Relevant log output

Call function: query_result = collection.query(query_embeddings=query_embeddings, n_results=100)

File "python-env\Lib\site-packages\chromadb\api\models\Collection.py", line 223, in query
    return self._client._query(
           ^^^^^^^^^^^^^^^^^^^^
  File "python-env\Lib\site-packages\chromadb\api\segment.py", line 433, in _query
    vector_reader = self._manager.get_segment(collection_id, VectorReader)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python-env\Lib\site-packages\chromadb\segment\impl\manager\local.py", line 112, in get_segment
    instance = self._instance(self._segment_cache[collection_id][scope])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python-env\Lib\site-packages\chromadb\segment\impl\manager\local.py", line 131, in _instance
    instance = cls(self._system, segment)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python-env\Lib\site-packages\chromadb\segment\impl\vector\local_persistent_hnsw.py", line 107, in __init__
    self._init_index(self._dimensionality)
  File "python-env\Lib\site-packages\chromadb\segment\impl\vector\local_persistent_hnsw.py", line 149, in _init_index
    index.load_index(
RuntimeError: Cannot open header file

jeffchuber commented 1 year ago

@huanggefan why are you removing index_metadata.pickle?

huanggefan commented 1 year ago

When index_metadata.pickle is removed, it works; when index_metadata.pickle exists, it raises a RuntimeError.

jeffchuber commented 1 year ago

@huanggefan I see. Our tests haven't been able to pick this up, and we do test this quite a bit.

Here are a few ideas from ChatGPT...


The error traceback you've provided suggests that the program is having trouble opening a file, specifically a header file. This could be due to a number of reasons:

File Doesn't Exist: The file it's trying to open might not exist. Verify that the file is indeed present at the expected location.

Incorrect Path: If the file does exist, make sure the program is looking for it in the right place. The path might be relative or absolute. If it's relative, it's relative to the working directory when you started the program.

File Permissions: The program may not have sufficient permissions to open the file. Check the permissions on the file and ensure that the user running the program has the necessary permissions to read the file.

File is Being Used by Another Process: The file could be locked or being used by another process. Make sure no other processes are using the file when you're trying to run your program.

To resolve this issue, try to manually access the file path the software is trying to use. If you can access it, and the file is there, check if the file is locked or if the process has appropriate permissions to access it. Also, ensure that the file is not currently being used by another process. If all these are fine, the problem might be with the software itself - perhaps it's not handling paths correctly or it's incorrectly formulating the path to the file.

HammadB commented 1 year ago

This error indicates the index did not initialize or save.

Does the file 64597cda-24ba-4d7a-8fc4-b96f1fc098d9/header.bin exist?

By deleting files as you have, you may have corrupted your DB and may need to restart from scratch.

huanggefan commented 1 year ago

I run it locally:

chroma_client = chromadb.PersistentClient(path=str(self.collection_dir))
chroma_collection = chroma_client.get_or_create_collection(name=self.name)

chroma_collection.upsert(ids=[i_d], embeddings=[chunk_embedding], documents=[document_chunk], metadatas=[metadata])
...

chroma_client.stop()

Only index_metadata.pickle exists; there is no header.bin.

@HammadB

HammadB commented 1 year ago

Hmmm, that does not make sense. What version of chroma-hnswlib do you have installed?

huanggefan commented 1 year ago

I found chroma_hnswlib at env\Lib\site-packages\chroma_hnswlib-0.7.1.dist-info, so chroma_hnswlib 0.7.1. @HammadB

huanggefan commented 1 year ago

I tried chroma_hnswlib 0.7.2, but I see the same error: RuntimeError: Cannot open header file. @HammadB

HammadB commented 1 year ago

@huanggefan, does this happen when you use a clean directory? The implication of this error is that somehow the underlying vector index has not been created, but since you have deleted files, the index is in a halfway state. When the folder exists, we assume the index has been created and initialized properly.

I am trying to understand if the error is because of your partially constructed index or because of some deeper issue. Can you share the script you are using to reproduce the issue?

huanggefan commented 1 year ago

I found the key to the problem

  1. when collection_dir is on disk (SSD or HDD), these files do not exist: data_level0.bin, header.bin, length.bin, link_lists.bin (a quick check for them is sketched below)
  2. when collection_dir is in memory (RamDisk), all the *.bin files exist and everything works normally
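
A small script to check which of the persistent index files are present in the segment directory (the segment UUID below is just the example from my earlier comment; substitute your own):

import os

segment_dir = "64597cda-24ba-4d7a-8fc4-b96f1fc098d9"  # replace with your segment UUID folder

expected_files = ["data_level0.bin", "header.bin", "length.bin",
                  "link_lists.bin", "index_metadata.pickle"]
for name in expected_files:
    status = "exists" if os.path.exists(os.path.join(segment_dir, name)) else "MISSING"
    print(f"{name}: {status}")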

@HammadB

huanggefan commented 1 year ago

Here is my code: issues 872

Run it with: python collection/zh_wikipedia_org.py

You need to install tqdm.

@HammadB

HammadB commented 1 year ago

@huanggefan - interesting. I will take a look at your code. But to me this implies something strange with the ability to create and use the files on SSD/HDD vs. other disk types. I have not seen that issue before; many people are running on SSD/HDD just fine. Do you have any insight into your specific platform and why this might be the case?

huanggefan commented 1 year ago

I set collection_dir to a RamDisk and ran the code for a while. After that, I found a clean folder on the SSD, cleared collection_dir, and re-imported the data. Surprisingly, everything is working fine now.

This problem is quite strange now. I'll try restarting the computer a few times and check it out after I finish work.

@HammadB

huanggefan commented 1 year ago

Current versions: chroma-hnswlib 0.7.2, chromadb 0.4.7

When I noticed that the disk was no longer writing data, meaning files like index_metadata.pickle were no longer changing, I encountered this error. I'm not sure if it is related to the issue:

Exception occurred invoking consumer for subscription acc30d643e2541a7bc0eebf1caec76e2to topic persistent://default/default/1ea6b644-e293-4cdf-ae8c-35e709d9ad8d Index with capacity 100 and 100 current entries cannot add 1 records

@HammadB @jeffchuber 

okamura-0422 commented 1 year ago

I experienced this same phenomenon when using Japanese directory names. I solved this problem by using English directory names.
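
For what it's worth, a simple guard against this (assuming the trigger really is non-ASCII characters in the persist path; the helper function is just illustrative, not part of chromadb):

import chromadb

def open_client(path):
    # Refuse non-ASCII (e.g. Japanese) persist paths, which seem to leave the
    # HNSW index files unwritten and lead to "Cannot open header file".
    if not path.isascii():
        raise ValueError(f"Use an ASCII-only directory for the Chroma persist path: {path!r}")
    return chromadb.PersistentClient(path=path)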

HammadB commented 11 months ago

I think we should atomically rename a new index_metadata instead of overwriting. I can only suspect that the file got corrupted.
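
The standard write-to-temp-then-rename pattern would look roughly like this (a sketch outside the chroma codebase; the function name is illustrative):

import os
import pickle
import tempfile

def atomic_write_pickle(obj, dest_path):
    # Write to a temp file in the same directory, flush it to disk, then
    # replace the destination in one step so readers never see a partially
    # written index_metadata.pickle.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, dest_path)  # atomic rename on POSIX; replaces existing file on Windows
    except BaseException:
        os.unlink(tmp_path)
        raise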

OriginalGoku commented 11 months ago

I am having the same problem. I run my code on Google Colab.

# Chroma Location
chroma_location = "chroma_24_nov_2023"
vectorDBlocation = os.path.join("/content/drive/My Drive/", chroma_location)
client = chromadb.PersistentClient(path=vectorDBlocation)

# Embedder
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2", device=processing_device)

# collection_name 
collection_name = 'ABC_Collection'

distance = 'cosine' 

 # Get a collection object from an existing collection, by name. If it doesn't exist, create it.
collection = client.get_or_create_collection(name=collection_name,
                                             embedding_function=sentence_transformer_ef,
                                             metadata={"hnsw:space": distance}) 

# Use this function to add data to the collection
def add_to_collection(ids, text_chunks, extended_meta_data):
    collection.add(
        ids=ids,
        documents=text_chunks,
        metadatas=extended_meta_data
    )

The code was running fine the first time data was added to the collection. When I restarted the kernel and tried to use the system the next morning, I started getting the same error.

For example, collection.peek() returned the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-18-d0cc5020d39d> in <cell line: 1>()
----> 1 collection.peek()

10 frames
/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py in peek(self, limit)
    235             GetResult: A GetResult object containing the results.
    236         """
--> 237         return self._client._peek(self.id, limit)
    238 
    239     def query(

/usr/local/lib/python3.10/dist-packages/chromadb/telemetry/opentelemetry/__init__.py in wrapper(*args, **kwargs)
    125             global tracer, granularity
    126             if trace_granularity < granularity:
--> 127                 return f(*args, **kwargs)
    128             if not tracer:
    129                 return f(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/chromadb/api/segment.py in _peek(self, collection_id, n)
    745     def _peek(self, collection_id: UUID, n: int = 10) -> GetResult:
    746         add_attributes_to_current_span({"collection_id": str(collection_id)})
--> 747         return self._get(collection_id, limit=n)  # type: ignore
    748 
    749     @override

/usr/local/lib/python3.10/dist-packages/chromadb/telemetry/opentelemetry/__init__.py in wrapper(*args, **kwargs)
    125             global tracer, granularity
    126             if trace_granularity < granularity:
--> 127                 return f(*args, **kwargs)
    128             if not tracer:
    129                 return f(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/chromadb/api/segment.py in _get(self, collection_id, ids, where, sort, limit, offset, page, page_size, where_document, include)
    508         if "embeddings" in include:
    509             vector_ids = [r["id"] for r in records]
--> 510             vector_segment = self._manager.get_segment(collection_id, VectorReader)
    511             vectors = vector_segment.get_vectors(ids=vector_ids)
    512 

/usr/local/lib/python3.10/dist-packages/chromadb/telemetry/opentelemetry/__init__.py in wrapper(*args, **kwargs)
    125             global tracer, granularity
    126             if trace_granularity < granularity:
--> 127                 return f(*args, **kwargs)
    128             if not tracer:
    129                 return f(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/manager/local.py in get_segment(self, collection_id, type)
    157         # creates the instance.
    158         with self._lock:
--> 159             instance = self._instance(self._segment_cache[collection_id][scope])
    160         return cast(S, instance)
    161 

/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/manager/local.py in _instance(self, segment)
    186         if segment["id"] not in self._instances:
    187             cls = self._cls(segment)
--> 188             instance = cls(self._system, segment)
    189             instance.start()
    190             self._instances[segment["id"]] = instance

/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py in __init__(self, system, segment)
    117             if len(self._id_to_label) > 0:
    118                 self._dimensionality = cast(int, self._dimensionality)
--> 119                 self._init_index(self._dimensionality)
    120         else:
    121             self._persist_data = PersistentData(

/usr/local/lib/python3.10/dist-packages/chromadb/telemetry/opentelemetry/__init__.py in wrapper(*args, **kwargs)
    125             global tracer, granularity
    126             if trace_granularity < granularity:
--> 127                 return f(*args, **kwargs)
    128             if not tracer:
    129                 return f(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py in _init_index(self, dimensionality)
    162         # Check if index exists and load it if it does
    163         if self._index_exists():
--> 164             index.load_index(
    165                 self._get_storage_folder(),
    166                 is_persistent_index=True,

RuntimeError: Cannot open header file

And when I look at the "chroma_24_nov_2023" directory, I only see "chroma.sqlite3" in the main folder and a single file called "index_metadata.pickle" in the "8c313532-435c-45a4-97dc-46b92f45e71a" directory.

(Screenshots of the directory contents, taken 2023-11-25, were attached here.)

But strangely, when I deleted the "8c313532-435c-45a4-97dc-46b92f45e71a" directory and ran the code again, it worked. I got a new directory called "8c313532-435c-45a4-97dc-46b92f45e71a" (the exact same name as before), but everything works now! This is a serious problem if we want to use this code in production. Any fix?

weili02201 commented 5 months ago

I have the same problem. Is there no solution?

nnnnwinder commented 4 months ago

I also encountered the same problem: there is only one index_metadata.pickle file in the folder. After running a search I get the "cannot open header file" error, and after that I cannot add documents to or search this collection. Has anyone solved it?

weili02201 commented 4 months ago

In my opinion, it's because of a 1000-record limit in the database. If you change the number 1000 to 10000 in all of the .py files, you may solve the problem.
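
If that limit is the cause, a less invasive route than editing the installed .py files may be the per-collection HNSW settings. This is only a sketch and assumes a chromadb version that supports the hnsw:batch_size / hnsw:sync_threshold metadata keys (check the docs for your version):

import chromadb

client = chromadb.PersistentClient(path="chroma_data")  # placeholder path

# Assumed metadata keys; only available in newer chromadb releases.
collection = client.get_or_create_collection(
    name="my_collection",
    metadata={
        "hnsw:batch_size": 100,        # records buffered in memory before being indexed
        "hnsw:sync_threshold": 10000,  # records added before the index is written to disk
    },
)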

nnnnwinder commented 4 months ago

Thank you, but I found the reason: the folder needs to be named in English, otherwise this error occurs. @okamura-0422

nnnnwinder commented 4 months ago

Thank you. I found the reason: the folder needs to be named in English, otherwise this error occurs.

NikLaz25 commented 2 months ago

You can delete the index_metadata.pickle file and everything will work!

import os

# Path to the index_metadata.pickle file
index_metadata_path = PATHDB + '/a4f2f538-d3b2-4906-81b9-de5e77c40d9e/index_metadata.pickle'

# Delete the file if it exists
if os.path.exists(index_metadata_path):
    os.remove(index_metadata_path)
    print(f"The file {index_metadata_path} was deleted.")

NikLaz25 commented 2 months ago

You need to change the folder names to English before the index_metadata.pickle file is automatically created by the database on first access. In short, first delete the file, and then rename the whole path to English-named folders. Then, when index_metadata.pickle is created again on the next access to the database, everything will already be correct, and you won't need to delete it anymore. The original cause of the errors is Russian folder names in the database path.

victorleeasu commented 3 weeks ago

I'm bumping into a similar problem using Colab with Google Drive. [TL;DR] The workaround is to use the older version that has the manual .persist() method: !pip install chromadb==0.3.29

/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py in _init_index(self, dimensionality)
    206         # Check if index exists and load it if it does
    207         if self._index_exists():
--> 208             index.load_index(
    209                 self._get_storage_folder(),
    210                 is_persistent_index=True,

RuntimeError: Cannot open data_level0 file

The first time I make the collection, everything works just fine, but I cannot load the DB again; the file is just not being created. I don't have non-English characters in the path. I'm going to downgrade the chromadb version and see if .persist() can force-create the file.


Update: the old version doesn't support ID strings containing the ' character, so it took a while. But it seems forcing the persist saves everything to my drive as expected. It makes sense, because constant rewrites would be a problem and could corrupt the files.
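
For anyone trying the same workaround, here is a sketch of the old 0.3.x-style client with the manual persist call (written from memory, so double-check against the 0.3.29 docs; the path, names, and IDs are placeholders):

import chromadb
from chromadb.config import Settings

# Pre-0.4 persistence API: duckdb+parquet backend plus an explicit persist() call.
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="/content/drive/My Drive/chroma_db",  # placeholder path
))

collection = client.get_or_create_collection(name="my_collection")
collection.add(
    ids=["id-1"],
    embeddings=[[0.1, 0.2, 0.3]],   # placeholder embedding
    documents=["some text"],
)

client.persist()  # force everything to be written to the drive before the runtime stops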