Open ghost opened 9 months ago
@sachinchawla, you are using a relatively old version of Chroma in which Chroma data was stored internally in the container unless you have a custom docker compose file or docker command with mounts. If you are running on Linux, this might not be a problem, but it can be on Windows and Mac, where Docker runs in a VM.
Traceback (most recent call last):
File "/home/richard/book-mentat/src/chroma_info_custom.py", line 43, in <module>
batch = collection.get()
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 211, in get
get_results = self._client._get(
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
return f(*args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/rate_limiting/__init__.py", line 45, in wrapper
return f(self, *args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/segment.py", line 517, in _get
records = metadata_segment.get_metadata(
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
return f(*args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 216, in get_metadata
return list(self._records(cur, q))
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 225, in _records
cur.execute(sql, params)
sqlite3.OperationalError: database or disk is full
The database is 37GB and sits on a drive with 2TB free, so there is plenty of space available. Is there some sort of temp space issue?
chroma 0.2.0 pypi_0 pypi
chroma-hnswlib 0.7.3 pypi_0 pypi
chromadb 0.5.0 pypi_0 pypi
python 3.10.14 hd12c33a_0_cpython conda-forge
This happens when trying to query; the database is still allowing data to go in.
@RichardScottOZ, if you are running in a container, can you run:
docker exec -it <container_name_or_id> df -h /chroma/chroma
Let's see what your container reports as spare disk size.
Hi, thanks. Not running in a container, just installed it on an Ubuntu server.
A note - I thought it could have been the size of the get, so I tried this:
Traceback (most recent call last):
File "/home/richard/book-mentat/src/chroma_info_custom_loop.py", line 46, in <module>
ids_only_result = collection.get(include=[])
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 211, in get
get_results = self._client._get(
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
return f(*args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/rate_limiting/__init__.py", line 45, in wrapper
return f(self, *args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/segment.py", line 517, in _get
records = metadata_segment.get_metadata(
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
return f(*args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 216, in get_metadata
return list(self._records(cur, q))
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 225, in _records
cur.execute(sql, params)
sqlite3.OperationalError: database or disk is full
Is there some sort of integer limit or anything this might hit? It is late, I have not looked at the repo code as yet to try and work it out, will do tomorrow.
I can query a model using an index fine - so it seems like it is a collection information issue, not a db issue.
Hey @RichardScottOZ, thanks for confirming. Let's do the following:
See how much space you have in the persist dir:
df -h /path/to/chroma_persist
Let's check how much space you have in your /tmp, although I'm skeptical sqlite3 uses it:
df -h /tmp
Check the max_page_count of the SQLite database:
sqlite3 /path/to/chroma_persist/chroma.sqlite3 "PRAGMA max_page_count;"
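If it is easier to run the free-space checks from the same Python environment, the following is a minimal sketch using only the standard library. The helper name free_space_gib is made up for illustration, and the persist path in the comment is the same placeholder as in the commands above.

```python
import shutil
import tempfile


def free_space_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


# The system temp directory (usually /tmp on Linux).
tmp_dir = tempfile.gettempdir()
print(f"{tmp_dir}: {free_space_gib(tmp_dir):.1f} GiB free")

# For the Chroma persist dir, pass its real path instead, e.g.
# free_space_gib("/path/to/chroma_persist")
```

This reports the same free-space figure that df shows for the filesystem, just without leaving Python.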
the disk chroma is on has 2.5 TB free, tmp has 8 gb
On the page count: can I check that with sqlite3 from Python?
@RichardScottOZ, if you are on Linux you can install the sqlite3 command-line tool, e.g. for Debian-based distros:
sudo apt update && sudo apt install sqlite3
Then the sqlite3 executable will be in your path. Once installed, you can copy and paste the above example (adjust the path).
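For anyone who would rather not install the CLI, the same PRAGMA can be read through Python's built-in sqlite3 module. This is a sketch, not something Chroma provides: the function name sqlite_limits is hypothetical, and the demo call uses an in-memory database so it runs anywhere.

```python
import sqlite3


def sqlite_limits(db_path):
    """Read page_size, page_count and max_page_count from a SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        max_page_count = conn.execute("PRAGMA max_page_count").fetchone()[0]
    finally:
        conn.close()
    return {
        "page_size": page_size,
        "page_count": page_count,
        "max_page_count": max_page_count,
        # Upper bound on the database file size, in bytes.
        "max_bytes": page_size * max_page_count,
    }


# Demo on an in-memory database; for Chroma, pass the path to chroma.sqlite3.
print(sqlite_limits(":memory:"))
```

Note that sqlite3.connect creates an empty file if the path does not exist, so double-check the path before pointing this at a real persist directory.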
yeah, had never needed it - will take a look
$ sqlite3 /mnt/usb_mount/chroma/Calibre\ Books/chroma.sqlite3 "PRAGMA max_page_count;"
1073741823
quite a big number
@RichardScottOZ, you are right. 1073741823 pages * 4096 bytes per page ~ 4.4TB max size of the sqlite3 file. So the size of your sqlite3 file (37GB) is not a problem and we can rule it out.
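The arithmetic can be checked quickly. The page count is the value reported above and 4096 bytes is the page size quoted in this thread; the actual page size of a given database can be confirmed with PRAGMA page_size.

```python
max_page_count = 1073741823  # value reported by PRAGMA max_page_count above
page_size = 4096             # bytes per page, as quoted in the thread

max_bytes = max_page_count * page_size
print(max_bytes)                    # 4398046507008
print(round(max_bytes / 1e12, 2))   # size in decimal terabytes
print(round(max_bytes / 2**40, 2))  # size in binary tebibytes

db_bytes = 37 * 10**9               # the 37GB database from the report
print(db_bytes < max_bytes)         # the file is far below the limit
```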
Let's examine the nature of your workload now. You said that ingestion is fine, but the query causes an issue. Can you elaborate on your query? Can you share a snippet + how many results do you expect it to return?
When it stopped working, it likely had about 7000 books. I was trying to get the names of all of them, listed in alphabetical order, to see where they were up to.
this is a bit convoluted, but was working previously:
# Fetch everything in the collection in one call (this is the call that fails)
batch = collection.get()
print(len(batch))
for b in batch:
    print(b)
count = 0
file_dict = {}
# Collect the unique file names from the metadata of every record
for x in range(len(batch["documents"])):
    doc = batch["metadatas"][x]
    print(doc['file_name'])
    count += 1
    file_dict[doc['file_name']] = 1
print(count)
print(file_dict)
print(len(file_dict))
# Print the file names in alphabetical order
sorted_dict = dict(sorted(file_dict.items()))
for key in sorted_dict:
    print(key)
print(len(sorted_dict))
@RichardScottOZ, ok I think I understand now what might be the culprit here. SQLite uses temp storage for large result sets. In your case it ends up in /tmp (see https://www.sqlite.org/tempfiles.html). On a 37GB DB, there is a good chance that your collection.get() returns a huge number of results, thus overflowing /tmp storage capacity (hence the error). It is possible to specify the temp path via PRAGMA, but that is a code change in Chroma that we need to consider further.
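One thing that may be worth trying without any Chroma change: the SQLite page linked above documents that on Unix the SQLITE_TMPDIR environment variable is checked before /tmp when choosing where temp files go. The sketch below sets it from Python. The helper name point_sqlite_temp_at is made up, the demo uses a throwaway directory as a stand-in for a folder on the large drive, and this has not been tested against Chroma or the 37GB database in this thread.

```python
import os
import tempfile


def point_sqlite_temp_at(directory):
    """Set SQLITE_TMPDIR so SQLite on Unix looks in `directory` first for temp files.

    Call this before SQLite is first used in the process, i.e. before
    importing chromadb or sqlite3, since SQLite may read the variable once.
    """
    if not os.path.isdir(directory):
        raise NotADirectoryError(directory)
    os.environ["SQLITE_TMPDIR"] = directory
    return directory


# Stand-in directory for the demo; in practice use a folder on the big drive.
demo_dir = tempfile.mkdtemp(prefix="sqlite_tmp_")
point_sqlite_temp_at(demo_dir)
print(os.environ["SQLITE_TMPDIR"])
```

The same effect can be had by exporting SQLITE_TMPDIR in the shell before starting Python, which avoids any import-order concerns.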
In the meantime, can I ask you to try and paginate your collection.get() (see this code snippet for inspiration - https://cookbook.chromadb.dev/core/collections/#cloning-a-collection). Let me know the results.
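As a rough illustration of what pagination could look like for the file-name listing above, here is a sketch that reads the collection page by page using the limit and offset arguments of collection.get() together with collection.count(). The function name collect_file_names and the page size of 1000 are arbitrary choices for the example, and it assumes each record's metadata has a file_name key, as in the script earlier in the thread.

```python
def collect_file_names(collection, page_size=1000):
    """Return the sorted unique file_name values, fetching one page at a time."""
    names = set()
    total = collection.count()
    for offset in range(0, total, page_size):
        # Only metadata is requested, so documents and embeddings are not loaded.
        page = collection.get(
            include=["metadatas"],
            limit=page_size,
            offset=offset,
        )
        for meta in page["metadatas"]:
            if meta and "file_name" in meta:
                names.add(meta["file_name"])
    return sorted(names)
```

Usage would be along the lines of `for name in collect_file_names(collection): print(name)`. Each call to get() then returns at most page_size records, which keeps the size of any single SQLite result set bounded.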
So it is temp space, as considered above. Will try that tomorrow, thanks.
splitting into sizeable chunks worked for the above use anyway, thanks
Hi @tazarov, I am facing the same issue with the code below. Is this fixed yet, or what is the current workaround?
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_nomic.embeddings import NomicEmbeddings

vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local"),
)
retriever = vectorstore.as_retriever()
Here is the output error:
File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py:146, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
144 global tracer, granularity
145 if trace_granularity < granularity:
--> 146 return f(*args, **kwargs)
147 if not tracer:
148 return f(*args, **kwargs)
File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/api/segment.py:445, in SegmentAPI._upsert(self, collection_id, ids, embeddings, metadatas, documents, uris)
434 records_to_submit = list(
435 _records(
436 t.Operation.UPSERT,
(...)
442 )
443 )
444 self._validate_embedding_record_set(coll, records_to_submit)
--> 445 self._producer.submit_embeddings(collection_id, records_to_submit)
447 return True
File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py:146, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
144 global tracer, granularity
145 if trace_granularity < granularity:
--> 146 return f(*args, **kwargs)
147 if not tracer:
148 return f(*args, **kwargs)
File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/db/mixins/embeddings_queue.py:239, in SqlEmbeddingsQueue.submit_embeddings(self, collection_id, embeddings)
236 # The returning clause does not guarantee order, so we need to do reorder
237 # the results. https://www.sqlite.org/lang_returning.html
238 sql = f"{sql} RETURNING seq_id, id" # Pypika doesn't support RETURNING
--> 239 results = cur.execute(sql, params).fetchall()
240 # Reorder the results
241 seq_ids = [cast(SeqId, None)] * len(
242 results
243 ) # Lie to mypy: https://stackoverflow.com/questions/76694215/python-type-casting-when-preallocating-list
OperationalError: database or disk is full
Here is my /tmp space allocation:
Versions
ChromaDB V 0.4.9 Python 3.10