[Bug]: sqlite3.OperationalError: database or disk is full

ghost commented 9 months ago

What happened?

What Happened:

Encountered an error with a SQLite database in a Docker container environment.
The error message was sqlite3.OperationalError: database or disk is full.
This issue occurred despite the host machine having sufficient disk space.
The SQLite database file size was found to be approximately 4.1 GB.
The Docker container settings and host machine settings were checked for potential causes of the error.

Expected Behavior:

The SQLite database should operate without encountering a 'disk is full' error, especially considering that the host machine had adequate disk space.
Given the size of the SQLite file (4.1 GB) and the typical capabilities of SQLite and the Docker environment, normal database operations such as data insertion, updating, and querying were expected to occur without errors related to disk space.
The expectation was that the Docker container's configuration and the host system's file system would support the operation of a database of this size without triggering disk space-related errors.

Versions

ChromaDB 0.4.9, Python 3.10

Relevant log output

sqlite3.OperationalError: database or disk is full
INFO:     [02-02-2024 04:10:33] 3.131.62.47:40862 - "POST /api/v1/collections/559a54f0-9471-48af-98af-4d19c5fbd2db/add HTTP/1.1" 500
INFO:     [02-02-2024 04:10:33] 3.131.62.47:40862 - "POST /api/v1/collections/559a54f0-9471-48af-98af-4d19c5fbd2db/query HTTP/1.1" 200
ERROR:    [02-02-2024 04:10:34] database or disk is full

tazarov commented 9 months ago

@sachinchawla, you are using a relatively old version of Chroma in which Chroma data was stored inside the container unless you have a custom docker compose file or docker command with mounts. If you are running on Linux, this might not be a problem, but it can be on Windows and Mac, where Docker runs in a VM.

RichardScottOZ commented 6 months ago

Traceback (most recent call last):
  File "/home/richard/book-mentat/src/chroma_info_custom.py", line 43, in <module>
    batch = collection.get()
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 211, in get
    get_results = self._client._get(
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
    return f(*args, **kwargs)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/rate_limiting/__init__.py", line 45, in wrapper
    return f(self, *args, **kwargs)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/segment.py", line 517, in _get
    records = metadata_segment.get_metadata(
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
    return f(*args, **kwargs)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 216, in get_metadata
    return list(self._records(cur, q))
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 225, in _records
    cur.execute(sql, params)
sqlite3.OperationalError: database or disk is full

The database is 37GB and is on a drive with 2TB free, so there is plenty of room available. Is there some sort of temp space problem?

chroma                    0.2.0                    pypi_0    pypi
chroma-hnswlib            0.7.3                    pypi_0    pypi
chromadb                  0.5.0                    pypi_0    pypi
python                    3.10.14         hd12c33a_0_cpython    conda-forge

RichardScottOZ commented 6 months ago

This is on trying to query - database is still allowing data to go in.

tazarov commented 6 months ago

@RichardScottOZ, if you are running in a container, can you run:

docker exec -it <container_name_or_id> df -h /chroma/chroma

Let's see what your container reports as spare disk size.

RichardScottOZ commented 6 months ago

Hi, thanks. Not running in a container, just installed it on an Ubuntu server.

A note - I thought it could have been the size of the get, so I tried this:

Traceback (most recent call last):
  File "/home/richard/book-mentat/src/chroma_info_custom_loop.py", line 46, in <module>
    ids_only_result = collection.get(include=[])
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 211, in get
    get_results = self._client._get(
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
    return f(*args, **kwargs)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/rate_limiting/__init__.py", line 45, in wrapper
    return f(self, *args, **kwargs)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/segment.py", line 517, in _get
    records = metadata_segment.get_metadata(
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
    return f(*args, **kwargs)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 216, in get_metadata
    return list(self._records(cur, q))
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 225, in _records
    cur.execute(sql, params)
sqlite3.OperationalError: database or disk is full

Is there some sort of integer limit or anything this might hit? It is late, I have not looked at the repo code as yet to try and work it out, will do tomorrow.

I can query a model using an index fine - so it seems like it is a collection information issue, not a db issue.

tazarov commented 6 months ago

Hey @RichardScottOZ, thanks for confirming. Let's do the following:

See how much space you have in persist dir:

df -h /path/to/chroma_persist

Let's check how much space you have in your /tmp although I'm skeptical sqlite3 uses it:

df -h /tmp

Check the max_page_count of the SQLite database:

sqlite3 /path/to/chroma_persist/chroma.sqlite3 "PRAGMA max_page_count;"
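
For anyone without the sqlite3 command-line tool installed, the same three checks can probably be done from Python's standard library. The following is only a sketch: the function name sqlite_capacity_report and its return keys are made up for illustration, and the database is opened read-only so the check itself cannot modify it.

```python
import pathlib
import shutil
import sqlite3


def sqlite_capacity_report(db_path, tmp_dir="/tmp"):
    """Collect free-space and page-limit figures for one SQLite file.

    Hypothetical helper, not part of Chroma.
    """
    db_file = pathlib.Path(db_path).resolve()
    # as_uri() percent-encodes spaces, so paths like "Calibre Books" work.
    conn = sqlite3.connect(db_file.as_uri() + "?mode=ro", uri=True)
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        max_page_count = conn.execute("PRAGMA max_page_count").fetchone()[0]
    finally:
        conn.close()
    return {
        "persist_dir_free_bytes": shutil.disk_usage(db_file.parent).free,
        "tmp_free_bytes": shutil.disk_usage(tmp_dir).free,
        "page_size": page_size,
        "current_bytes": page_count * page_size,
        "max_bytes": max_page_count * page_size,
    }
```

Calling sqlite_capacity_report("/path/to/chroma_persist/chroma.sqlite3") should give roughly the information that df -h and the PRAGMA query print, in bytes.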

RichardScottOZ commented 6 months ago

the disk chroma is on has 2.5 TB free, tmp has 8 gb

RichardScottOZ commented 6 months ago

on page count sqlite3 python?

tazarov commented 6 months ago

@RichardScottOZ, if you are on Linux you can install the sqlite3 command-line tool, e.g. on Debian-based distros: sudo apt update && sudo apt install sqlite3. The sqlite3 executable will then be on your path. Once installed, you can copy and paste the above example (adjust the path).

RichardScottOZ commented 6 months ago

yeah, had never needed it - will take a look

RichardScottOZ commented 6 months ago

$ sqlite3 /mnt/usb_mount/chroma/Calibre\ Books/chroma.sqlite3 "PRAGMA max_page_count;"
1073741823

quite a big number

tazarov commented 6 months ago

@RichardScottOZ, you are right. 1073741823 pages * 4096 bytes per page ~ 4.4TB max size of the sqlite3 file. So the size of your sqlite3 file (37GB) is not a problem and we can rule it out.
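
The arithmetic above can be reproduced in a few lines. The page size of 4096 bytes is an assumption (it is the usual SQLite default and was not queried in this thread); PRAGMA page_size would confirm it.

```python
max_page_count = 1073741823  # value reported by PRAGMA max_page_count above
page_size = 4096             # assumed default; confirm with PRAGMA page_size

max_bytes = max_page_count * page_size
print(max_bytes)                        # 4398046507008
print(f"{max_bytes / 1e12:.1f} TB")     # 4.4 TB
print(f"{max_bytes / 2**40:.1f} TiB")   # 4.0 TiB
```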

Let's examine the nature of your workload now. You said that ingestion is fine, but the query causes an issue. Can you elaborate on your query? Can you share a snippet + how many results do you expect it to return?

RichardScottOZ commented 6 months ago

When it stopped working, it likely had 7000 books? I was trying to get the names of all of them to list in alphabetical order, to see where they were up to.

this is a bit convoluted, but was working previously:

batch = collection.get()

print(len(batch))

for b in batch:
    print(b)

count = 0
file_dict = {}
for x in range(len(batch["documents"])):
    doc = batch["metadatas"][x]
    print(doc['file_name'])
    count += 1
    file_dict[doc['file_name']] = 1

print(count)    

print(file_dict)
print(len(file_dict))

sorted_dict = dict(sorted(file_dict.items()))
for key in sorted_dict:
    print(key)

print(len(sorted_dict))

tazarov commented 6 months ago

@RichardScottOZ, ok I think I understand now what might be the culprit here. SQLite uses temp storage for large result sets. In your case it ends up in /tmp (see https://www.sqlite.org/tempfiles.html). On a 37GB DB, there is a good chance that your collection.get() returns a huge number of results, thus overflowing /tmp storage capacity (hence the error). It is possible to specify the temp path via PRAGMA, but that is a code change in Chroma that we need to consider further.
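
A possible workaround that needs no change in Chroma: the SQLite page linked above says that on Unix the SQLITE_TMPDIR environment variable is consulted when choosing the temp directory. The sketch below is untested against Chroma itself and uses plain sqlite3 with a made-up table; the scratch directory is a placeholder for a location on a volume with enough free space. The variable is set before sqlite3 is imported because SQLite may read it only once, at initialization.

```python
import os
import tempfile

# Placeholder scratch directory; in practice this would be a directory on a
# volume with plenty of free space, such as the drive holding the database.
scratch_dir = tempfile.mkdtemp(prefix="sqlite-scratch-")

# Set as early as possible, before sqlite3 (or chromadb) is imported.
os.environ["SQLITE_TMPDIR"] = scratch_dir

import sqlite3  # noqa: E402  (deliberately imported after setting the variable)

conn = sqlite3.connect(":memory:")
# Ask SQLite to use files rather than memory for temporary storage, so a
# large sort may spill into scratch_dir instead of /tmp.
conn.execute("PRAGMA temp_store = FILE")
conn.execute("CREATE TABLE books (file_name TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?)",
    [(f"book-{i:05d}.epub",) for i in range(10000, 0, -1)],
)
rows = conn.execute("SELECT file_name FROM books ORDER BY file_name").fetchall()
conn.close()
```

Exporting SQLITE_TMPDIR in the shell before starting Python should have the same effect and avoids the unusual import order.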

In the meantime, can I ask you to try and paginate your collection.get() (see this code snippet for inspiration - https://cookbook.chromadb.dev/core/collections/#cloning-a-collection). Let me know the results.
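
A minimal sketch of what that pagination could look like for the file-name listing above, assuming the limit, offset and include parameters of collection.get() and collection.count(). The helper names and the batch size of 5000 are arbitrary choices for illustration, not something taken from the cookbook.

```python
def iter_metadatas(collection, batch_size=5000):
    """Yield metadata dicts from a Chroma collection one page at a time."""
    total = collection.count()
    for offset in range(0, total, batch_size):
        page = collection.get(
            include=["metadatas"],  # skip documents and embeddings
            limit=batch_size,
            offset=offset,
        )
        yield from page["metadatas"]


def sorted_file_names(collection, batch_size=5000):
    """Return the distinct 'file_name' metadata values in sorted order."""
    names = set()
    for metadata in iter_metadatas(collection, batch_size):
        # Records stored without metadata may come back as None.
        if metadata and "file_name" in metadata:
            names.add(metadata["file_name"])
    return sorted(names)
```

Each get() call then returns at most batch_size rows, so the intermediate result SQLite has to build stays small no matter how large the collection is.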

RichardScottOZ commented 6 months ago

So temp space as considered above. Will try the above tomorrow thanks.

RichardScottOZ commented 6 months ago

splitting into sizeable chunks worked for the above use anyway, thanks

jvel07 commented 2 months ago

Hi @tazarov, I am facing the same issue with the code below. Is this fixed yet, or what is the current workaround?

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_nomic.embeddings import NomicEmbeddings

vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local"),
)
retriever = vectorstore.as_retriever()

Here is the output error:

File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py:146, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    144 global tracer, granularity
    145 if trace_granularity < granularity:
--> 146     return f(*args, **kwargs)
    147 if not tracer:
    148     return f(*args, **kwargs)

File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/api/segment.py:445, in SegmentAPI._upsert(self, collection_id, ids, embeddings, metadatas, documents, uris)
    434 records_to_submit = list(
    435     _records(
    436         t.Operation.UPSERT,
   (...)
    442     )
    443 )
    444 self._validate_embedding_record_set(coll, records_to_submit)
--> 445 self._producer.submit_embeddings(collection_id, records_to_submit)
    447 return True

File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py:146, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    144 global tracer, granularity
    145 if trace_granularity < granularity:
--> 146     return f(*args, **kwargs)
    147 if not tracer:
    148     return f(*args, **kwargs)

File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/db/mixins/embeddings_queue.py:239, in SqlEmbeddingsQueue.submit_embeddings(self, collection_id, embeddings)
    236 # The returning clause does not guarantee order, so we need to do reorder
    237 # the results. https://www.sqlite.org/lang_returning.html
    238 sql = f"{sql} RETURNING seq_id, id"  # Pypika doesn't support RETURNING
--> 239 results = cur.execute(sql, params).fetchall()
    240 # Reorder the results
    241 seq_ids = [cast(SeqId, None)] * len(
    242     results
    243 )  # Lie to mypy: https://stackoverflow.com/questions/76694215/python-type-casting-when-preallocating-list

OperationalError: database or disk is full

Here is my /tmp space allocation:
