cozodb / cozo

A transactional, relational-graph-vector database that uses Datalog for query. The hippocampus for AI!
https://cozodb.org
Mozilla Public License 2.0

"Corrupted index" error when using the `bind_vector` option in hnsw index search #140

Closed by creatorrr 1 year ago

creatorrr commented 1 year ago

Hey @zh217, I ran into a peculiar bug when querying documents through an HNSW index. Schema, index, and queries below:

Schema:

:create beliefs {
    belief_id: Uuid,
    character_id: Uuid,
    belief: String,
    last_accessed_at: Validity default [floor(now()), true],
    =>
    details: String default "",
    parent_belief_id: Uuid? default null,
    valence: Float default 0,
    aspects: [(String, Float, String, String)] default [],
    belief_embedding: <F32; 768>,
    details_embedding: <F32; 768>,
}

Index:

::hnsw create beliefs:embedding_space {
    dim: 768,
    m: 50,
    dtype: F32,
    fields: [belief_embedding, details_embedding],
    distance: Cosine,
    ef_construction: 20,
    extend_candidates: false,
    keep_pruned_connections: false,
}

This query works as expected (Python f-string, hence the doubled braces):

    ?[belief, valence, dist, character_id] := ~beliefs:embedding_space{{ belief, valence, character_id |
        query: vec({query_embedding}),
        k: {n},
        ef: 20,
        radius: {radius},
        bind_distance: dist
    }}

    :order -valence
    :order dist

But this query, which differs only by the added `bind_vector` option and the extra `vector` output column, fails (Python f-string):

    ?[belief, valence, dist, character_id, vector] := ~beliefs:embedding_space{{ belief, valence, character_id |
        query: vec({query_embedding}),
        k: {n},
        ef: 20,
        radius: {radius},
        bind_distance: dist,
        bind_vector: vector
    }}

    :order -valence
    :order dist
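
For anyone trying to reproduce this, here is a minimal sketch that builds both variants from one template, so the `bind_vector` binding is the only difference between them. It assumes CozoScript's `$name` parameter syntax for the query vector, so the 768 floats do not need to be pasted into the script text. `build_hnsw_query` is a hypothetical helper written for this illustration, not part of pycozo, and `k` and `radius` are still formatted into the script as literals.

```python
def build_hnsw_query(k, radius, bind_vector=False):
    """Build the HNSW search script from this report (hypothetical helper).

    The query vector is left as the named parameter $query_embedding;
    k and radius are written into the script as plain literals.
    """
    head = ["belief", "valence", "dist", "character_id"]
    options = [
        "query: vec($query_embedding)",
        f"k: {int(k)}",
        "ef: 20",
        f"radius: {float(radius)}",
        "bind_distance: dist",
    ]
    if bind_vector:
        # The only difference between the working and the failing query.
        head.append("vector")
        options.append("bind_vector: vector")

    body = ",\n".join(f"    {opt}" for opt in options)
    return (
        f"?[{', '.join(head)}] := "
        "~beliefs:embedding_space{ belief, valence, character_id |\n"
        f"{body}\n"
        "}\n\n"
        ":order -valence\n"
        ":order dist\n"
    )


working_script = build_hnsw_query(5, 0.8)
failing_script = build_hnsw_query(5, 0.8, bind_vector=True)
```

With a pycozo client, either script would then be run as `cozo_client.run(script, {"query_embedding": query_embedding})`, where `query_embedding` is a plain list of floats; the `params` argument is the one visible in the `Client.run` signature in the stack trace below.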

Error:

QueryException: corrupted index

Stacktrace:

---------------------------------------------------------------------------
QueryException                            Traceback (most recent call last)
Cell In[70], line 134
    132 chatml = [{"role": "user", "name": "Diwank", "content": "I like reading sometimes"}]
    133 # to_belief_chatml_msg(get_matching_beliefs(chatml, 0.8))
--> 134 get_matching_beliefs(chatml)

Cell In[70], line 76, in get_matching_beliefs(chatml, confidence, n, k, window_size, character_ids, exclude_roles)
     61 hnsw_query = dedent(f"""
     62 ?[belief, valence, dist, character_id, vector] := ~beliefs:embedding_space{{ belief, valence, character_id |
     63     query: vec({query_embedding}),
   (...)
     72 :order dist
     73 """)
     75 # Embedding seach results
---> 76 hnsw_results = cozo_client.run(hnsw_query)
     78 # Now mix up for character_ids
     79 groups = [
     80     group.tolist()
     81     for _, group
     82     in hnsw_results.groupby(["character_id"])["belief"]
     83 ]

File ~/.cache/pypoetry/virtualenvs/memory-lD2YGiAV-py3.10/lib/python3.10/site-packages/pycozo/client.py:111, in Client.run(self, script, params, immutable)
    104 """Run a given CozoScript query.
    105 
    106 :param script: the query in CozoScript
    107 :param params: the named parameters for the query. If specified, must be a dict with string keys.
    108 :return: the query result as a dict, or a pandas dataframe if the `dataframe` option was true.
    109 """
    110 if self.embedded is None:
--> 111     return self._client_request(script, params, immutable)
    112 else:
    113     return self._embedded_request(script, params, immutable)

File ~/.cache/pypoetry/virtualenvs/memory-lD2YGiAV-py3.10/lib/python3.10/site-packages/pycozo/client.py:82, in Client._client_request(self, script, params, immutable)
     76 r = requests.post(f'{self.host}/text-query', headers=self._headers(), json={
     77     'script': script,
     78     'params': params or {},
     79     'immutable': immutable
     80 })
     81 res = r.json()
---> 82 return self._format_return(res)

File ~/.cache/pypoetry/virtualenvs/memory-lD2YGiAV-py3.10/lib/python3.10/site-packages/pycozo/client.py:86, in Client._format_return(self, res)
     84 def _format_return(self, res):
     85     if not res['ok']:
---> 86         raise QueryException(res)
     88     if self.pandas:
     89         return self.pandas.DataFrame(columns=res['headers'], data=res['rows'])

QueryException: corrupted index

zh217 commented 1 year ago

Thanks for the report! I have reproduced the problem and will be investigating it.

creatorrr commented 1 year ago

@zh217 has this fix landed in 0.7.2?

zh217 commented 1 year ago

It is fixed in the dev branch.

creatorrr commented 1 year ago

Gotcha. I’ll try to build from dev. Please feel free to close this issue.

creatorrr commented 1 year ago

@zh217 is there a timeline for when this will land in the stable binaries?