elastic / elasticsearch-dsl-py

High level Python client for Elasticsearch
http://elasticsearch-dsl.readthedocs.org
Apache License 2.0

[bug] DenseVector bit type support for 8.16 #1946

Open pySilver opened 14 hours ago

pySilver commented 14 hours ago

Here is a valid example of a dense_vector field with element_type = 'bit', where the value is supplied as a hex string:

PUT /my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "element_type": "bit",
        "dims": 256
      }
    }
  }
}

PUT /my-index/_doc/c1.jpg
{"my_vector": "eb80b56a847f4a957fa0b56ac05fdaad16ac6b522d43952cc0de6ab53fa0894a"}

This type became available in the 8.16 release of ES, I believe.
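
To make the relationship between the hex value and `dims` concrete, here is a minimal sketch using only the standard library. It assumes that each byte of the hex string carries 8 bits of the vector; the helper name `bit_vector_dims` is hypothetical and is not part of elasticsearch-dsl.

```python
def bit_vector_dims(hex_value: str) -> int:
    """Return the number of bits encoded by a hex string (8 bits per byte)."""
    raw = bytes.fromhex(hex_value)  # raises ValueError on malformed hex
    return len(raw) * 8


# The document body from the example request above.
example = "eb80b56a847f4a957fa0b56ac05fdaad16ac6b522d43952cc0de6ab53fa0894a"

# 64 hex characters -> 32 bytes -> 256 bits, matching "dims": 256 in the mapping.
assert bit_vector_dims(example) == 256
```

A check like this can be run client-side before indexing to catch a mismatch between the value and the mapping.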

I'm getting a serialization error when trying to use that vector type:

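from elasticsearch_dsl import DenseVector, InnerDoc

class ImageFeatures(InnerDoc):
    # Bit vector: values are sent as a hex string rather than a list of floats.
    phash_vector = DenseVector(
        dims=64,
        element_type="bit",
        required=True,
    )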
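
For context, a value for a `dims=64` bit field would be a 16-character hex string (8 bytes). The sketch below shows one way such a value might be produced from a 64-bit integer hash. The helper `phash_to_hex` is hypothetical, and both the integer representation of the hash and the big-endian byte order are assumptions, not something stated in this issue.

```python
def phash_to_hex(phash: int, dims: int = 64) -> str:
    """Encode a non-negative integer hash as a fixed-width hex string."""
    if dims % 8 != 0:
        raise ValueError("dims must be a multiple of 8")
    if not 0 <= phash < (1 << dims):
        raise ValueError(f"hash does not fit in {dims} bits")
    # Assumption: big-endian byte order, so the hex digits read the same
    # as the integer written in hexadecimal, left-padded with zeros.
    return phash.to_bytes(dims // 8, "big").hex()


assert phash_to_hex(0xEB80B56A847F4A95) == "eb80b56a847f4a95"
assert phash_to_hex(1) == "0000000000000001"
```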

Error (shown when validation is enabled):

File "/Users/Silver/Projects/GitHub/mybaze/mybaze/feeds/services.py", line 705, in products_sync_to_elasticsearch
    return await ProductDocument.bulk(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/_async/document.py", line 521, in bulk
    return await async_bulk(es, Generate(actions), **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch/_async/helpers.py", line 346, in async_bulk
    async for ok, item in async_streaming_bulk(
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch/_async/helpers.py", line 237, in async_streaming_bulk
    async for bulk_data, bulk_actions in _chunk_actions(
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch/_async/helpers.py", line 79, in _chunk_actions
    async for action, data in actions:
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch/_async/helpers.py", line 225, in map_actions
    async for item in aiter(actions):
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/_async/document.py", line 515, in __anext__
    doc.full_clean()
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/utils.py", line 642, in full_clean
    self.clean_fields(validate=False)
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/utils.py", line 628, in clean_fields
    data = field.clean(data)
           ^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 264, in clean
    data = super().clean(data)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 148, in clean
    data = self.deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 138, in deserialize
    None if d is None else self._deserialize(d)
                           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 249, in _deserialize
    return self._wrap(data)
           ^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 226, in _wrap
    return self._doc_class.from_es(data, data_only=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/document_base.py", line 379, in from_es
    return super().from_es(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/utils.py", line 561, in from_es
    doc._from_dict(data)
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/utils.py", line 568, in _from_dict
    v = f.deserialize(v)
        ^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 144, in deserialize
    return self._deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 249, in _deserialize
    return self._wrap(data)
           ^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 226, in _wrap
    return self._doc_class.from_es(data, data_only=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/document_base.py", line 379, in from_es
    return super().from_es(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/utils.py", line 561, in from_es
    doc._from_dict(data)
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/utils.py", line 568, in _from_dict
    v = f.deserialize(v)
        ^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 144, in deserialize
    return self._deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Silver/Projects/GitHub/mybaze/.venv/lib/python3.12/site-packages/elasticsearch_dsl/field.py", line 389, in _deserialize
    return float(data)
           ^^^^^^^^^^^

miguelgrinberg commented 7 hours ago

Ah, yes. The DenseVector class in this package is designed to represent a list of floating point numbers, so it isn't going to work with anything else.
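
The last frame of the traceback above ends in `float(data)`. As a minimal illustration in plain Python, with no elasticsearch-dsl involved, and assuming `data` is the hex string from the report, that conversion cannot succeed. The helper `parses_as_float` is hypothetical.

```python
def parses_as_float(value: str) -> bool:
    """Return True if float() accepts the string, False if it raises ValueError."""
    try:
        float(value)
    except ValueError:
        return False
    return True


hex_value = "eb80b56a847f4a957fa0b56ac05fdaad16ac6b522d43952cc0de6ab53fa0894a"

assert parses_as_float("0.25")          # an ordinary float element converts fine
assert not parses_as_float(hex_value)   # a hex-encoded bit vector does not
```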

Let me think about how best to represent the new dense vector types. We may need to add a separate class for them, since the type definitions in this package aren't as flexible as the ones Elasticsearch uses server-side.