elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Cannot append fields of type "dense_vector" to an existing index #659

Open walkingmug opened 5 months ago

walkingmug commented 5 months ago

Description: Appending a pandas DataFrame whose column is mapped to "dense_vector" to an existing Elasticsearch index that already has the same field type raises a TypeError.

Reproduction:

  1. Install requirements: pip install elasticsearch eland pandas numpy
  2. Imports:
    from elasticsearch import Elasticsearch
    import eland as ed
    import pandas as pd
    import numpy as np
  3. Connect to Elasticsearch:
    client = Elasticsearch(HOST, timeout=120)
  4. Create vector dataframes:
    vector1 = np.random.rand(512)
    vector2 = np.random.rand(512)
    df_1 = pd.DataFrame({
        'vector_column': [vector1, vector2]
    })

    vector3 = np.random.rand(512)
    vector4 = np.random.rand(512)
    df_2 = pd.DataFrame({
        'vector_column': [vector3, vector4]
    })

  5. ✅ Upload first dataframe:
    # upload df_1 to elasticsearch
    ed.pandas_to_eland(
        pd_df=df_1,
        es_client=client,
        es_dest_index='test-upload',
        es_if_exists="append",
        es_refresh=True,
        es_type_overrides={
            "vector_column": {
                "type": "dense_vector",
                "dims": 512,
                "index": True,
                "similarity": "cosine"
            },
        },
        chunksize=100
    )

  6. ❌ Append second dataframe to first dataframe:
    # upload df_2 to elasticsearch
    ed.pandas_to_eland(
        pd_df=df_2,
        es_client=client,
        es_dest_index='test-upload',
        es_if_exists="append",
        es_refresh=True,
        es_type_overrides={
            "vector_column": {
                "type": "dense_vector",
                "dims": 512,
                "index": True,
                "similarity": "cosine"
            },
        },
        chunksize=100
    )

Error:

    TypeError                                 Traceback (most recent call last)
    in <cell line: 2>()
          1 # upload df_2 to elasticsearch
    ----> 2 ed.pandas_to_eland(
          3     pd_df=df_2,
          4     es_client=client,
          5     es_dest_index='test-upload',

    1 frames
    /usr/local/lib/python3.10/dist-packages/eland/field_mappings.py in verify_mapping_compatibility(ed_mapping, es_mapping, es_type_overrides)
        919         key_type = es_type_overrides.get(key, key_def["type"])
        920         es_key_type = es_props[key]["type"]
    --> 921         if key_type != es_key_type and es_key_type not in ES_COMPATIBLE_TYPES.get(
        922             key_type, ()
        923         ):

    TypeError: unhashable type: 'dict'

pquentin commented 4 months ago

Thanks for your bug report! I liked how it was minimal and easy to reproduce locally, which allowed me to confirm the issue.

What happens is that the first append simply uploads data to a new index, while the second has to check the existing mappings, which hits a different code path. While we should not fail with a TypeError, Eland does not currently support dense_vector, which is the crux of the issue.
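
To illustrate why the second call fails, here is a minimal sketch that mirrors the shape of the check at line 921 of the traceback, without needing Elasticsearch. The contents of `ES_COMPATIBLE_TYPES` below are placeholders, and `override_type` is a hypothetical helper that is not part of Eland; it only shows one way the lookup could avoid the TypeError, not how Eland will actually handle dense_vector.

```python
# Placeholder stand-in for eland's compatibility table (real contents not shown here).
ES_COMPATIBLE_TYPES = {"double": ("float", "half_float")}

# The override from the bug report: a full mapping dict rather than a type string.
es_type_overrides = {
    "vector_column": {
        "type": "dense_vector",
        "dims": 512,
        "index": True,
        "similarity": "cosine",
    }
}

key_type = es_type_overrides.get("vector_column", "object")  # this is a dict
es_key_type = "dense_vector"  # type reported by the existing index mapping

# Same shape as the failing check: a dict is used as a dictionary key.
error_message = None
try:
    key_type != es_key_type and es_key_type not in ES_COMPATIBLE_TYPES.get(key_type, ())
except TypeError as exc:
    error_message = str(exc)


def override_type(override):
    """Hypothetical helper: reduce an override to its type name."""
    if isinstance(override, dict):
        return override["type"]
    return override


# With the override reduced to a string, the comparison works as intended.
normalized = override_type(key_type)
compatible = normalized == es_key_type or es_key_type in ES_COMPATIBLE_TYPES.get(normalized, ())
```

The first upload never reaches this comparison because the index does not exist yet, which is why only the append to an existing index fails.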