alteryx / woodwork

Woodwork is a Python library that provides robust methods for managing and communicating data typing information.
https://woodwork.alteryx.com
BSD 3-Clause "New" or "Revised" License

Define `Embedding` logical type for vector data #1871

Open Rhett-Ying opened 1 month ago


Hi,

Is it possible to add a user-defined `Embedding` logical type to Woodwork (`ww`) to represent vector data? I tried the code below, but it failed.

import pandas as pd
import numpy as np
import woodwork as ww
from woodwork.logical_types import LogicalType

class Embedding(LogicalType):
    primary_dtype = 'object'
    standard_tags = {'embedding', 'numeric'}

ww.type_system.add_type(Embedding)

df = pd.DataFrame(
    {
        "id": [0, 1, 2, 3],
        "code": ["012345412359", "122345712358", "012345412359", "012345412359"],
        'embedding_0': [np.array([1, 2, 3]), np.array([2, 3, 4]), np.array([3, 4, 5]), np.array([4, 5, 6])],
        'embedding_1': [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]],
    }
)

with ww.config.with_options():
    df.ww.init()
df.ww
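
For context, Woodwork's `add_type` also accepts an optional `inference_function` argument, and the snippet above does not pass one, which may be part of why the new type is not picked up during inference. Below is a rough sketch of what such a check could look like, written in plain Python so it runs on its own. The name `looks_like_embedding` and the `min_dim` parameter are hypothetical, not part of Woodwork, and I have not confirmed that Woodwork handles object-dtype columns of vectors even when a check like this is registered.

```python
from numbers import Real
from typing import Iterable


def looks_like_embedding(values: Iterable[object], min_dim: int = 1) -> bool:
    """Return True if every non-missing value is a numeric vector of one shared length.

    Hypothetical helper for illustration only; not part of Woodwork.
    """
    expected_len = None
    for value in values:
        # Skip missing entries: None, or a float NaN (NaN is the only float != itself).
        if value is None or (isinstance(value, float) and value != value):
            continue
        # Strings are iterable too, but they are not vectors.
        if isinstance(value, (str, bytes)):
            return False
        try:
            items = list(value)
        except TypeError:
            # Scalars such as plain ints are not iterable, so not vectors.
            return False
        if len(items) < min_dim:
            return False
        # Every element must be a real number; bool is excluded on purpose.
        if not all(isinstance(x, Real) and not isinstance(x, bool) for x in items):
            return False
        if expected_len is None:
            expected_len = len(items)
        elif len(items) != expected_len:
            return False
    # A column with no usable values is not treated as an embedding.
    return expected_len is not None
```

Because the function only iterates over its input, it should work on a list of lists as well as on a pandas Series of arrays. If it were passed as `inference_function` to `ww.type_system.add_type`, the `embedding_0` and `embedding_1` columns above would satisfy the check, while `id` and `code` would not.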