Lyonk71 / pandas-dedupe

Simplifies use of the Dedupe library via Pandas

pandas-dedupe==1.5.0 not compatible with dedupe>=3.0 (released on 27th June 2024) #64

Open gildastone opened 1 month ago

pandas-dedupe installs the latest version of dedupe, which is 3.0.3 as of now. However, when field_properties is defined in df_final = pandas_dedupe.dedupe_dataframe(df=df, field_properties=[...]), dedupe raises the following error:

  File "/.../lib/python3.11/site-packages/dedupe/api.py", line 1141, in __init__
    self.data_model = datamodel.DataModel(variable_definition)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../lib/python3.11/site-packages/dedupe/datamodel.py", line 32, in __init__
    raise ValueError(
ValueError: It looks like you are trying to use a variable definition composed of dictionaries. dedupe 3.0 uses variable objects directly. So instead of [{"field": "name", "type": "String"}] we now do [dedupe.variables.String("name")].

A quick and dirty fix I used to get dedupe>=3.0.3 working (just to unblock myself) was to update the utility function pandas_dedupe.utility_functions.select_fields(fields, field_properties) with:

if isinstance(i, String):
    fields.append(i)

where i is of type dedupe.variables.String, instead of the original:

if type(i)==str:
    fields.append({'field': i, 'type': 'String'})
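
For anyone who wants to keep passing plain column names, here is a rough sketch of a variant of the same idea. It is not the package's actual code: it assumes select_fields only needs to handle plain strings and already-built variable objects, it returns the list for convenience, and the string_variable parameter is my own addition. StringStandIn is a hypothetical stand-in so the snippet runs without dedupe installed; with dedupe>=3.0 you would pass dedupe.variables.String instead.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class StringStandIn:
    """Hypothetical stand-in for dedupe.variables.String (not part of dedupe)."""
    field: str


def select_fields(
    fields: List[Any],
    field_properties: List[Any],
    string_variable: Callable[[str], Any] = StringStandIn,
) -> List[Any]:
    """Append one variable object per entry in field_properties.

    Assumption: entries are either plain column names (str) or
    variable objects that the caller has already constructed.
    """
    for i in field_properties:
        if isinstance(i, str):
            # Plain column name: build a variable object instead of
            # the old {'field': i, 'type': 'String'} dictionary.
            fields.append(string_variable(i))
        else:
            # Anything else is assumed to be a variable object already.
            fields.append(i)
    return fields


if __name__ == "__main__":
    print(select_fields([], ["name", StringStandIn("city")]))
```

With dedupe>=3.0 installed, the call would look like select_fields([], ["name"], string_variable=dedupe.variables.String), which matches the [dedupe.variables.String("name")] form that the error message asks for.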

The last commit in this project dates from 4 years ago. Are there any plans to upgrade the package to be compatible with dedupe>=3.0 and drop compatibility with older versions? Is any help needed?