J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Support for pandas datatypes #179

Open devmcp opened 2 years ago

devmcp commented 2 years ago

pandas extension dtypes, such as the nullable integer type pd.Int64Dtype, do not seem to be supported:

import pandas as pd
import recordlinkage
from recordlinkage.datasets import load_febrl4

dfA, dfB = load_febrl4()

# Convert column types to pandas nullable integer (Int64):
dfA.postcode = pd.to_numeric(dfA.postcode).convert_dtypes()
dfB.postcode = pd.to_numeric(dfB.postcode).convert_dtypes()

# Indexation step
indexer = recordlinkage.Index()
indexer.block("given_name")
candidate_links = indexer.index(dfA, dfB)

# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.numeric("postcode", "postcode", label="postcode")

features = compare_cl.compute(candidate_links, dfA, dfB)

Running this gives the error:

TypeError: Cannot interpret 'Int64Dtype()' as a data type
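
A possible workaround, offered only as a sketch: cast the nullable columns back to a NumPy-backed dtype before the comparison step. The helper name `to_numpy_float` is hypothetical and not part of recordlinkage or pandas. It assumes float64 is acceptable for the column, since float64 is what lets missing values survive as NaN. Whether this avoids the error inside recordlinkage has not been verified here.

```python
import numpy as np
import pandas as pd


def to_numpy_float(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cast a nullable (extension) numeric Series
    to NumPy-backed float64, turning pd.NA into NaN."""
    return series.astype("float64")


# Small stand-in for the postcode column, including a missing value.
postcode = pd.Series([2000, None, 3056], dtype="Int64")
converted = to_numpy_float(postcode)

print(postcode.dtype)   # Int64
print(converted.dtype)  # float64
```

In the example from this issue, the same cast would be applied to dfA.postcode and dfB.postcode before calling compare_cl.compute.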