Deduper should be working with np.nan as well

Michael-E-Rose commented 6 years ago

After experimenting myself, it seems deduper expects missing values to be coded as None. First of all, I believe this information should be part of http://dedupeio.github.io/dedupe-examples/docs/csv_example.html.

However, many applications use pandas DataFrames, which code missing values as NumPy's NaN. I think dedupe should be able to handle both.

Below is the error message using data from recordlinkage:

import dedupe
import pandas as pd
from recordlinkage.datasets import load_febrl1

# Load the FEBRL1 sample data as a DataFrame; missing values come back as NaN
df = load_febrl1()

fields = [{'field': 'given_name', 'type': 'String', 'has missing': True},
          {'field': 'surname', 'type': 'String', 'has missing': True},
          {'field': 'street_number', 'type': 'ShortString', 'has missing': True},
          {'field': 'address_1', 'type': 'String', 'has missing': True},
          {'field': 'address_2', 'type': 'String', 'has missing': True},
          {'field': 'suburb', 'type': 'String', 'has missing': True},
          {'field': 'postcode', 'type': 'ShortString'},
          {'field': 'state', 'type': 'ShortString', 'has missing': True},
          {'field': 'date_of_birth', 'type': 'String', 'has missing': True},
          {'field': 'soc_sec_id', 'type': 'String'}]
deduper = dedupe.Dedupe(fields)
data_d = df.to_dict(orient='index')
deduper.sample(data_d)

results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/dedupe/api.py", line 811, in sample
    self.active_learner.sample_combo(data, blocked_proportion, sample_size)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 151, in sample_combo
    super(RLRLearner, self).sample_combo(*args)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 38, in sample_combo
    data)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/sampling.py", line 23, in blockedSample
    *args))
  File "/usr/local/lib/python3.5/dist-packages/dedupe/sampling.py", line 62, in dedupeSamplePredicates
    items)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/sampling.py", line 81, in dedupeSamplePredicate
    block_keys = predicate_function(column)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/predicates.py", line 308, in fingerprint
    return (u''.join(sorted(field.split())).strip(),)
AttributeError: 'float' object has no attribute 'split'

BTW, when I code missing values as empty strings, there's a strange error too:

data_d = df.fillna("").to_dict(orient='index')
deduper.sample(data_d)

results in

/usr/local/lib/python3.5/dist-packages/dedupe/sampling.py:39: UserWarning: 7500 blocked samples were requested, but only able to sample 7414
  % (sample_size, len(blocked_sample)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/dedupe/api.py", line 811, in sample
    self.active_learner.sample_combo(data, blocked_proportion, sample_size)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 151, in sample_combo
    super(RLRLearner, self).sample_combo(*args)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 49, in sample_combo
    self.distances = self.transform(self.candidates)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 89, in transform
    return self.data_model.distances(pairs)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/datamodel.py", line 82, in distances
    record_2[field])
  File "affinegap/affinegap.pyx", line 115, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1991)
  File "affinegap/affinegap.pyx", line 134, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1824)
ZeroDivisionError: float division

fgregg commented 6 years ago

https://dedupe.io/developers/library/en/latest/Variable-definition.html#missing-data

Michael-E-Rose commented 6 years ago

Thanks for the link. I also take it as a no to my suggestion that deduper should be able to handle NaNs.

fgregg commented 6 years ago

I do think it's better to have a single way of encoding missing data that works for all data types.
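
For readers hitting the same two errors, here is a minimal sketch of normalizing the record dictionary so that every missing value is None before it reaches dedupe. The helper name `normalize_missing` and its `blank_as_missing` flag are hypothetical and not part of dedupe. Treating blank strings as missing is an assumption, made here because the empty-string run above ended in a ZeroDivisionError.

```python
import math


def normalize_missing(records, blank_as_missing=True):
    """Return a copy of `records` with NaN (and optionally blank strings) set to None.

    Hypothetical helper, not part of dedupe. `records` is a dict of dicts,
    shaped like the output of df.to_dict(orient='index').
    """

    def clean(value):
        # np.nan and float('nan') are both floats, so math.isnan covers them
        if isinstance(value, float) and math.isnan(value):
            return None
        # Assumption: empty or whitespace-only strings also mean "missing"
        if blank_as_missing and isinstance(value, str) and not value.strip():
            return None
        return value

    return {
        record_id: {field: clean(value) for field, value in record.items()}
        for record_id, record in records.items()
    }


# Toy records standing in for df.to_dict(orient='index')
records = {
    "rec-0": {"given_name": "anna", "surname": float("nan"), "postcode": "2000"},
    "rec-1": {"given_name": "", "surname": "smith", "postcode": "2010"},
}
cleaned = normalize_missing(records)
```

With the DataFrame from the report, the call would presumably be `data_d = normalize_missing(df.to_dict(orient='index'))` ahead of `deduper.sample(data_d)`. Any field that can end up as None still needs `'has missing': True` in its variable definition.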

dedupeio / dedupe-examples

Deduper should be working with np.nan as well #71