Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/dedupe/api.py", line 811, in sample
    self.active_learner.sample_combo(data, blocked_proportion, sample_size)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 151, in sample_combo
    super(RLRLearner, self).sample_combo(*args)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 38, in sample_combo
    data)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/sampling.py", line 23, in blockedSample
    *args))
  File "/usr/local/lib/python3.5/dist-packages/dedupe/sampling.py", line 62, in dedupeSamplePredicates
    items)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/sampling.py", line 81, in dedupeSamplePredicate
    block_keys = predicate_function(column)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/predicates.py", line 308, in fingerprint
    return (u''.join(sorted(field.split())).strip(),)
AttributeError: 'float' object has no attribute 'split'
BTW, when I code missing values as empty strings, there's a strange error too:
/usr/local/lib/python3.5/dist-packages/dedupe/sampling.py:39: UserWarning: 7500 blocked samples were requested, but only able to sample 7414
  % (sample_size, len(blocked_sample)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/dedupe/api.py", line 811, in sample
    self.active_learner.sample_combo(data, blocked_proportion, sample_size)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 151, in sample_combo
    super(RLRLearner, self).sample_combo(*args)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 49, in sample_combo
    self.distances = self.transform(self.candidates)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/labeler.py", line 89, in transform
    return self.data_model.distances(pairs)
  File "/usr/local/lib/python3.5/dist-packages/dedupe/datamodel.py", line 82, in distances
    record_2[field])
  File "affinegap/affinegap.pyx", line 115, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1991)
  File "affinegap/affinegap.pyx", line 134, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1824)
ZeroDivisionError: float division
After some fiddling, it seems dedupe expects missing values to be coded as None. First of all, I believe this information should be part of http://dedupeio.github.io/dedupe-examples/docs/csv_example.html. However, many applications use pandas DataFrames, which code missing values as NaN from numpy. I think dedupe should be able to handle both. The first traceback above is the error message I get using data from recordlinkage.
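For anyone hitting the same two errors, here is a minimal sketch of the workaround I would try, assuming that None really is what dedupe expects for missing values (that is only what my fiddling suggests). The helper name clean_missing and the sample records are hypothetical and not part of dedupe; the input is assumed to be a dict mapping record ids to dicts of field values.

```python
import math


def clean_missing(records):
    """Return a copy of `records` with NaN and blank strings replaced by None.

    `records` is assumed to map record id -> dict of field name -> value.
    This helper is a hypothetical workaround, not part of dedupe.
    """
    def normalize(value):
        # float('nan') and numpy.nan are both floats, so math.isnan covers them
        if isinstance(value, float) and math.isnan(value):
            return None
        # Empty or whitespace-only strings are treated as missing as well
        if isinstance(value, str) and not value.strip():
            return None
        return value

    return {
        record_id: {field: normalize(value) for field, value in row.items()}
        for record_id, row in records.items()
    }


# Made-up records covering both cases from the tracebacks above
data = {
    0: {"name": "Alice Smith", "city": float("nan")},
    1: {"name": "", "city": "Berlin"},
}
cleaned = clean_missing(data)
```

With a pandas DataFrame, something like df.to_dict('index') should produce a dict of this shape before cleaning. Non-missing floats are left untouched, so string fields that contain real numbers would still need converting separately.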