dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

Unhashable type 'list' #474

Closed primoz-k closed 8 years ago

primoz-k commented 8 years ago

I'm retrieving my rows from a PostgreSQL database, and one of the retrieved columns is a jsonb array. When labeling is completed, I get `TypeError: unhashable type: 'list'` in the `__hash__()` method of a frozendict.

Example of self._d:

`self._d = {'locale': 'United States', 'contactpositions': [31474324], 'languages': 'Polish'}`.

I've tried converting the list to a tuple, but the tuple can itself contain dictionaries, which are still unhashable:

 `self._d = {'locale': 'United States', 'contactpositions': ({'id': 31474324, 'position': 'ceo'}, {...}), 'languages': 'Polish'}`.

The problem, as you can see, is with contactpositions. Are these types not supported and is there any way I can still use them?
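A minimal reproduction in plain Python, assuming the hash is computed over the record's `(key, value)` items (the actual frozendict implementation may differ; `is_hashable` is a hypothetical helper):

```python
def is_hashable(value):
    """Return True if hash(value) succeeds, False if it raises TypeError."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


record = {'locale': 'United States', 'contactpositions': [31474324], 'languages': 'Polish'}

# A list value makes the tuple of items unhashable.
list_case = is_hashable(tuple(record.items()))

# A tuple of plain integers is fine.
flat = dict(record, contactpositions=(31474324,))
flat_case = is_hashable(tuple(flat.items()))

# A tuple that contains dicts is still unhashable, because hashing a tuple
# hashes each of its elements.
nested = dict(record, contactpositions=({'id': 31474324, 'position': 'ceo'},))
nested_case = is_hashable(tuple(nested.items()))

print(list_case, flat_case, nested_case)
```

So converting only the outermost list is not enough; every nested list or dict has to be converted as well.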

fgregg commented 8 years ago

Right now, frozendict assumes that all the values of the original dictionary are hashable.

Let me give a little background on why we have frozendict at all.

When dedupe learns blocking rules, it keeps track of the pairs of records that each blocking rule covers. This is done by building sets of pairs of records. To be stored in a Python set, an object must be hashable, and so the records must be hashable.
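As a rough illustration of that constraint (not dedupe's actual code): the sketch below uses a `frozenset` of items as a stand-in for a hashable record, and `covered_pairs` and `by_city` are hypothetical names.

```python
from itertools import combinations

# Stand-in for hashable records: a frozenset of (field, value) items.
records = [
    frozenset({'name': 'Acme Inc', 'city': 'Chicago'}.items()),
    frozenset({'name': 'Acme Incorporated', 'city': 'Chicago'}.items()),
    frozenset({'name': 'Widget Co', 'city': 'Boston'}.items()),
]


def covered_pairs(records, predicate):
    """Return the set of record pairs that share a block key under `predicate`."""
    blocks = {}
    for record in records:
        # Adding a record to a set requires the record to be hashable.
        blocks.setdefault(predicate(record), set()).add(record)
    pairs = set()
    for members in blocks.values():
        for a, b in combinations(members, 2):
            pairs.add(frozenset((a, b)))  # unordered pair of records
    return pairs


def by_city(record):
    """Toy blocking rule: records block together when their city matches."""
    return dict(record)['city']


pairs = covered_pairs(records, by_city)
print(len(pairs))
```

Replacing the stand-in records with plain dicts would fail at the first `set.add` call with `TypeError: unhashable type: 'dict'`.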

It's possible to have a different design and keep track of hashable ids that refer to the records. I've tried this a few times, and it added a lot of complexity to the design.
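A sketch of what that alternative might look like, assuming each record already has a hashable id (the names here are hypothetical, and this is not how dedupe is implemented):

```python
from itertools import combinations

# Records keyed by id; the values may contain unhashable lists or dicts.
records_by_id = {
    1: {'city': 'Chicago', 'contactpositions': [31474324]},
    2: {'city': 'Chicago', 'contactpositions': [{'id': 7, 'position': 'ceo'}]},
    3: {'city': 'Boston', 'contactpositions': []},
}


def covered_id_pairs(records_by_id, predicate):
    """Return the set of (id, id) pairs whose records share a block key."""
    blocks = {}
    for record_id, record in records_by_id.items():
        blocks.setdefault(predicate(record), []).append(record_id)
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))  # only the ids are hashed, never the records
    return pairs


id_pairs = covered_id_pairs(records_by_id, lambda record: record['city'])
print(id_pairs)
```

The cost is that every consumer of a pair has to look the records up again by id, which is where the extra complexity comes from.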

Okay, so there are three ways forward:

  1. You can, in your own code, make sure that the values in the original dictionaries are all hashable.
  2. The frozendict hash method could be modified to deal with unhashable values (probably by casting them to something hashable).
  3. We could try, again, to remove frozendict and use references to records. This may be more feasible now, since I recently rewrote a good portion of the block learning code.

If you wanted to work on this, I would say the second option is probably the best.
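For comparison, the first option can be done entirely in user code before the rows reach dedupe. A minimal sketch, assuming it is acceptable to serialize list and dict values to JSON strings (`hashable_row` is a hypothetical helper):

```python
import json


def hashable_row(row):
    """Return a copy of `row` with list and dict values serialized to JSON strings."""
    return {
        key: json.dumps(value, sort_keys=True) if isinstance(value, (list, dict)) else value
        for key, value in row.items()
    }


row = {
    'locale': 'United States',
    'contactpositions': [{'position': 'ceo', 'id': 31474324}],
    'languages': 'Polish',
}
cleaned = hashable_row(row)

# Every value is now a string, so hashing the items succeeds.
row_hash = hash(frozenset(cleaned.items()))
print(cleaned['contactpositions'])
```

`sort_keys=True` keeps the serialized form stable regardless of key order. The field is then an ordinary string, which may or may not suit how it should be compared.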

primoz-k commented 8 years ago

Perfect. In the meantime I have already modified the hash method so that it handles lists and dictionaries by casting them to tuples.

I am just glad I was going in the right direction. If you want, I will submit a PR once I've made this method a bit more bulletproof.

fgregg commented 8 years ago

I'd like to see a PR, for sure. Probably needs a recursive design.
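One possible shape for such a recursive conversion, as a sketch only (`to_hashable` and `record_hash` are hypothetical names, not the code from any actual PR):

```python
def to_hashable(value):
    """Recursively convert dicts, lists, tuples and sets into hashable equivalents."""
    if isinstance(value, dict):
        # frozenset makes the result independent of key order.
        return frozenset((key, to_hashable(item)) for key, item in value.items())
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(item) for item in value)
    if isinstance(value, set):
        return frozenset(to_hashable(item) for item in value)
    return value  # assumed to be hashable already (str, int, None, ...)


def record_hash(record):
    """Hash a record whose values may be arbitrarily nested containers."""
    return hash(to_hashable(record))


a = {'locale': 'United States',
     'contactpositions': [{'id': 31474324, 'position': 'ceo'}],
     'languages': 'Polish'}
b = {'languages': 'Polish',
     'contactpositions': [{'position': 'ceo', 'id': 31474324}],
     'locale': 'United States'}

print(record_hash(a) == record_hash(b))
```

Equal records hash equally regardless of key order, while list order is preserved because lists become tuples.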

fgregg commented 8 years ago

I finally removed the requirement that records be hashable, in commit c4c67bba25c3f53d0668cf13016a32df38c0c10c.