dedupeio / dedupe

:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License
4.16k stars 551 forks source link

Reference plugin variables using "module:class" strings #1122

Closed NickCrews closed 5 months ago

NickCrews commented 1 year ago

Will close https://github.com/dedupeio/dedupe/issues/1085

This probably could do with some more polishing, eg removing type from every Variable class, and swapping if variable_type == "Interaction" to parsing the variable type, then checkingif isinstance(variable_type, dedupe.variables.InteractionType)

coveralls commented 1 year ago

Coverage Status

Coverage remained the same at 74.136% when pulling bad32f667c137eeb41126832d5f6a9f092741c1a on NickCrews:cusotm-vars-str-spec into b3efd1e3cd50c2974d08e77a5f94c418927ffd8f on dedupeio:main.

fgregg commented 1 year ago

i like the simplicity of this approach, but it will be a breaking change for data models that include datetime. in this code this is only working because of the surviving name spacing bits we are trying to get away from.

it will also be a be a breaking change for data models that include the name, address, and fuzzy category types that are not bundled with dedupe but are recommended in the docs

while this pattern is nicer for devs, i don’t like that if a library’s file structure changes, that can break a downstream users code. the file structure should not have to be part of an external API (though i grant that it often is)

NickCrews commented 1 year ago

i like the simplicity of this approach, but it will be a breaking change for data models that include datetime. in this code this is only working because of the surviving name spacing bits we are trying to get away from.

Hmmm, you're right. Two possible solutions I see:

  1. Merge the datetime package into this library. It's already listed as a dep of dedupe, and I assume that it isn't used by some 3rd package. I don't understand why datetime is given this special treatment but name, address, and fuzzycategory are not?
  2. Update the datetime package so that it exports itself in a more normal way. Then here in variables/__init__.py we can say from dedupe_variable_datetime import DateTimeType as DateTime

it will also be a breaking change for data models that include the name, address, and fuzzy category types that are not bundled with dedupe but are recommended in the docs

Hmm, that's a good point. I was just thinking of breaking changes to people using their own custom plugins. I think the best option here is temporarily have a hardcoded mapping for these edge cases (eg "FuzzyCategorical"->"dedupe_variable_fuzzycategory:FuzzyCategorical"), and if you use them then you get a deprecation warning. We can remove them later, after users get a chance to switch. I think eventually they should follow the same rules as other 3rd party plugin variables, since they are external and need to be pip installed.

while this pattern is nicer for devs, i don’t like that if a library’s file structure changes, that can break a downstream users code. the file structure should not have to be part of an external API (though i grant that it often is)

That's a good point. I think that is the responsibility of the plugin authors and is easily avoided. e.g. in a zipcodelib package, in the root __init__.py, they should from zipcodelib.foo.bar import ZipCodeVariable, so that users can just use "zipcodelib:ZipVariable". Then the plugin author is free to move ZipCodeVariable from zipcodelib/foo/bar.py to zipcodelib/baz.py, adjust the import in the root __init__.py, and everything still works.

fgregg commented 5 months ago

closed by #1193