dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

purify id types #1138

Closed fgregg closed 1 year ago

fgregg commented 1 year ago

When I was using dedupe in another project, I noticed that in many places we had said we expected record ids to be a mixture of integers and strings. In reality, we need the record ids to be either purely integers or purely strings. This fixes that typing.
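To illustrate the distinction, here is a minimal sketch of how "purely integers or purely strings" differs from "a mixture" in Python type hints. The names `LooseRecordID`, `RecordIDT`, `Record`, and `check_homogeneous_ids` are hypothetical and are not taken from dedupe's `_typing.py`; the actual change in this PR may be expressed differently.

```python
from typing import Mapping, TypeVar, Union

# Loose alias (hypothetical): a mapping keyed by this may mix ints and strs.
LooseRecordID = Union[int, str]

# Constrained type variable (hypothetical): within a single call it
# resolves to int or to str, never to a mixture of the two.
RecordIDT = TypeVar("RecordIDT", int, str)

Record = Mapping[str, object]


def check_homogeneous_ids(data: Mapping[RecordIDT, Record]) -> type:
    """Return the single id type used as keys in `data`.

    Raises TypeError if the keys mix types or use a type other than
    int or str, and ValueError if `data` is empty.
    """
    kinds = {type(key) for key in data}
    if not kinds:
        raise ValueError("data is empty, so the id type cannot be determined")
    if len(kinds) > 1 or not kinds <= {int, str}:
        names = sorted(kind.__name__ for kind in kinds)
        raise TypeError(f"record ids must be all int or all str, got {names}")
    return kinds.pop()
```

With the constrained type variable, a static checker such as mypy should flag a call that passes a dict with both `int` and `str` keys, while the runtime check catches the same mistake for untyped callers.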

It's not quite done, because it is possible for the two datasets in RecordLinkage and Gazetteer code to have different id types, and I haven't handled that possibility yet.

closes #1136

codecov[bot] commented 1 year ago

Codecov Report

Base: 73.66% // Head: 73.84% // Increases project coverage by +0.18% :tada:

Coverage data is based on head (b9b1768) compared to base (f0503e0). Patch coverage: 80.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1138      +/-   ##
==========================================
+ Coverage   73.66%   73.84%   +0.18%
==========================================
  Files          28       28
  Lines        2221     2294      +73
==========================================
+ Hits         1636     1694      +58
- Misses        585      600      +15
```

| [Impacted Files](https://codecov.io/gh/dedupeio/dedupe/pull/1138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio) | Coverage Δ | |
|---|---|---|
| [dedupe/api.py](https://codecov.io/gh/dedupeio/dedupe/pull/1138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio#diff-ZGVkdXBlL2FwaS5weQ==) | `44.44% <64.00%> (+1.99%)` | :arrow_up: |
| [dedupe/labeler.py](https://codecov.io/gh/dedupeio/dedupe/pull/1138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio#diff-ZGVkdXBlL2xhYmVsZXIucHk=) | `76.24% <70.58%> (-0.47%)` | :arrow_down: |
| [dedupe/training.py](https://codecov.io/gh/dedupeio/dedupe/pull/1138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio#diff-ZGVkdXBlL3RyYWluaW5nLnB5) | `63.49% <72.00%> (+0.56%)` | :arrow_up: |
| [dedupe/\_typing.py](https://codecov.io/gh/dedupeio/dedupe/pull/1138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio#diff-ZGVkdXBlL190eXBpbmcucHk=) | `92.95% <100.00%> (+2.39%)` | :arrow_up: |
| [dedupe/canonical.py](https://codecov.io/gh/dedupeio/dedupe/pull/1138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio#diff-ZGVkdXBlL2Nhbm9uaWNhbC5weQ==) | `96.96% <100.00%> (ø)` | |
| [dedupe/convenience.py](https://codecov.io/gh/dedupeio/dedupe/pull/1138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio#diff-ZGVkdXBlL2NvbnZlbmllbmNlLnB5) | `35.93% <100.00%> (+4.27%)` | :arrow_up: |
| [dedupe/datamodel.py](https://codecov.io/gh/dedupeio/dedupe/pull/1138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio#diff-ZGVkdXBlL2RhdGFtb2RlbC5weQ==) | `88.11% <100.00%> (ø)` | |

Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio)


fgregg commented 1 year ago

> It's not quite done, because it is possible for the two datasets in RecordLinkage and Gazetteer code to have different id types, and I haven't handled that possibility yet.

Actually, because of the way that the distance functions interact with numpy, the id types do need to be homogeneous.
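The thread does not spell out the exact interaction with numpy. One general numpy behaviour that makes mixed id types hazardous is that an array holds a single dtype, so a mix of integers and strings is silently coerced to strings. The sketch below only demonstrates that general behaviour; it is an assumption that this is related to the issue in dedupe's distance code, and the variable names are illustrative.

```python
import numpy as np

# Purely integer ids keep an integer dtype.
int_ids = np.array([1, 2, 3])

# Mixed ids are coerced to a single unicode string dtype,
# so the integer 1 is stored as the string "1".
mixed_ids = np.array([1, "a", 3])

print(int_ids.dtype.kind)    # 'i' (signed integer)
print(mixed_ids.dtype.kind)  # 'U' (unicode string)
print(mixed_ids[0] == 1)     # False: the stored value is "1", not 1
```

After such a coercion, looking up a record by its original integer id no longer matches, which is one reason to require a single id type up front.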