ParticularMiner / red_string_grouper

Record Equivalence Discoverer based on String Grouper
MIT License

Defaults and keyword argument format now available #2

Open ParticularMiner opened 2 years ago

ParticularMiner commented 2 years ago

@iibarant @berndnoll

Defaults and keyword argument format now available for Red String Grouper. Weights, for instance, have a default value of 1.0 as requested by @berndnoll. Here's how a sample call would now look like:

from red_string_grouper import record_linkage, field

matches = record_linkage(
    df,
    fields_2b_matched_fuzzily=[
        field('statusText'),
        field('address'),
        field('addressZipcode', weight=2, min_similarity=0.9999)
    ],
    fields_2b_matched_exactly=[
        field('addressState', weight=4),
        field('hasVideo')
    ]
)

I'd appreciate it if you or anyone else would take a look at these new changes and test that they are working as prescribed. Thanks.
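As a rough illustration of what the weight keyword does, the combined score can be thought of as a weighted mean of the per-field similarities (a simplified sketch of the idea only, not the package's actual code; the helper name is made up):

```python
def weighted_mean_similarity(scores_and_weights):
    # scores_and_weights: iterable of (similarity, weight) pairs,
    # one pair per matched field
    pairs = list(scores_and_weights)
    total_weight = sum(w for _, w in pairs)
    return sum(s * w for s, w in pairs) / total_weight

# five fields as in the call above: three fuzzy (default weight 1.0,
# except addressZipcode with weight 2) and two exact (addressState
# with weight 4, hasVideo with weight 1); exact matches contribute 1.0
score = weighted_mean_similarity(
    [(0.8, 1.0), (0.9, 1.0), (1.0, 2.0), (1.0, 4.0), (1.0, 1.0)]
)
```

Fields with larger weights, such as addressState above, pull the combined score toward their own similarity.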

iibarant commented 2 years ago

Hi @ParticularMiner, that is a nice update. However, I copied the latest version and it shows an error at the very last line. Is that line needed, given that you already have a return statement? (screenshot attached)

I also noticed that similarity_dtype=np.float32 is missing from the record_linkage call. Is it no longer going to be used? Another question is about ngrams: is it fixed at 3 going forward, or can it be part of the call? A good exercise would be to explore the effect of the ngrams value (3, 4, 5) on performance.

Thank you very much for your hard work.

ParticularMiner commented 2 years ago

Thanks @iibarant

That's a careless mistake. I'll update the code shortly. In the meantime, yes, you may remove it. Cheers

iibarant commented 2 years ago

Please see my addition in the previous comment.

ParticularMiner commented 2 years ago

@iibarant

Package has now been updated to version 0.0.8 with the correction.

To answer your other questions: all the other options are still there. You only need to use the string_grouper names for these options (follow this link). For instance,

similarity_dtype is now tfidf_matrix_dtype; ngram_size remains the same.

You can specify any option within the auxiliary function field(), or directly in record_linkage(), in which case the option applies to all fields. However, options specified for an individual field take precedence over those specified directly in record_linkage(). For example,

import numpy as np

matches = record_linkage(
    df,
    fields_2b_matched_fuzzily=[
        field('statusText'),
        field('address'),
        field('addressZipcode', weight=2, min_similarity=0.9999),
        field('addressState', weight=4, ngram_size=2, min_similarity=0.9999),
        field('hasVideo', min_similarity=0.9999)
    ],
    n_blocks=(1,1),
    tfidf_matrix_dtype=np.float32
)

I hope this is clear.

ParticularMiner commented 2 years ago

@iibarant

I would be interested to know your results of explorations into varying ngram_size

iibarant commented 2 years ago

@ParticularMiner, with all parameters the same except ngrams, for a data set of 131056 records, the runtime breakdown is as follows:

ngram_size   runtime (sec)
3            131.266
4            116.601
5            110.756

ParticularMiner commented 2 years ago

@iibarant Good. I suspected that would be the case, since the matrices become sparser and easier to compute.
If you like, try increasing the n_blocks parameter to, say, n_blocks=(1, 50).
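To see why larger n-grams make the similarity matrix sparser, compare how many character n-grams two similar strings share as n grows (a self-contained illustration; ngrams here is a toy helper, not part of string_grouper):

```python
def ngrams(s, n):
    # all contiguous character n-grams of s
    return [s[i:i + n] for i in range(len(s) - n + 1)]

a = "123 main street"
b = "124 main str"
# fewer shared n-grams as n grows -> fewer nonzero entries in the
# tf-idf cosine-similarity matrix -> less work per multiplication
shared = {n: len(set(ngrams(a, n)) & set(ngrams(b, n))) for n in (3, 4, 5)}
```

This matches the runtimes above: n = 5 runs fastest because the sparser matrix is cheaper to multiply.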

iibarant commented 2 years ago

Hi @ParticularMiner,

Here we go on a dataframe of 33788 records:

n_blocks   n = 3   n = 4   n = 5
1          7.31    7.19    7.01
2          7.15    7.1     6.91
3          7.4     7.13    6.89
4          8.79    7.15    7.02
5          8.22    7.39    6.9
6          8.37    7.16    6.98
7          8.23    7.19    7.0
8          8.4     7.28    7.08
9          8.33    7.21    7.04
10         8.32    7.3     6.98
11         8.37    7.4     7.01
12         8.35    7.26    7.14
13         8.4     7.25    7.11
14         8.38    7.28    7.35
15         8.95    7.85    7.91
16         9.03    7.91    7.77
17         9.12    8.01    8.13
18         9.14    8.12    7.99
19         9.29    8.15    8.21
20         9.31    8.19    8.34
21         9.27    8.36    8.05
22         9.45    8.65    8.29
23         9.5     8.49    8.49
24         9.54    8.52    8.45
25         9.72    8.58    8.39
26         9.64    8.7     8.43
27         9.79    8.74    8.43
28         9.78    8.7     8.71
29         8.91    8.2     7.86
30         9.07    7.89    7.88
31         9.04    7.95    7.9
32         9.0     7.93    8.04
33         8.75    7.64    7.72
34         8.83    7.72    7.57
35         8.84    7.71    7.67
36         9.01    7.72    7.71
37         8.97    7.71    8.0
38         8.88    7.75    8.07
39         8.89    7.86    7.72
40         8.95    7.95    8.03
41         9.05    7.95    8.23
42         9.03    7.91    8.16
43         9.08    8.06    8.43
44         9.14    8.0     7.67
45         9.07    8.09    7.69
46         9.06    8.01    8.1
47         9.08    8.15    8.14
48         9.06    8.16    7.97
49         9.13    8.56    8.15
50         9.23    8.18    8.18

ParticularMiner commented 2 years ago

Thanks @iibarant

Question: you are changing n_blocks[1] and not n_blocks[0], right?

If that's the case, then the increase in runtime is likely due to your DataFrame being too small. Hopefully, a decrease in runtime with increasing n_blocks[1] will appear for a larger DataFrame, with a number of rows on the order of 300 000.
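As I understand the feature, n_blocks=(a, b) splits the left and right operands of the underlying tf-idf matrix multiplication into a and b contiguous chunks respectively, so that each sub-product fits in memory. A toy sketch of the chunking (illustrative only, not string_grouper's actual code):

```python
def split_into_blocks(items, b):
    # divide items into b contiguous chunks of near-equal size
    k, r = divmod(len(items), b)
    blocks, start = [], 0
    for i in range(b):
        end = start + k + (1 if i < r else 0)
        blocks.append(items[start:end])
        start = end
    return blocks

rows = list(range(300_000))
# with n_blocks=(1, 50) only the right operand is subdivided, so each
# of the 50 sub-multiplications touches ~1/50th of the right matrix
right_blocks = split_into_blocks(rows, 50)
```

On a small DataFrame the per-block overhead dominates, which would explain the runtimes creeping up with n_blocks[1] above.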

iibarant commented 2 years ago

Hi @ParticularMiner,

The original string_grouper package allows matching a field from one dataframe to a field from another dataframe, and I'm sure this is possible with your update. What would be the right code to match, say, two fields from one df to two fields from another dataframe fuzzily, with given weights, plus one field matched exactly across both dataframes?

According to your docstring, such a case is not considered: ''' Function that combines similarity-matching results of several fields of a DataFrame and returns them in another DataFrame :param data_frame: pandas.DataFrame of strings. '''

That would be a great addition to the package.

Thank you very much!

ParticularMiner commented 2 years ago

@iibarant

Thanks for your message.

It was already on my TODO list. I hope to get round to doing it soon. In fact, as soon as the new string_grouper update is merged I intend to do that.

ParticularMiner commented 2 years ago

@iibarant

The latest update to Red String Grouper still needs to be tested, but see if the following makes sense:

(Note that all previous functionality with one DataFrame has been retained. The following attempts to compare two DataFrames.)

import pandas as pd
import numpy as np
from red_string_grouper import record_linkage, field, field_pair
inputfilename = 'data/us-cities-real-estate-sample-zenrows.csv'
df = pd.read_csv(inputfilename, dtype=str)
df.set_index('zpid', inplace=True)
df1 = df[['imgSrc', 'addressState', 'statusText']]
df2 = df[['detailUrl', 'addressState', 'statusText']]
matches = record_linkage(
    [df1, df2],
    fields_2b_matched_fuzzily=[field_pair('statusText', 'statusText', weight=3),
                               field_pair('imgSrc', 'detailUrl', weight=4, regex=r'https://|.zillow.com|[,-./:]|\s', min_similarity=0.05)],
    fields_2b_matched_exactly=[field_pair('addressState', 'addressState', weight=4)],
    n_blocks=(1,1))
matches.sort_values(('Fuzzily Matched Fields', 'imgSrc/detailUrl', 'similarity'), ascending=False)
The output columns form a MultiIndex: a 'Weighted Mean Similarity Score', an 'Exactly Matched Fields' group (addressState/addressState), and a 'Fuzzily Matched Fields' group (statusText/statusText and imgSrc/detailUrl), where each fuzzy pair shows the left value, the similarity, and the right value. Rows are indexed by (left_zpid, right_zpid):

96757735 9469011 0.680908 PA Townhouse for sale 1.0 Townhouse for sale https://photos.zillowstatic.com/fp/e86b55f8ff4... 0.122498 https://www.zillow.com/homedetails/1115-W-9th-...
2075301422 2075905715 0.676702 OK Lot / Land for sale 1.0 Lot / Land for sale https://photos.zillowstatic.com/fp/33759096aea... 0.110931 https://www.zillow.com/homedetails/Moccasin-Tr...
2100136081 2070606357 0.674255 TX Lot / Land for sale 1.0 Lot / Land for sale https://photos.zillowstatic.com/fp/81739685ade... 0.104201 https://www.zillow.com/homedetails/Fm-344-Rd-W...
33628288 2076543613 0.673813 OH Multi-family home for sale 1.0 Multi-family home for sale https://photos.zillowstatic.com/fp/17d6ff62ed2... 0.102986 https://www.zillow.com/homedetails/105-1-2-N-P...
84501900 74747435 0.672341 MI House for sale 1.0 House for sale https://photos.zillowstatic.com/fp/d64d0568f30... 0.098939 https://www.zillow.com/homedetails/Townsend-Rd...
... ... ... ... ... ... ... ... ...
2077977874 86740833 0.654574 NH Active 1.0 Active https://photos.zillowstatic.com/fp/b1404f9e886... 0.050079 https://www.zillow.com/homedetails/59-Maple-Ln...
92848559 86797118 0.654568 NH Active 1.0 Active https://photos.zillowstatic.com/fp/8246797fd69... 0.050061 https://www.zillow.com/homedetails/39-John-St-...
2077102415 95343578 0.654566 NH Active 1.0 Active https://photos.zillowstatic.com/fp/9c8b1983578... 0.050057 https://www.zillow.com/homedetails/155-Barrett...
86793455 2070175193 0.654554 NH Active 1.0 Active https://photos.zillowstatic.com/fp/155c840ddee... 0.050023 https://www.zillow.com/homedetails/208-26-Hamm...
32607416 32816743 0.654553 NY House for sale 1.0 House for sale https://photos.zillowstatic.com/fp/9d07d42d4dd... 0.050021 https://www.zillow.com/homedetails/70-Main-St-...

638 rows × 8 columns

ParticularMiner commented 2 years ago

@iibarant

Because I lack meaningful data. I would appreciate it very much if you could form a meaningful example using two DataFrames that I could include in the README.md file (that is, the documentation). Alternatively you could edit the README.md file yourself and post a pull request.

iibarant commented 2 years ago

Hi @ParticularMiner,

Today is Thanksgiving day in Canada and I’m far from my laptop. Will do my best tomorrow.

Thank you!


ParticularMiner commented 2 years ago

@iibarant

No problem! Enjoy Thanksgiving!

iibarant commented 2 years ago

Hi @ParticularMiner,

Why do I get the following error message:

ImportError: cannot import name 'field' from 'red_string_grouper' (/opt/anaconda3/lib/python3.8/site-packages/red_string_grouper/__init__.py)

while running from red_string_grouper import record_linkage, field, field_pair

The terminal did not show any updates. (screenshot attached)

ParticularMiner commented 2 years ago

@iibarant

I think pip is reading from the cache. Try pip install --upgrade red_string_grouper

iibarant commented 2 years ago

@ParticularMiner

There are 2 error messages following that command

pip install -- upgrade red_string_grouper

ERROR: Could not find a version that satisfies the requirement upgrade (from versions: none)
ERROR: No matching distribution found for upgrade

ParticularMiner commented 2 years ago

@iibarant

That's weird. I get no such errors. Let me check the macOS install test scripts.

ParticularMiner commented 2 years ago

@iibarant

The macOS test scripts seem to be working:

https://github.com/ParticularMiner/red_string_grouper/runs/3869210363?check_suite_focus=true

ParticularMiner commented 2 years ago

@iibarant

It seems you are also using conda; in that case, try conda update red_string_grouper

iibarant commented 2 years ago

@ParticularMiner

PackageNotInstalledError: Package is not installed in prefix. prefix: /opt/anaconda3 package name: red_string_grouper

ParticularMiner commented 2 years ago

Then try

conda install red_string_grouper

iibarant commented 2 years ago

Already did:

conda install red_string_grouper
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

Current channels:

To search for alternate channels that may provide the conda package you're looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

ParticularMiner commented 2 years ago

@iibarant Ok. It seems the conda channels don't yet have red_string_grouper.

So first deactivate: conda deactivate

Then try: pip install --upgrade red_string_grouper

iibarant commented 2 years ago

@ParticularMiner

In this case I get pip: command not found

But activating conda again and running pip install --upgrade red_string_grouper did the trick:

Successfully uninstalled red-string-grouper-0.0.2
Successfully installed red-string-grouper-0.1.0.post1 sparse-dot-topn-for-blocks-0.3.1.post3 topn-0.0.7

ParticularMiner commented 2 years ago

@iibarant

Weird ... but OK. If it worked, that's fine!

iibarant commented 2 years ago

@ParticularMiner,

That's for sure. But when running the following code


matches = record_linkage(
    [df, df_to_match],
    fields_2b_matched_fuzzily=[field_pair('full address', 'ADDRESSES', weight=3),
                               field_pair('name', 'NAME', weight=1, min_similarity=0)],
    fields_2b_matched_exactly=[field_pair('foad', 'STATE', weight=6)],
    n_blocks=(1,1)) 

I receive this error message:

File "/opt/anaconda3/lib/python3.8/site-packages/red_string_grouper/red_string_grouper.py", line 316, in record_linkage
    global_config = StringGrouperConfig(**kwargs)

TypeError: __new__() got an unexpected keyword argument 'n_blocks'

Was n_blocks deactivated?

ParticularMiner commented 2 years ago

@iibarant

That one is on me. It's because, here on my machine, I'm importing the new string_grouper, while you are using the old one. I'm still waiting for the maintainer to update string_grouper to the newest version.

Let me see if I can fix it in the meantime.

ParticularMiner commented 2 years ago

@iibarant

OK. It should be working now. You need to pip install --upgrade red_string_grouper

Perhaps you would need to repeat the same magic you used above. 😄

iibarant commented 2 years ago

Hi @ParticularMiner,

I did that, and it seems to have installed the package. However, running the following code

from red_string_grouper import record_linkage, field, field_pair

matches = record_linkage(
    [df, df_to_match],
    fields_2b_matched_fuzzily=[field_pair('full address', 'ADDRESSES', weight=3, min_similarity=0.5),
                               field_pair('name', 'NAME', weight=1, min_similarity=0)],
    fields_2b_matched_exactly=[field_pair('foad', 'STATE', weight=6)],
    n_blocks=(1,1))

I got this error:

raise TypeError('Master input does not consist of pandas.Series containing only Strings')

TypeError: Master input does not consist of pandas.Series containing only Strings

ParticularMiner commented 2 years ago

@iibarant

This usually happens when there are non-strings present in the data, like Null/None/'', numbers, etc.
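For instance, filling missing values and coercing the matched columns to str usually clears this error (a sketch under that assumption; stringify is a hypothetical helper, not part of the package):

```python
import pandas as pd

def stringify(df, columns):
    # fill missing values and coerce each matched column to str so that
    # string_grouper's "only Strings" check passes
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna('').astype(str)
    return out

# a toy frame with a None and a numeric cell, both of which would trip the check
df = pd.DataFrame({'ADDRESS': ['1 Main St', None], 'STATE': ['NY', 12]})
clean = stringify(df, ['ADDRESS', 'STATE'])
```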

ParticularMiner commented 2 years ago

@iibarant

Follow this link for a possible solution.

iibarant commented 2 years ago

@ParticularMiner,

fillna('') did the job. I'm just busy with other projects.


iibarant commented 2 years ago

Hi @ParticularMiner,

When I try to use the previous approach with only one df to find duplicates, I get error messages:

record_matches = record_linkage(df_to_check,
               fields_2b_matched_fuzzily=[field('ADDRESS', min_similarity=0.8, ngram_size=3, weight=1.5)],
               fields_2b_matched_fuzzily=[field('NAME', min_similarity=0.8, ngram_size=3, weight=2.5)],
               fields_2b_matched_exactly=[('STATE', weight=3)],
               hierarchical=False,
               max_n_matches=8200,
               force_symmetries=True,
               n_blocks=(1, 1))

(screenshot of the error attached)

Removing weight from STATE, the error becomes: (screenshot attached)

What am I supposed to do here?

ParticularMiner commented 2 years ago

Hi @iibarant

There can only be one fields_2b_matched_fuzzily argument. Remember? It is a single list of one or more fields.

record_matches = record_linkage(df_to_check,
               fields_2b_matched_fuzzily=[field('ADDRESS', min_similarity=0.8, ngram_size=3, weight=1.5),
                                          field('NAME', min_similarity=0.8, ngram_size=3, weight=2.5)],
               fields_2b_matched_exactly=[field('STATE', weight=3)],
               hierarchical=False,
               max_n_matches=8200,
               force_symmetries=True,
               n_blocks=(1, 1))

iibarant commented 2 years ago

Great, thanks. I've got a lot on my plate right now.


ParticularMiner commented 2 years ago

Take your time @iibarant

No pressure! There's no deadline. At least not from me. I can wait.

iibarant commented 2 years ago

@ParticularMiner,

I got another error message using the exact code you shared:

record_matches = record_linkage(df_to_check,
               fields_2b_matched_fuzzily=[field('ADDRESS', min_similarity=0.8, ngram_size=3, weight=1.5),
                                          field('NAME', min_similarity=0.8, ngram_size=3, weight=2.5)],
               fields_2b_matched_exactly=[field('STATE', weight=3)],
               hierarchical=False,
               max_n_matches=8200,
               force_symmetries=True,
               n_blocks=(1, 1))

(screenshots of the errors attached)

ParticularMiner commented 2 years ago

@iibarant

What's happening here is that you've got only trivial matches, that is, only records that match themselves. You will need to lower the min_similarity parameter to get meaningful matches. I guess that since you've got only a few records in the dataframe, the similarity scores will be quite low, so min_similarity should be very low indeed.
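To get a feel for why a strict threshold can leave only trivial matches, here is a toy cosine similarity over character n-gram counts (an illustration of the principle, not string_grouper's tf-idf implementation, and the strings are made up):

```python
from collections import Counter
from math import sqrt

def ngram_cosine(a, b, n=3):
    # cosine similarity between character n-gram count vectors
    ca = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    cb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# two plausible near-duplicate addresses can score well below a 0.8 cutoff
score = ngram_cosine('123 main street apt 4', '123 main st #4')
```

With min_similarity=0.8 a pair like this would be discarded even though it is probably the same record.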

What about ngram_size for 'STATE'? I have often seen you use a value of 2. Not this time?

iibarant commented 2 years ago

Hi @ParticularMiner, sorry for the delay; I have a very busy schedule.

Reducing min_similarity did not help: instead of the earlier result (first screenshot) I got the error in the second screenshot.

The STATE field uses 2-letter abbreviations, so 3-grams would be irrelevant, but I thought that parameter is not needed for exact matching, so I only specified the weight.

Thank you!

ParticularMiner commented 2 years ago

@iibarant

Oh yes, you are right. I failed to realize STATE was one of the exact fields. Sorry. What value did you use for min_similarity this time?

iibarant commented 2 years ago

0.1. But it looks like the new update works with two dataframes yet gives an error when I use only one df to find duplicates.

ParticularMiner commented 2 years ago

@iibarant

I've tried it with one dataframe here and it has worked. But let me try it once more!

iibarant commented 2 years ago

Here's exactly what I tried:

record_matches = record_linkage(df_to_check,
               fields_2b_matched_fuzzily=[field('ADDRESS', min_similarity=0.1, ngram_size=3, weight=1.5),
                                          field('NAME', min_similarity=0.1, ngram_size=3, weight=2.5)],
               fields_2b_matched_exactly=[field('STATE', weight=6)],
               hierarchical=False,
               max_n_matches=500,
               force_symmetries=True,
               n_blocks=(1, 1))

ParticularMiner commented 2 years ago

@iibarant

Sorry. That bug is my fault. I had introduced parentheses in an expression that were supposed to group operations, but Python interpreted them as a tuple instead.

Anyway, I think I've fixed it now. You would have to pip install --upgrade red_string_grouper again. Let me know if it works. Sorry again!
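The parenthesis-versus-tuple trap described above is easy to reproduce in isolation (a generic illustration, unrelated to the actual package code):

```python
# parentheses continuing an expression across lines are harmless...
x = (1 + 2
     + 3)

# ...but a stray comma inside them silently builds a tuple instead:
# (1 + 2, +3), where the second element is unary plus applied to 3
y = (1 + 2,
     + 3)
```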

iibarant commented 2 years ago

@ParticularMiner, thank you, it works now!

Besides many caveat warnings, everything is good. For now I set warnings.filterwarnings("ignore"), but going forward, for a library, it would be good to resolve them.

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

ParticularMiner commented 2 years ago

@iibarant

Great! Yes, that warning is difficult to resolve because I cannot tell which line it is coming from. Does your machine give a traceback log?