Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

created a more user-friendly error message when bad data is found #56

Open ParticularMiner opened 3 years ago

ParticularMiner commented 3 years ago

Notify: @Bergvca

Here I suggest a solution to one user's problem (see https://github.com/Bergvca/string_grouper/pull/43#issuecomment-824591895). It was a bit more difficult to implement than I thought. :)

import random
import string
from datetime import datetime
import pandas as pd
import numpy as np
from string_grouper import compute_pairwise_similarities

Create a Series with a few random strings:

# 20 random 10-character strings drawn from uppercase letters and digits
strings = [
    ''.join(random.choices(string.ascii_uppercase + string.digits, k=10))
    for _ in range(20)
]
good_series = pd.Series(strings, name='left')
good_series.to_frame()
left
0 6P1UMBC8D8
1 ONWZTJ53E1
2 TO7AADMIAD
3 6Y1QDGIKZ5
4 J53R2HZI96
5 Q383BO2VLK
6 0KINOSJ5JU
7 J8AHSMJNOE
8 IZL32I7VPC
9 9RHVQHA0N3
10 XUVDL96FDL
11 M7ROKPJ2IQ
12 MNXWZHRBPJ
13 1QSN3KG4DM
14 UW9EC83LDH
15 DHZLAQHUWI
16 M6HP4FH88Z
17 CNMKI44QWZ
18 DCVVKSSUO7
19 27B9P0B68L

Generate another Series of strings with some bad (non-string or empty string) values:

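# Sample 20 values from a pool of bad values (None, NaN, empty string,
# datetime, integers) mixed with the good strings
bad_series = pd.Series(
    random.choices(
        [None, np.nan, "", datetime.now()] * 5
        + strings
        + list(range(111, 115)),
        k=20,
    ),
    name='right',
).rename_axis('id')
bad_series.to_frame()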
right
id
0 MNXWZHRBPJ
1 M6HP4FH88Z
2 1QSN3KG4DM
3
4 None
5 2021-05-09 12:27:18.736565
6 2021-05-09 12:27:18.736565
7 2021-05-09 12:27:18.736565
8 DCVVKSSUO7
9 MNXWZHRBPJ
10 27B9P0B68L
11 IZL32I7VPC
12 UW9EC83LDH
13 112
14 MNXWZHRBPJ
15 1QSN3KG4DM
16 None
17 None
18 None
19 NaN

Calling compute_pairwise_similarities() on this data now raises a TypeError. Notice the error message after the traceback log:

compute_pairwise_similarities(good_series, bad_series)
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-10-56153281113f> in <module>
----> 1 compute_pairwise_similarities(good_series, bad_series)

~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in this(*args, **kwargs)
     61     # function "this" in the first parameter position
     62     def this(*args, **kwargs):
---> 63         return func(this, *args, **kwargs)
     64     return this
     65 

~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in compute_pairwise_similarities(this, string_series_1, string_series_2, **kwargs)
     86         this.issues = sg.issues
     87         this.issues.rename(f'Non-strings in Series {sname}', inplace=True)
---> 88         raise TypeError(sg.error_msg(sname, 'compute_pairwise_similarities'))
     89     return sg.dot()
     90 

TypeError: 

ERROR: Input pandas Series 'right' (string_series_2) contains values that are not strings!
Display the pandas Series 'compute_pairwise_similarities.issues' to find where these values are:
   Non-strings in Series 'right' (string_series_2)
id                                                
3                                                 
4                                             None
5                       2021-05-09 12:27:18.736565
6                       2021-05-09 12:27:18.736565
7                       2021-05-09 12:27:18.736565
13                                             112
16                                            None
17                                            None
18                                            None
19                                             NaN
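
The offending values are also stored on the function itself, so they can be displayed after the error:

compute_pairwise_similarities.issues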
id
3                               
4                           None
5     2021-05-09 12:27:18.736565
6     2021-05-09 12:27:18.736565
7     2021-05-09 12:27:18.736565
13                           112
16                          None
17                          None
18                          None
19                           NaN
Name: Non-strings in Series 'right' (string_series_2), dtype: object
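
A check of this kind can be approximated with plain pandas. The sketch below is not the code from this PR: `find_non_strings` is a hypothetical helper, written on the assumption that a "bad" value is anything that is not a string, plus the empty string, as described above.

```python
import numpy as np
import pandas as pd


def find_non_strings(series: pd.Series) -> pd.Series:
    """Return the entries of `series` that are not non-empty strings.

    Hypothetical helper for illustration; not part of string_grouper.
    """
    # isinstance() is checked first, so None, NaN, datetimes and numbers
    # are flagged without ever being compared to ""
    mask = np.array(
        [not isinstance(value, str) or value == "" for value in series],
        dtype=bool,
    )
    return series[mask]


# Small demonstration Series with the same kinds of bad values as above
demo = pd.Series(["MNXWZHRBPJ", "", None, np.nan, 112], name="right").rename_axis("id")
issues = find_non_strings(demo)
print(issues)
```

If dropping the bad rows is acceptable, something like `demo.drop(issues.index)` would leave only the usable strings before matching.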

Similar functionality exists for the other high-level functions: group_similar_strings(), match_most_similar() and match_strings().
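
The traceback above shows each high-level function receiving itself as a first argument named `this`, which is what lets it carry an `issues` attribute. A minimal sketch of that pattern follows. The names `passes_itself` and `check_strings` are hypothetical, and the function body is an assumption made for illustration, not string_grouper's actual implementation.

```python
import functools

import pandas as pd


def passes_itself(func):
    """Decorator: call `func` with its own wrapper as the first argument."""
    @functools.wraps(func)
    def this(*args, **kwargs):
        return func(this, *args, **kwargs)
    return this


@passes_itself
def check_strings(this, series):
    """Hypothetical stand-in for a high-level function."""
    mask = pd.Series(
        [not isinstance(value, str) or value == "" for value in series],
        index=series.index,
        dtype=bool,
    )
    bad = series[mask]
    if not bad.empty:
        # Store the offending values on the function object before raising,
        # so the caller can inspect them afterwards
        this.issues = bad.rename(f"Non-strings in Series '{series.name}'")
        raise TypeError(f"Series '{series.name}' contains values that are not strings")
    return series


try:
    check_strings(pd.Series(["abc", None, 3], name="right"))
except TypeError as err:
    print(err)
    print(check_strings.issues)
```

Because the attribute lives on the decorated function, any function wrapped the same way can report its own `issues` without extra return values.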

ParticularMiner commented 3 years ago

Hi @Bergvca

Just noticed you merged the other PR. If you intend to merge the next two, perhaps it would be best to start with this one as it has fewer changes than the other.

:)

Bergvca commented 3 years ago

ok thanks, will do :)