created a more user-friendly error message when bad data is found

ParticularMiner commented 3 years ago

Notify: @Bergvca

Here I suggest a solution to one user's problem (see https://github.com/Bergvca/string_grouper/pull/43#issuecomment-824591895). It was a bit more difficult to implement than I thought. :)

import random
import string
from datetime import datetime
import pandas as pd
import numpy as np
from string_grouper import compute_pairwise_similarities

Create a Series with a few random strings:

# 20 random 10-character strings of uppercase letters and digits
strings = [''.join(random.choices(string.ascii_uppercase + string.digits, k=10)) for _ in range(20)]
good_series = pd.Series(strings, name='left')
good_series.to_frame()

	left
0	6P1UMBC8D8
1	ONWZTJ53E1
2	TO7AADMIAD
3	6Y1QDGIKZ5
4	J53R2HZI96
5	Q383BO2VLK
6	0KINOSJ5JU
7	J8AHSMJNOE
8	IZL32I7VPC
9	9RHVQHA0N3
10	XUVDL96FDL
11	M7ROKPJ2IQ
12	MNXWZHRBPJ
13	1QSN3KG4DM
14	UW9EC83LDH
15	DHZLAQHUWI
16	M6HP4FH88Z
17	CNMKI44QWZ
18	DCVVKSSUO7
19	27B9P0B68L

Generate another Series of strings with some bad (non-string or empty string) values:

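# Draw 20 values from a pool that mixes bad entries
# (None, NaN, empty string, datetime, integers) with the good strings
bad_series = pd.Series(
    random.choices(
        [None, np.nan, "", datetime.now()] * 5
        + strings
        + list(range(111, 115)),
        k=20,
    ),
    name='right',
).rename_axis('id')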
bad_series.to_frame()

	right
id
0	MNXWZHRBPJ
1	M6HP4FH88Z
2	1QSN3KG4DM
3
4	None
5	2021-05-09 12:27:18.736565
6	2021-05-09 12:27:18.736565
7	2021-05-09 12:27:18.736565
8	DCVVKSSUO7
9	MNXWZHRBPJ
10	27B9P0B68L
11	IZL32I7VPC
12	UW9EC83LDH
13	112
14	MNXWZHRBPJ
15	1QSN3KG4DM
16	None
17	None
18	None
19	NaN
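
For reference, the "bad" values here are everything that is not a non-empty string. The following is only a minimal sketch of how such entries could be located with plain pandas; it is not necessarily how string_grouper performs the check, and `find_bad_values` is a hypothetical helper name used for illustration:

```python
from datetime import datetime

import numpy as np
import pandas as pd


def find_bad_values(series: pd.Series) -> pd.Series:
    """Return the entries of `series` that are not non-empty strings.

    Hypothetical helper for illustration; not part of string_grouper.
    """
    # map() applies the check to every element, including None and NaN
    is_bad = series.map(lambda value: not isinstance(value, str) or value == "")
    return series[is_bad.astype(bool)]


# A small hand-made sample mixing good and bad values
sample = pd.Series(
    ["MNXWZHRBPJ", "", None, datetime(2021, 5, 9), 112, np.nan],
    name="right",
).rename_axis("id")

bad = find_bad_values(sample)
print(bad)
```

The returned Series keeps the original index, so the offending rows can be looked up or dropped in the source data before calling the library.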

Notice the error message after the traceback log:

compute_pairwise_similarities(good_series, bad_series)

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-10-56153281113f> in <module>
----> 1 compute_pairwise_similarities(good_series, bad_series)

~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in this(*args, **kwargs)
     61     # function "this" in the first parameter position
     62     def this(*args, **kwargs):
---> 63         return func(this, *args, **kwargs)
     64     return this
     65 

~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in compute_pairwise_similarities(this, string_series_1, string_series_2, **kwargs)
     86         this.issues = sg.issues
     87         this.issues.rename(f'Non-strings in Series {sname}', inplace=True)
---> 88         raise TypeError(sg.error_msg(sname, 'compute_pairwise_similarities'))
     89     return sg.dot()
     90 

TypeError: 

ERROR: Input pandas Series 'right' (string_series_2) contains values that are not strings!
Display the pandas Series 'compute_pairwise_similarities.issues' to find where these values are:
   Non-strings in Series 'right' (string_series_2)
id                                                
3                                                 
4                                             None
5                       2021-05-09 12:27:18.736565
6                       2021-05-09 12:27:18.736565
7                       2021-05-09 12:27:18.736565
13                                             112
16                                            None
17                                            None
18                                            None
19                                             NaN

compute_pairwise_similarities.issues

id
3                               
4                           None
5     2021-05-09 12:27:18.736565
6     2021-05-09 12:27:18.736565
7     2021-05-09 12:27:18.736565
13                           112
16                          None
17                          None
18                          None
19                           NaN
Name: Non-strings in Series 'right' (string_series_2), dtype: object

Similar functionality exists for the other high-level functions: group_similar_strings(), match_most_similar(), and match_strings().
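
The traceback above shows a wrapper named `this` that passes itself to the wrapped function as its first argument, which is what lets the function store `issues` as an attribute on itself before raising. Below is a simplified, standalone sketch of that pattern. The names `pass_self` and `validate` are hypothetical, the use of `functools.wraps` is my own addition, and the validation logic is a stand-in rather than the code in this PR:

```python
import functools


def pass_self(func):
    """Decorator that hands the wrapper function itself to `func` as its first argument."""
    @functools.wraps(func)
    def this(*args, **kwargs):
        return func(this, *args, **kwargs)
    return this


@pass_self
def validate(this, values):
    """Raise TypeError if any value is not a non-empty string.

    The offending values are recorded on the function object as `validate.issues`.
    """
    this.issues = {
        position: value
        for position, value in enumerate(values)
        if not isinstance(value, str) or value == ""
    }
    if this.issues:
        raise TypeError(
            f"{len(this.issues)} value(s) are not strings; inspect validate.issues"
        )
    return len(values)


try:
    validate(["abc", "", None, 112])
except TypeError as err:
    print(err)

# The attribute set inside the function is still available after the exception
print(validate.issues)
```

Storing the issues on the function object keeps the exception message short while still letting the caller retrieve the full list of offending values afterwards.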

ParticularMiner commented 3 years ago

Hi @Bergvca

I just noticed you merged the other PR. If you intend to merge the next two, perhaps it would be best to start with this one, as it has fewer changes than the other.

:)

Bergvca commented 3 years ago

ok thanks, will do :)

created a more user-friendly error message when bad data is found #56