import random
import string
from datetime import datetime
import pandas as pd
import numpy as np
from string_grouper import compute_pairwise_similarities
Create a Series with a few random strings:
strings = [ ''.join(random.choices(string.ascii_uppercase + string.digits, k=10)) for i in range(20) ]
good_series = pd.Series(strings, name='left')
good_series.to_frame()
          left
0   6P1UMBC8D8
1   ONWZTJ53E1
2   TO7AADMIAD
3   6Y1QDGIKZ5
4   J53R2HZI96
5   Q383BO2VLK
6   0KINOSJ5JU
7   J8AHSMJNOE
8   IZL32I7VPC
9   9RHVQHA0N3
10  XUVDL96FDL
11  M7ROKPJ2IQ
12  MNXWZHRBPJ
13  1QSN3KG4DM
14  UW9EC83LDH
15  DHZLAQHUWI
16  M6HP4FH88Z
17  CNMKI44QWZ
18  DCVVKSSUO7
19  27B9P0B68L
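The strings above change on every run. For anyone who wants to reproduce the example exactly, a seeded generator is one option. This is only a sketch: `random_strings` and its `seed` parameter are hypothetical helpers, not part of string_grouper or of this PR.

```python
import random
import string


def random_strings(n=20, k=10, seed=None):
    """Return n random strings of length k (hypothetical helper).

    A private random.Random instance keeps the global RNG state untouched.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_uppercase + string.digits
    return ["".join(rng.choices(alphabet, k=k)) for _ in range(n)]


# Same seed, same strings on every run
strings = random_strings(seed=42)
```

The resulting list can be passed to `pd.Series(strings, name='left')` exactly as before.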
Generate another Series of strings with some bad (non-string or empty string) values:
bad_series = pd.Series(
    random.choices(
        [None, np.nan, "", datetime.now()] * 5
        + strings
        + list(range(111, 115)),
        k=20
    ),
    name='right'
).rename_axis('id')
bad_series.to_frame()
Passing both Series to compute_pairwise_similarities() now raises a TypeError:
compute_pairwise_similarities(good_series, bad_series)
Notice the error message after the traceback log:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-56153281113f> in <module>
----> 1 compute_pairwise_similarities(good_series, bad_series)
~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in this(*args, **kwargs)
     61         # function "this" in the first parameter position
     62         def this(*args, **kwargs):
---> 63             return func(this, *args, **kwargs)
     64         return this
     65
~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in compute_pairwise_similarities(this, string_series_1, string_series_2, **kwargs)
     86             this.issues = sg.issues
     87             this.issues.rename(f'Non-strings in Series {sname}', inplace=True)
---> 88             raise TypeError(sg.error_msg(sname, 'compute_pairwise_similarities'))
     89         return sg.dot()
     90
TypeError:
ERROR: Input pandas Series 'right' (string_series_2) contains values that are not strings!
Display the pandas Series 'compute_pairwise_similarities.issues' to find where these values are:
Non-strings in Series 'right' (string_series_2)
id
3
4 None
5 2021-05-09 12:27:18.736565
6 2021-05-09 12:27:18.736565
7 2021-05-09 12:27:18.736565
13 112
16 None
17 None
18 None
19 NaN
compute_pairwise_similarities.issues
id
3
4 None
5 2021-05-09 12:27:18.736565
6 2021-05-09 12:27:18.736565
7 2021-05-09 12:27:18.736565
13 112
16 None
17 None
18 None
19 NaN
Name: Non-strings in Series 'right' (string_series_2), dtype: object
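To illustrate the kind of check that produces a listing like the one above, here is a minimal pandas-only sketch. It assumes that a "good" value is a non-empty `str`, as the description of the bad Series suggests. `find_bad_values` is a hypothetical name and this is not necessarily how string_grouper implements the check.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def find_bad_values(series: pd.Series) -> pd.Series:
    """Return the entries that are not non-empty strings (hypothetical helper)."""
    is_good = series.map(lambda v: isinstance(v, str) and len(v) > 0).astype(bool)
    return series[~is_good]


right = pd.Series(
    ["ONWZTJ53E1", "", None, np.nan, datetime(2021, 5, 9), 112, "J53R2HZI96"],
    name="right",
).rename_axis("id")

bad = find_bad_values(right)    # same role as the reported `issues` Series
clean = right.drop(bad.index)   # keep only the usable strings

print(list(bad.index))  # [1, 2, 3, 4, 5]
print(clean.tolist())   # ['ONWZTJ53E1', 'J53R2HZI96']
```

In the same way, the index of `compute_pairwise_similarities.issues` could be used to drop or repair the offending rows before calling the function again.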
Similar functionality exists for the other high-level functions: group_similar_strings(), match_most_similar() and match_strings().
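The traceback shows a wrapper named `this` that passes itself to the wrapped function as its first argument, which is what lets the function store results on itself (here, `issues`). The sketch below reproduces that pattern in isolation so it is easier to see how several functions could share it. The names `pass_self`, `_non_strings` and `count_strings` are hypothetical stand-ins, not string_grouper code.

```python
import functools

import pandas as pd


def pass_self(func):
    """Decorator: call `func` with the wrapper itself as first argument."""
    @functools.wraps(func)
    def this(*args, **kwargs):
        return func(this, *args, **kwargs)
    return this


def _non_strings(series):
    """Entries that are not non-empty strings (assumed definition of 'bad')."""
    ok = series.map(lambda v: isinstance(v, str) and len(v) > 0).astype(bool)
    return series[~ok]


@pass_self
def count_strings(this, series):
    """Hypothetical stand-in for a high-level function."""
    this.issues = None  # reset on every call
    bad = _non_strings(series)
    if not bad.empty:
        # Store the offending rows on the function object, then fail loudly
        this.issues = bad.rename(f"Non-strings in Series '{series.name}'")
        raise TypeError(f"Series '{series.name}' contains values that are not strings")
    return len(series)


try:
    count_strings(pd.Series(["abc", None, 112], name="right"))
except TypeError:
    print(count_strings.issues)  # rows 1 and 2
```

Because the state lives on the function object, each decorated function gets its own `issues` attribute without any change to its call signature.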
Just noticed you merged the other PR. If you intend to merge the next two, perhaps it would be best to start with this one as it has fewer changes than the other.
Notify: @Bergvca
Here I suggest a solution to one user's problem (see https://github.com/Bergvca/string_grouper/pull/43#issuecomment-824591895). It was a bit more difficult to implement than I thought. :)