Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

Error When matching Chinese name #73

Open ZhihaoMa opened 3 years ago

ZhihaoMa commented 3 years ago

Hi, I am trying to match Chinese firm names and I get the following error:

```
  File "C:/Users/acemec/Documents/firm_data/name_match.py", line 14, in <module>
    matches = match_most_similar(companies['company_name'], new_companies['assignee'], ignore_index=True)
  File "C:\Users\acemec\anaconda3\lib\site-packages\string_grouper\string_grouper.py", line 108, in match_most_similar
    string_grouper = StringGrouper(master,
  File "C:\Users\acemec\anaconda3\lib\site-packages\string_grouper\string_grouper.py", line 218, in __init__
    raise TypeError('Input does not consist of pandas.Series containing only Strings')
TypeError: Input does not consist of pandas.Series containing only Strings
```

Here is my code:

```python
import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, compute_pairwise_similarities, StringGrouper
import dask.dataframe as dd

company_names = 'C:/Users/acemec/Documents/firm_data/company_annual.csv'
companies = dd.read_csv(company_names, on_bad_lines='skip', dtype=str, low_memory=False)

new_companies_name = 'C:/Users/acemec/Documents/firm_data/Pat_firm_list.csv'
new_companies = dd.read_csv(new_companies_name, on_bad_lines='skip', dtype=str, low_memory=False)

matches = match_most_similar(companies['company_name'], new_companies['assignee'], ignore_index=True)

match_result = pd.concat([new_companies, matches], axis=1)

df = pd.DataFrame(match_result)
df.to_csv('C:/Users/acemec/Documents/firm_data/file_name.csv', encoding='utf-8')
```

Could you give me some suggestions?
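
[Editor's note] For context on the error message: string_grouper requires each input to be a `pandas.Series` whose elements are all strings, and a dask Series is not a `pandas.Series`. The sketch below loosely mimics that kind of input validation using pandas only; the helper name `looks_like_valid_input` is ours for illustration, not part of string_grouper's API.

```python
import pandas as pd

def looks_like_valid_input(obj):
    """Loosely mimics string_grouper's input check:
    the input must be a pandas Series containing only strings."""
    return isinstance(obj, pd.Series) and obj.map(type).eq(str).all()

print(looks_like_valid_input(pd.Series(["北京科技", "上海贸易"])))  # True
print(looks_like_valid_input(pd.Series(["北京科技", None])))        # False: contains a non-string
print(looks_like_valid_input(["北京科技", "上海贸易"]))             # False: a list, not a Series
```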

ParticularMiner commented 3 years ago

Hi @ZhihaoMa

Thanks for your interest in string_grouper.

Have you previously used dask DataFrames with string_grouper successfully? I ask in order to find out whether the error is caused by your use of dask rather than by the Chinese characters.

ZhihaoMa commented 3 years ago

I use dask DataFrames because the CSV file is too large (~20 GB). When I use pandas directly (pd.read_csv), I get:

```
Traceback (most recent call last):
  File "C:/Users/acemec/Documents/firm_data/name_match.py", line 9, in <module>
    companies = pd.read_csv(company_names, on_bad_lines='skip', dtype=str, low_memory=False)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 488, in _read
    return parser.read(nrows)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "C:\Users\acemec\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 228, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 783, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 872, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
```

ParticularMiner commented 3 years ago

@ZhihaoMa

I understand that.

What I want to know is this: when you use a pandas DataFrame with a small dataset of Chinese strings, does string_grouper work or not?

If it works, then the problem is coming from dask, not the Chinese characters.

If it does not work, then the problem is the Chinese characters.
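
[Editor's note] The suggested experiment might be sketched as follows, using a small in-memory pandas Series of hypothetical Chinese company names. The pandas-only part below first verifies that the sample meets string_grouper's stated input requirement; the actual string_grouper call is shown commented, to be run where the package is installed.

```python
import pandas as pd

# A small in-memory sample of Chinese company names (hypothetical values)
sample = pd.Series(["北京创新科技有限公司", "北京创新科技公司", "上海国际贸易有限公司"])

# Confirm the sample meets string_grouper's stated input requirement:
# a pandas Series containing only strings.
assert isinstance(sample, pd.Series)
assert sample.map(type).eq(str).all()

# With string_grouper installed, the experiment itself would then be:
# from string_grouper import match_strings
# print(match_strings(sample))
```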

ParticularMiner commented 3 years ago

@ZhihaoMa

string_grouper was not designed with dask in mind. That said, I can see that supporting dask as a viable alternative to pandas would be very useful. Perhaps a future version of string_grouper will support it.

So I would be very grateful if you could let me know the answer to the above question, so that I know how best to incorporate dask into string_grouper.

ZhihaoMa commented 3 years ago

@ParticularMiner Sorry for the late reply. The package works well with Chinese files once they are properly encoded. But I find it doesn't support dask: when I use dd.read_csv, I get:

```
TypeError: Input does not consist of pandas.Series containing only Strings
```

ParticularMiner commented 3 years ago

Thanks @ZhihaoMa

I will take a closer look at dask. Or have you found another way?

vherasme commented 2 years ago

Hello @ZhihaoMa, how did you go about the encoding? Can you explain exactly what you did? I am facing the same issue.

liri2006 commented 1 year ago

Encountered this error and solved it by adding `.astype(str)`:

```python
companyNames = names['Name'].astype(str).drop_duplicates()
df = sg.match_strings(companyNames)
```
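
[Editor's note] One caveat worth knowing about this fix (a pandas-only sketch with hypothetical values): `.astype(str)` silently converts missing values into the literal strings `'None'` or `'nan'`, which could then produce spurious matches. Dropping missing values before casting may be safer:

```python
import pandas as pd

names = pd.Series(["Acme Corp", None, "Beta Ltd"])

# astype(str) turns the missing value into the string 'None'
print(names.astype(str).tolist())           # ['Acme Corp', 'None', 'Beta Ltd']

# Dropping missing values first avoids matching on 'None'/'nan'
print(names.dropna().astype(str).tolist())  # ['Acme Corp', 'Beta Ltd']
```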
ZhihaoMa commented 1 year ago

> Encountered this error and solved it by adding `.astype(str)`:
>
> ```python
> companyNames = names['Name'].astype(str).drop_duplicates()
> df = sg.match_strings(companyNames)
> ```

Thanks!