Open ZhihaoMa opened 3 years ago
Hi @ZhihaoMa
Thanks for your interest in string_grouper.
Before now, have you used dask DataFrames with string_grouper with success? The reason I ask is to find out if the error is being caused by your use of dask rather than Chinese characters.
I use dask DataFrames because the CSV file is too large (~20 GB). When I use pandas directly (pd.read_csv), the errors are:
Traceback (most recent call last):
File "C:/Users/acemec/Documents/firm_data/name_match.py", line 9, in
@ZhihaoMa
I understand that.
What I want to know is this: when you use a pandas DataFrame with a small dataset of Chinese strings, does string_grouper work or not?
If it works, then the problem is coming from dask, not the Chinese characters.
If it does not work, then the problem is the Chinese characters.
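A quick way to run that check is a tiny in-memory Series, with no CSV file or dask involved. The sketch below assumes only that the TypeError means what it says (the input must be a pandas.Series whose elements are all str). `looks_like_string_series` is a hypothetical helper written for illustration, not string_grouper's own code, and the company names are made-up sample values.

```python
import pandas as pd


def looks_like_string_series(obj) -> bool:
    """Hypothetical stand-in for the check implied by the TypeError message."""
    return isinstance(obj, pd.Series) and all(isinstance(x, str) for x in obj)


# Made-up sample data: a few Chinese company names in a plain pandas Series.
small = pd.Series(["华为技术有限公司", "华为技术公司", "腾讯科技有限公司"])
chinese_ok = looks_like_string_series(small)

# A missing value is stored as a float NaN, which is not a str.
with_gap = pd.Series(["华为技术有限公司", float("nan")])
gap_ok = looks_like_string_series(with_gap)

print(chinese_ok, gap_ok)
```

If the first Series passes and the real column does not, the likelier culprits are missing values or a non-pandas container rather than the Chinese characters themselves.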
@ZhihaoMa
string_grouper was not made with dask in mind. That being said, I see that considering dask as a viable alternative to pandas would be very useful. Perhaps a future version of string_grouper will support it.
So I would be very grateful if you could let me know the answer to the above question, to know how best to incorporate dask into string_grouper.
@ParticularMiner Sorry for responding late. The package works well for Chinese files after encoding. But I find it doesn't support dask. When using dd.read_csv, I find:
TypeError: Input does not consist of pandas.Series containing only Strings
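One possible workaround, sketched under assumptions, is to skip dask and read only the column that is needed with pandas, in chunks, so that the result is an ordinary pandas.Series of strings. `read_string_column` is a hypothetical helper, and the in-memory CSV stands in for the real file; whether a single column of a ~20 GB file fits in memory depends on the data.

```python
import io

import pandas as pd


def read_string_column(source, column, chunksize=100_000):
    """Read one CSV column as a pandas.Series of str, chunk by chunk."""
    chunks = pd.read_csv(source, usecols=[column], dtype=str, chunksize=chunksize)
    # Drop missing values so that only real strings remain.
    parts = [chunk[column].dropna() for chunk in chunks]
    return pd.concat(parts, ignore_index=True)


# Made-up stand-in for the real CSV file; the second row has an empty name.
demo_csv = io.StringIO(
    "company_name,year\n华为技术有限公司,2019\n,2020\n腾讯科技有限公司,2021\n"
)
company_name = read_string_column(demo_csv, "company_name", chunksize=2)
print(type(company_name).__name__, len(company_name))
```

Calling `.compute()` on the dask column would also return a pandas.Series, provided the result fits in memory.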
Thanks @ZhihaoMa
I will take a closer look at dask. Or have you found another way?
Hello @ZhihaoMa, how did you go about the encoding? Can you explain exactly what you did? I am facing the same issue.
Encountered this error and solved it by adding .astype(str):
companyNames = names['Name'].astype(str).drop_duplicates()
df = sg.match_strings(companyNames)
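A caveat on this fix: a column read from CSV often contains missing values, and depending on the pandas version `.astype(str)` may turn them into the literal string 'nan', which would then be matched like any other name. A minimal sketch of a more defensive variant drops missing values first; the DataFrame contents here are made up for illustration.

```python
import pandas as pd

# Made-up sample: a name column with a missing value, a duplicate and a number.
names = pd.DataFrame(
    {"Name": ["Acme Ltd", float("nan"), "Acme Ltd", 12345, "Beta GmbH"]}
)

# Remove missing values before casting, then cast and de-duplicate.
company_names = names["Name"].dropna().astype(str).drop_duplicates()
all_str = all(isinstance(x, str) for x in company_names)

print(company_names.tolist(), all_str)
# company_names can then be passed to sg.match_strings(...) as above.
```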
Thanks!
Hi, I am trying to match Chinese firm names and get the following error:
File "C:/Users/acemec/Documents/firm_data/name_match.py", line 14, in
matches = match_most_similar(companies['company_name'], new_companies['assignee'], ignore_index=True)
File "C:\Users\acemec\anaconda3\lib\site-packages\string_grouper\string_grouper.py", line 108, in match_most_similar
string_grouper = StringGrouper(master,
File "C:\Users\acemec\anaconda3\lib\site-packages\string_grouper\string_grouper.py", line 218, in __init__
raise TypeError('Input does not consist of pandas.Series containing only Strings')
TypeError: Input does not consist of pandas.Series containing only Strings
Here is my code:
import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, compute_pairwise_similarities, StringGrouper
import dask.dataframe as dd
company_names = 'C:/Users/acemec/Documents/firm_data/company_annual.csv'
companies = dd.read_csv(company_names, on_bad_lines='skip', dtype=str, low_memory=False)
new_companies_name = 'C:/Users/acemec/Documents/firm_data/Pat_firm_list.csv'
new_companies = dd.read_csv(new_companies_name, on_bad_lines='skip', dtype=str, low_memory=False)
matches = match_most_similar(companies['company_name'], new_companies['assignee'], ignore_index=True)
match_result = pd.concat([new_companies, matches], axis=1)
df = pd.DataFrame(match_result)
df.to_csv('C:/Users/acemec/Documents/firm_data/file_name.csv', encoding='utf-8')
Could you give me some suggestions?