Lyonk71 / pandas-dedupe

Simplifies use of the Dedupe library via Pandas

ValueError: Cannot take a larger sample than population when replace is False #55

Open SSMK-wq opened 2 years ago

SSMK-wq commented 2 years ago

I am trying to dedupe my dataframe, which has a column named Test_names. It has only around 40 rows.

So I tried the code below from this tutorial: https://pypi.org/project/pandas-dedupe/

import pandas as pd
import pandas_dedupe

df = pd.read_excel('names.xlsx')
df_clean = pandas_dedupe.dedupe_dataframe(df, ['Test_names'])

I got the following error:

ValueError: Cannot take a larger sample than population when replace is False

I also tried the following:

df_clean = pd.read_excel('clean_names.xlsx')
df_messy = pd.read_excel('test_names.xlsx')

# initiate deduplication
df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'Test_names', canonicalize=True)

and got the same error:

ValueError: Cannot take a larger sample than population when replace is False
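The message has the same wording NumPy uses when a caller asks for a sample without replacement that is larger than the population. Whether dedupe reaches the error through this exact call is an assumption, not something confirmed in this thread. The minimal sketch below reproduces the condition with NumPy alone; the helper names and the sizes 40 and 100 are made up for illustration.

```python
import numpy as np


def sample_without_replacement(population_size, sample_size, seed=0):
    """Draw sample_size distinct indices from range(population_size)."""
    rng = np.random.default_rng(seed)
    # replace=False means every drawn index must be unique, so the
    # request cannot exceed the number of items available.
    return rng.choice(population_size, size=sample_size, replace=False)


def capped_sample(population_size, sample_size, seed=0):
    """Same draw, but never ask for more items than exist (hypothetical helper)."""
    capped = min(sample_size, population_size)
    return sample_without_replacement(population_size, capped, seed)


# Asking for 100 unique items out of 40 fails.
try:
    sample_without_replacement(40, 100)
except ValueError as err:
    print(f"ValueError: {err}")

# Capping the request at the population size succeeds.
print(len(capped_sample(40, 100)))
```

If the cause is indeed an oversized sample request, a small input such as a 40-row dataframe would trigger it more easily than a large one, although the later report with ~40K records suggests that size alone may not explain every case.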

quancore commented 2 years ago

same problem @Lyonk71 @ieriii

ieriii commented 2 years ago

Thanks for reporting this. I had a look, and the easiest fix is to downgrade dedupe to version 2.0.13.

I'll take a closer look at the latest release of dedupe (version 2.0.14) and see how we can ensure compatibility. Let me know if the downgrade works or if you have any further questions.
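To see whether the downgrade applies to a given environment, one option is to check the installed dedupe version first. This is only a sketch: the helper names are hypothetical, the parser assumes a plain numeric x.y.z version string, and the only facts taken from this thread are the version numbers 2.0.13 and 2.0.14.

```python
from importlib.metadata import PackageNotFoundError, version

LAST_KNOWN_GOOD = "2.0.13"  # version suggested in the comment above


def parse_version(text):
    """Turn 'x.y.z' into a tuple of ints (assumes purely numeric parts)."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_newer_than_known_good(installed, known_good=LAST_KNOWN_GOOD):
    """True if the installed version is later than the known-good one."""
    return parse_version(installed) > parse_version(known_good)


def installed_dedupe_version():
    """Return the installed dedupe version string, or None if absent."""
    try:
        return version("dedupe")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_dedupe_version()
    if found is None:
        print("dedupe is not installed")
    elif is_newer_than_known_good(found):
        print(f"dedupe {found} is newer than {LAST_KNOWN_GOOD}; "
              f"consider: pip install dedupe=={LAST_KNOWN_GOOD}")
    else:
        print(f"dedupe {found} is at or below {LAST_KNOWN_GOOD}")
```

The comparison uses integer tuples rather than strings so that, for example, 2.0.9 sorts before 2.0.13.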

sarbaniAi commented 2 years ago

Hi all, I am using the PostgreSQL approach with my own data (~40K records), and I am getting the same error: "ValueError: Cannot take a larger sample than population when replace is False".