dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

ConvergenceWarning during training #1091

Open · NickCrews opened this issue 2 years ago

NickCrews commented 2 years ago

I get this warning during the fitting of the linear model when performing a deduplication task:

```
/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
```

I am training on 800 records, manually labeled with cluster ids. Out of these 800*800 = 640,000 possible pairs, I'm sampling 50,000 using `dedupe.training_data_dedupe()` and feeding those 50k pairs to `Dedupe.train()`. After expanding the Missing, Categorical, and Interaction variables, the X array that the linear model sees has 32 columns.
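For context, here's a rough sketch of that setup. The field names and the tiny `records` dict are placeholders for my real data, and the variable list stands in for the fuller Missing/Categorical/Interaction setup:

```python
import dedupe

# Placeholder for the real 800 manually labeled records, keyed by record id.
# Each record carries a hand-assigned "cluster_id" marking true duplicates.
records = {
    1: {"name": "Acme Co", "address": "123 Main St", "cluster_id": "a"},
    2: {"name": "ACME Company", "address": "123 Main Street", "cluster_id": "a"},
    3: {"name": "Widget LLC", "address": None, "cluster_id": "b"},
    4: {"name": "Widget, L.L.C.", "address": "PO Box 7", "cluster_id": "b"},
}

# Placeholder variables; the real list also includes Missing, Categorical, and
# Interaction variables, which expand to 32 feature columns.
variables = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String", "has missing": True},
]

deduper = dedupe.Dedupe(variables)

# Build labeled match/distinct pairs from the cluster ids, sampling up to
# 50,000 of the possible pairs, then hand them to the deduper and train.
training_pairs = dedupe.training_data_dedupe(records, "cluster_id", training_size=50_000)
deduper.prepare_training(records)
deduper.mark_pairs(training_pairs)
deduper.train()  # the ConvergenceWarning shows up while this fits the classifier
```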

I know this isn't a reproducible example yet, but I was hoping to avoid the work of putting one together if the information above is enough to give you any insights. If needed, I can try to make something reproducible.

NickCrews commented 2 years ago

Ok, so if I go in and monkeypatch

https://github.com/dedupeio/dedupe/blob/220efe557cf91a9e82215443c1805bfeaf3e1860/dedupe/labeler.py#L72-L77

to `sklearn.linear_model.LogisticRegression(max_iter=1000)`, increasing `max_iter` from its default of 100 to 1000, the warning goes away.
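For anyone else hitting this, here is roughly how I'd apply that tweak without editing the installed package. This is just a sketch of my workaround, not anything dedupe provides: it assumes the linked `dedupe/labeler.py` code looks the class up through the `sklearn.linear_model` module at call time, as the fully qualified reference in those lines suggests.

```python
import functools

import sklearn.linear_model

# Keep a handle on the real class, then replace the module attribute with a
# partial that bakes in a higher iteration cap. Anything that later calls
# sklearn.linear_model.LogisticRegression() gets max_iter=1000 instead of 100.
_LogisticRegression = sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegression = functools.partial(
    _LogisticRegression, max_iter=1000
)

import dedupe  # noqa: E402  -- imported after the patch for clarity
```

Obviously a PR that just passes `max_iter=1000` in `labeler.py` would be simpler than this kind of patching.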

IDK if this has some downside. `LogisticRegression.fit()` takes 0.005 seconds without the tweak and about half a second with it, so it's slower but totally ignorable.

I'm getting the same accuracy score in both cases, but that score is measured after I do some post-processing cleanup, so I'm not sure it reflects the actual accuracy of the classifier. Intuitively, a classifier that hasn't converged shouldn't be as accurate.
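If it helps, a quick way to compare the two classifiers in isolation (before any clustering or post-processing) would be something like the following, where `X` and `y` are hypothetical stand-ins for the 50,000 x 32 distance matrix and the match/distinct labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins for the real pairwise features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 32))    # pairwise distance features from dedupe
y = rng.integers(0, 2, size=50_000)  # 1 = match, 0 = distinct

# Cross-validate the bare classifier at both iteration caps to see whether
# non-convergence actually costs any accuracy.
for max_iter in (100, 1000):
    clf = LogisticRegression(max_iter=max_iter)
    scores = cross_val_score(clf, X, y, cv=5, scoring="average_precision")
    print(max_iter, scores.mean())
```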

Want me to make a PR that increases `max_iter`? Or do you think there might be something else causing the problem? It makes me a little nervous that I might not be going after the root cause and that the real problem is sitting there unsolved (e.g. the warning tells you to look at pre-processing/scaling the data). But I don't see a downside to increasing `max_iter`.
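For reference, the scaling route the warning points at would look roughly like this. I haven't tried it, and it would only be a drop-in replacement if dedupe only calls `fit()` and `predict_proba()` on its classifier, which I haven't verified:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the distance features before the logistic regression, instead of
# (or in addition to) raising max_iter. The pipeline still exposes fit() and
# predict_proba(), so it behaves like a single classifier object.
scaled_classifier = make_pipeline(StandardScaler(), LogisticRegression())
```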

fgregg commented 2 years ago

i think this warning is not really a problem. typically, when you don't have convergence, it acts like a regularizer. i don't have a problem with increasing max_iter, but there will still be some cases where this warning appears.