Yomguithereal / talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
https://yomguithereal.github.io/talisman/
MIT License

Clustering methods #136

Open Yomguithereal opened 7 years ago

Yomguithereal commented 7 years ago

Hello @harold. Can you tell me a bit more about the strings you are matching in your use-cases? Length? Language etc.

harold commented 7 years ago

Thanks for your interest, and for these libraries. I really respect what you're doing.

There are two main use cases that I have used clj-fuzzy heavily for:

1) A customer of ours had created, by hand, a few hundred spreadsheets tracking trends in an industry over time. The strings were things like company names and attributes (indicating which attributes were a focus of which companies over time), so they were short; length < 100. The problem was that in each sheet the names of the companies and attributes were hand-typed and so naturally contained variation in spelling, abbreviations, etc. We made a unified and normalized database that enabled analysis and exploration of the data that wasn't really possible when the data was in hundreds of sheets.

2) In a separate financial project I used clj-fuzzy to cross-reference various datasets by company name (essentially joining several sheets by reconciling different identifiers through company names). Company names are also short; length < 100. In this case performance was closer to being a problem since a few of these datasets were medium sized (thousands or tens-of-thousands of rows) and joining them meant m*n queries that involved millions of invocations of jaro-winkler.

Re: Language, this was all in English, but with some non-dictionary words (e.g., made up attributes and company names).

Yomguithereal commented 7 years ago

First of all, are you trying to fuzzily match names so that a human operator will explicitly decide how to harmonize them, or are your scripts taking the decision on their own according to some arbitrary threshold, because precision might not be such a critical issue for later computations/aggregations?

Anyhow, one way to speed up the process (but maybe you already used this method somehow), is to aggressively normalize strings using a fingerprint method.

The idea is to:

1. trim and lowercase the string,
2. strip punctuation and control characters,
3. split it into whitespace-separated tokens,
4. deduplicate and sort the tokens,
5. join them back together (optionally normalizing accented characters to their ASCII counterparts).

That's the general idea but you can easily adapt the method to better fit your data.

Then you can just put strings into the same cluster when they share the same fingerprint. Just use a multimap or equivalent for that. It works in linear time, is very cheap to compute and usually has good precision.
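For illustration, a minimal sketch of that fingerprint-plus-multimap clustering in JavaScript (the fingerprint function below is hand-rolled for the example rather than taken from the library, which ships its own keyers):

```js
// Standard fingerprint keyer: trim, lowercase, strip diacritics and
// punctuation, then sort and deduplicate the word tokens.
function fingerprint(str) {
  const tokens = str
    .trim()
    .toLowerCase()
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '') // strip diacritics
    .replace(/[^a-z0-9\s]/g, ' ')    // drop punctuation
    .split(/\s+/)
    .filter(Boolean);

  return Array.from(new Set(tokens)).sort().join(' ');
}

// Multimap: fingerprint -> list of original strings sharing it.
function clusterByFingerprint(strings) {
  const clusters = new Map();

  for (const s of strings) {
    const key = fingerprint(s);
    if (!clusters.has(key)) clusters.set(key, []);
    clusters.get(key).push(s);
  }

  return clusters;
}

clusterByFingerprint(['Acme Corp.', 'ACME corp', 'Corp, Acme', 'Globex']);
// => Map { 'acme corp' => ['Acme Corp.', 'ACME corp', 'Corp, Acme'],
//          'globex'    => ['Globex'] }
```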

Yomguithereal commented 7 years ago

One other way, if you need to keep distance computations because of typos etc., is either to perform some kind of clever grouping of the strings before running the pairwise computations, to reduce the bounds of the quadratic execution (this is called blocking, or sorted neighborhood), or to use data structures able to perform efficient nearest-neighbor queries in metric space (a Vantage Point tree, for instance).

Example: you can group your strings by matching 6-grams and then perform pairwise computations within each group, greatly reducing the needed computations since you won't compare obviously different strings.
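A rough sketch of that kind of n-gram blocking, with the similarity function (e.g. a Jaro-Winkler implementation) and the threshold passed in as parameters rather than assumed:

```js
// Strings sharing at least one character n-gram land in the same block, and
// pairwise similarity is only computed inside blocks.
function ngrams(str, n) {
  const grams = [];
  for (let i = 0; i + n <= str.length; i++) grams.push(str.slice(i, i + n));
  return grams;
}

function blockedPairs(strings, n, similarity, threshold) {
  // Multimap: n-gram -> indices of the strings containing it.
  const blocks = new Map();

  strings.forEach((s, i) => {
    for (const g of new Set(ngrams(s.toLowerCase(), n))) {
      if (!blocks.has(g)) blocks.set(g, []);
      blocks.get(g).push(i);
    }
  });

  const matches = [];
  const seen = new Set();

  for (const indices of blocks.values()) {
    for (let a = 0; a < indices.length; a++) {
      for (let b = a + 1; b < indices.length; b++) {
        const key = indices[a] + ':' + indices[b];
        if (seen.has(key)) continue; // don't compare the same pair twice
        seen.add(key);

        if (similarity(strings[indices[a]], strings[indices[b]]) >= threshold)
          matches.push([strings[indices[a]], strings[indices[b]]]);
      }
    }
  }

  return matches;
}
```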

You can see a lot of those methods implemented in the library here.

I can explain one of those methods better if you want me to.

Also, there is another method that runs in linear time and approximates pairwise Jaccard coefficient computations using a technique called minhashing. I can also explain it better if you so desire.
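A minimal sketch of the minhashing idea, assuming word tokens as the underlying sets and simple `(a * x + b) mod p` hash functions with coefficients kept small enough to stay within JavaScript's safe integer range:

```js
// Each string becomes a set of token hashes; k random hash functions keep
// only the minimum hash per function. The fraction of matching minimums
// between two signatures estimates the Jaccard coefficient of the two sets
// (and the signatures can further be banded for LSH).
const PRIME = 4294967311; // a prime larger than 2^32

function hashString(s) {
  // Simple 32-bit FNV-1a string hash.
  let h = 2166136261;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

function makeHashFunctions(k) {
  const fns = [];
  for (let i = 0; i < k; i++) {
    const a = 1 + Math.floor(Math.random() * 65535);
    const b = Math.floor(Math.random() * 4294967296);
    fns.push(x => (a * x + b) % PRIME);
  }
  return fns;
}

function minhashSignature(tokens, hashFns) {
  const hashes = tokens.map(hashString);
  return hashFns.map(fn => Math.min(...hashes.map(fn)));
}

function estimatedJaccard(sigA, sigB) {
  let same = 0;
  for (let i = 0; i < sigA.length; i++) if (sigA[i] === sigB[i]) same++;
  return same / sigA.length;
}

const hashFns = makeHashFunctions(128);
const a = minhashSignature('acme holding corporation'.split(' '), hashFns);
const b = minhashSignature('acme holdings corp'.split(' '), hashFns);
estimatedJaccard(a, b); // ~ Jaccard coefficient of the two token sets
```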

harold commented 7 years ago

> First of all, are you trying to fuzzily match names so that a human operator will explicitly decide how to harmonize them, or are your scripts taking the decision on their own according to some arbitrary threshold, because precision might not be such a critical issue for later computations/aggregations?

I have done both. When the output was too big to possibly check each entry by hand, I would spot-check a few hundred samples and then conservatively set a threshold based on what I saw. I also once had a case where I knew approximately how many matches/clusters I wanted and set the threshold automatically to achieve that.


Thank you for the explanation of fingerprinting. I see how, if typos (e.g., transposed letters) are not a factor, fingerprinting can avoid the quadratic time complexity. I will keep this technique in mind for future cases.


Thank you also for the cogent explanation of 'blocking'... I had definitely considered this, but didn't know the term or any of the methods. I had thought of first blocking by lowercased first letter or first-two-letter pairs (assuming there are no typos in exactly those positions). Blocking cuts the number of pairwise comparisons dramatically.


And I also appreciate the pointer to minhashing; we use LSH in other contexts, but I had not seen that particular technique.

Thank you kindly for all this great information. If I can ever be of service to you, let me know.

Yomguithereal commented 7 years ago

> Thank you also for the cogent explanation of 'blocking'... I had definitely considered this, but didn't know the term or any of the methods. I had thought of first blocking by lowercased first letter or first-two-letter pairs (assuming there are no typos in exactly those positions). Blocking cuts the number of pairwise comparisons dramatically.

Using 5/6/7-grams in your case should yield better precision. There are some papers out there that demonstrate why but I cannot find the references easily. Blocking by first letter or letters is also a good pick since it was demonstrated that typos very rarely occur on first letters (but this is more true for dictionary words than for names).

> Thank you kindly for all this great information. If I can ever be of service to you, let me know.

Just come back to me if you find some other ways to cut computations or enhance the precision of the matching algorithms etc. Record linkage is kind of a pet subject of mine and I love finding new things.

harold commented 7 years ago

When you say 5/6/7-grams in this context are you talking about letters or words?

Yomguithereal commented 7 years ago

For 5/6/7-grams, letters. If you work at the word level, I doubt increasing n beyond 3 would yield good results given your use case.

Yomguithereal commented 7 years ago

There is also a scheme called skip-grams that is becoming quite popular, though it is obviously more expensive to compute. But 1-skip-2-grams or 1-skip-3-grams could yield good results on your company names' word tokens.
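For illustration, a minimal sketch of k-skip-bigrams over word tokens:

```js
// k-skip-bigrams: pairs of tokens allowed to skip up to k intervening tokens,
// which makes blocking more tolerant to inserted or dropped words in names.
function skipBigrams(tokens, k) {
  const grams = [];
  for (let i = 0; i < tokens.length; i++) {
    for (let j = i + 1; j <= i + 1 + k && j < tokens.length; j++) {
      grams.push(tokens[i] + ' ' + tokens[j]);
    }
  }
  return grams;
}

skipBigrams('international business machines corporation'.split(' '), 1);
// => ['international business', 'international machines',
//     'business machines', 'business corporation',
//     'machines corporation']
```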