OpenRefine / OpenRefine

OpenRefine is a free, open source power tool for working with messy data and improving it
https://openrefine.org/
BSD 3-Clause "New" or "Revised" License

Specify stop words to remove for key collision + fingerprint method #3200

Open plopenrefine opened 4 years ago

plopenrefine commented 4 years ago

Hello,

For my specific needs, I need to make a small adjustment to the way the key collision + fingerprint algorithm works. In one step of creating the clusters, the algorithm removes all punctuation and control characters. In that step, I also need to remove some stop words, for example "press", "and", "co", "the", and "a".

It would be great if the clustering algorithm were made modular, so that it could accept additional values to ignore during cluster creation.

The way that the key collision + fingerprint method works to create clusters is:

1. remove leading and trailing whitespace
2. change all characters to their lowercase representation
3. remove all punctuation and control characters (in this step I need to add some stop words that will not be taken into account)
4. normalize extended western characters to their ASCII representation (for example "gödel" → "godel")
5. split the string into whitespace-separated tokens
6. sort the tokens and remove duplicates
7. join the tokens back together
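The steps above, with an optional stop-word set, can be sketched roughly as follows. This is a minimal illustration and not OpenRefine's actual `FingerprintKeyer`: the class name `FingerprintSketch`, the `fingerprint` method, and the `stopWords` parameter are all hypothetical, and the ASCII folding here only strips combining accents (it does not handle letters such as "ß" or "ø"). Stop words are dropped after tokenizing, on the assumption that only whole tokens should be ignored.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the fingerprint steps; not OpenRefine's implementation.
public class FingerprintSketch {

    static String fingerprint(String value, Set<String> stopWords) {
        // Steps 1-2: trim and lowercase.
        String s = value.trim().toLowerCase(Locale.ROOT);
        // Step 3: remove ASCII punctuation and control characters
        // (note that tabs and newlines count as control characters here).
        s = s.replaceAll("[\\p{Punct}\\p{Cntrl}]", "");
        // Step 4 (simplified): decompose accented letters, then drop the combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        // Steps 5-6: split on whitespace; a TreeSet sorts and removes duplicates.
        TreeSet<String> tokens = new TreeSet<>(Arrays.asList(s.split("\\s+")));
        tokens.remove("");           // split can yield an empty token
        // Assumed extension: ignore stop words as whole tokens only.
        tokens.removeAll(stopWords);
        // Step 7: join the tokens back together.
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("press", "and", "co", "the", "a"));
        System.out.println(fingerprint("Taylor and Francis", stopWords));
        System.out.println(fingerprint("Taylor Francis", stopWords));
    }
}
```

Because the stop words only affect the key and not the cell contents, both sample values would map to the same key while their original text stays unchanged in the column.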

I do not want to remove the stop words from the column I am clustering before the algorithm runs. I just need to find the clusters and see the original values in each one, without ending up with "Taylor Francis" and "Taylor and Francis" in two different clusters.

I need them in the same cluster, merged as one value, "Taylor Francis".

wetneb commented 4 years ago

Makes sense. I can see two approaches:

plopenrefine commented 4 years ago

I understand that adding this feature to the GUI may take some time. Which line should one alter in the current version's code to remove the following stop words along with the punctuation: "a", "and", "co", etc.? Thank you in advance.

wetneb commented 4 years ago

You could add that here: https://github.com/OpenRefine/OpenRefine/blob/c76e2b9a461ed5b353ebf5c80e0e0cad2163331c/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java#L93

plopenrefine commented 4 years ago

Could someone show me an example of how the above line should be altered?

If I take, say, the following line and alter it so that it replaces "and" with an empty string, would I not achieve what I want?

https://github.com/OpenRefine/OpenRefine/blob/c76e2b9a461ed5b353ebf5c80e0e0cad2163331c/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java#L57
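One thing to watch with the "replace and with an empty string" idea, independent of what the linked lines actually contain: a plain substring replacement also hits "and" inside longer words. The sketch below contrasts that with a whole-word regex. The class and method names are hypothetical and are not part of OpenRefine.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical comparison of two ways to drop stop words from a lowercased string.
public class StopWordReplaceSketch {

    // Plain substring replacement: also removes "and" inside words such as "sanders".
    static String removeSubstring(String s, String stopWord) {
        return s.replace(stopWord, "");
    }

    // Whole-word replacement: \b anchors each alternative at word boundaries.
    static String removeWholeWords(String s, List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)                  // treat each stop word literally
                .collect(Collectors.joining("|"));
        return s.replaceAll("\\b(?:" + alternation + ")\\b", "");
    }

    public static void main(String[] args) {
        String value = "sanders and sons";
        System.out.println("[" + removeSubstring(value, "and") + "]");
        System.out.println("[" + removeWholeWords(value, Arrays.asList("and", "co")) + "]");
    }
}
```

Either variant leaves extra spaces behind, which is harmless if the result is later split on whitespace, as in the tokenizing step of the fingerprint method.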