Open plopenrefine opened 4 years ago
Makes sense. I can see two approaches:
I understand that adding this feature to the GUI may take some time. Which line should one alter in the current version's code to add the following stop words along with the punctuation: "a", "and", "co", etc.? Thank you in advance.
Could someone show me an example of how the above line should be altered?
If I use, say, the following line and alter it so that it replaces "and" with an empty string, will I not achieve what I want?
Hello,
For my specific needs, I need to make a small adjustment to the way the key collision + fingerprint algorithm works. In one step of creating the clusters, the algorithm removes all punctuation and control characters. In that step, I also need to remove some stop words, for example "press", "and", "co", "the", "a".
It would be great if the clustering algorithm were made modular, so that it could accept additional values to ignore during cluster creation.
The key collision + fingerprint method creates clusters as follows:
1. remove leading and trailing whitespace
2. change all characters to their lowercase representation
3. remove all punctuation and control characters (this is the step where I need to also drop some stop words, so that they are not taken into account)
4. normalize extended western characters to their ASCII representation (for example "gödel" → "godel")
5. split the string into whitespace-separated tokens
6. sort the tokens and remove duplicates
7. join the tokens back together
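To make the steps above concrete, here is a minimal Python sketch of them. It is not OpenRefine's actual code (that is written in Java); the function name `fingerprint` is my own, ASCII punctuation from `string.punctuation` stands in for "all punctuation", and the ASCII normalization is approximated by stripping combining accents, so treat it only as an illustration.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Illustrative fingerprint key; an approximation, not OpenRefine's code."""
    # 1. Remove leading and trailing whitespace.
    s = value.strip()
    # 2. Change all characters to lowercase.
    s = s.lower()
    # 3. Remove ASCII punctuation and non-whitespace control characters.
    #    (Assumption: whitespace controls such as tabs stay as separators.)
    s = "".join(
        ch for ch in s
        if ch not in string.punctuation
        and not (unicodedata.category(ch) == "Cc" and not ch.isspace())
    )
    # 4. Approximate ASCII normalization: decompose, then drop combining marks.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # 5. Split into whitespace-separated tokens.
    tokens = s.split()
    # 6. Sort the tokens and remove duplicates.
    tokens = sorted(set(tokens))
    # 7. Join the tokens back together.
    return " ".join(tokens)


print(fingerprint("  Gödel, Kurt "))       # godel kurt
print(fingerprint("Taylor Francis"))       # francis taylor
print(fingerprint("Taylor and Francis"))   # and francis taylor
```

The last two keys differ only by the token "and", which is why the two names end up in different clusters.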
I do not want to remove the stop words from the column before clustering. I just need to find the clusters and see the original values in each cluster, but without having
"Taylor Francis" and "Taylor and Francis" in two different clusters.
I need them in the same cluster, as one value: "Taylor Francis".
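As a rough illustration of the behaviour I am asking for, the sketch below drops stop words only while building the clustering key, so the original cell values are left untouched. It is plain Python, not OpenRefine code; `STOP_WORDS`, `key_without_stop_words` and `cluster` are hypothetical names, and control-character removal is left out for brevity. Stop words are removed as whole tokens after splitting, so a name like "Sandford" is not damaged by removing "and".

```python
import string
import unicodedata
from collections import defaultdict

# Stop words taken from the request above; adjust as needed.
STOP_WORDS = {"press", "and", "co", "the", "a"}

# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)


def key_without_stop_words(value, stop_words=STOP_WORDS):
    """Fingerprint-style key that also ignores whole-token stop words."""
    s = value.strip().lower().translate(_PUNCT)
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # Drop stop words per token, never as substrings.
    tokens = {t for t in s.split() if t not in stop_words}
    return " ".join(sorted(tokens))


def cluster(values, stop_words=STOP_WORDS):
    """Group original values by key; keep groups with more than one distinct value."""
    groups = defaultdict(list)
    for v in values:
        groups[key_without_stop_words(v, stop_words)].append(v)
    return [members for members in groups.values() if len(set(members)) > 1]


names = ["Taylor Francis", "Taylor and Francis", "Sandford Press Co."]
print(cluster(names))  # [['Taylor Francis', 'Taylor and Francis']]
```

Both spellings share the key "francis taylor", so they land in one cluster while the column still shows the original values.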