There are 3 cases where we need to compare words:
Scoring an article by applying a weight to every word in the text.
Counting the number of unique words and determining their term frequency to build a word cloud.
Removing stop words from a text before determining the word frequencies.
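To make the three cases concrete, here is a rough sketch that assumes the text has already been split into normalized words. The function names, the stop-word set and the weights are all hypothetical and only serve as illustration.

```python
from collections import Counter

# Hypothetical example data: neither the stop words nor the weights
# come from the discussion, they only illustrate the idea.
STOP_WORDS = {"de", "het", "een", "en"}
WEIGHTS = {"verkiezingen": 2.0, "cd&v": 1.5}


def remove_stop_words(words):
    """Drop every word that appears in the stop-word set."""
    return [w for w in words if w not in STOP_WORDS]


def term_frequencies(words):
    """Count how often each remaining word occurs (input for a word cloud)."""
    return Counter(remove_stop_words(words))


def score_article(words, weights=WEIGHTS):
    """Sum the weight of every word; words without a weight count as 0."""
    return sum(weights.get(w, 0.0) for w in words)


words = ["de", "verkiezingen", "en", "cd&v", "verkiezingen"]
frequencies = term_frequencies(words)
score = score_article(words)
```

All three functions rely on plain equality between words, which is why the comparison rules matter.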
How exactly do we compare words? I would propose:
Case insensitive
Treat the following characters as part of a word: - and & (e.g. in names of political parties).
Drawbacks:
Including & will match political parties such as CD&V, but I see no obvious way to match SP.A: including a dot would also attach that character to the last word of each sentence.
Frequency counts will treat inflected forms such as Grieks and Griekse as 2 different words.
Possible difficulties with special characters in names.
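The proposal and its drawbacks could be sketched roughly as follows. This is only one possible reading: the regular expression and the name `tokenize` are my own assumptions, and `casefold()` is just one way to get a case-insensitive comparison.

```python
import re

# Assumption: a word is a run of letters or digits, optionally joined
# by "-" or "&". [^\W_] matches any Unicode letter or digit, so
# accented characters in names stay part of the word.
WORD_RE = re.compile(r"[^\W_]+(?:[-&][^\W_]+)*")


def tokenize(text):
    """Split text into lower-cased words, keeping inner '-' and '&'."""
    return WORD_RE.findall(text.casefold())


# CD&V survives as one word, but SP.A is split on the dot.
parties = tokenize("CD&V en SP.A zijn partijen.")

# No stemming: both inflected forms remain separate words.
greek = tokenize("Grieks, Griekse")
```

With this pattern, `parties` contains `cd&v` as a single word while `SP.A` ends up as `sp` and `a`, and `greek` keeps `grieks` and `griekse` apart, which matches the drawbacks listed above.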
These comparison rules would apply to all 3 cases listed at the top.