How to compare words - Githubissues

Datafable / epu-index

EPU index

http://www.applieddatamining.com/cms/?q=content/economic-policy-uncertainty-index


How to compare words #56

Closed bartaelterman closed 9 years ago

bartaelterman commented 9 years ago

There are 3 cases where we need to compare words:

Scoring an article by applying a weight to every word in the text.
Counting the number of unique words and determining their term frequency to build a word cloud.
Removing stop words from a text before determining the word frequencies.

How exactly do we compare words? I would propose:

Case insensitive
Treat the following characters as part of a word: - and & (e.g. in names of political parties).
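
A minimal sketch of what this proposal could look like in Python. The pattern and the `tokenize` helper are assumptions for illustration, not code from the repository:

```python
import re

# Hypothetical token pattern: runs of word characters plus '-' and '&'.
# The hyphen is placed last in the character class so it is read literally.
TOKEN_RE = re.compile(r"[\w&-]+")


def tokenize(text):
    """Lowercase the text and return its tokens, keeping '-' and '&'."""
    return TOKEN_RE.findall(text.lower())


tokens = tokenize("CD&V en Open-VLD stemden tegen.")
print(tokens)
```

Note that with this pattern a free-standing hyphen (as in "a - b") would also come out as a token of its own, so some extra filtering may be needed.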

Drawbacks:

Including & will match political parties such as CD&V, but I see no obvious way to match SP.A: if we also include the dot, it would stay attached to the last word of every sentence.
Frequency counts will consider Grieks and Griekse as 2 different words.
Possible difficulties with special characters in names.
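
The first two drawbacks can be reproduced with a small sketch. Both patterns and the example sentence are made up for illustration:

```python
import re
from collections import Counter

# Hypothetical patterns: one without and one with the dot as a word character.
WITHOUT_DOT = re.compile(r"[\w&-]+")
WITH_DOT = re.compile(r"[\w&.-]+")


def tokenize(text, pattern):
    """Lowercase the text and return all matches of the given pattern."""
    return pattern.findall(text.lower())


sentence = "De SP.A steunt het Griekse plan. Een Grieks voorstel volgt."

# Without the dot, SP.A is split into two tokens.
without_dot = tokenize(sentence, WITHOUT_DOT)

# With the dot, SP.A survives, but sentence-final words keep their full stop.
with_dot = tokenize(sentence, WITH_DOT)

# Inflected forms are counted separately in both cases.
counts = Counter(without_dot)
print(without_dot)
print(with_dot)
print(counts["grieks"], counts["griekse"])
```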

bartaelterman commented 9 years ago

Text will be cleaned first to remove punctuation (see #55). All words are then set to lowercase and compared.
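
A rough sketch of how the three cases could work on top of this approach. The cleaning rule (replace every ASCII punctuation character with a space), the stop word list, and the weights are all assumptions; the actual cleaning is defined in #55:

```python
import string
from collections import Counter

# Assumption: cleaning replaces each ASCII punctuation character with a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Hypothetical stop words and word weights, for illustration only.
STOP_WORDS = {"de", "het", "een"}
WEIGHTS = {"onzekerheid": 2.0, "economie": 1.0}


def clean(text):
    """Strip punctuation, lowercase, and split on whitespace."""
    return text.translate(_PUNCT_TO_SPACE).lower().split()


def term_frequencies(text):
    """Count words after removing stop words (word cloud input)."""
    return Counter(word for word in clean(text) if word not in STOP_WORDS)


def score(text):
    """Sum the weight of every word in the text; unknown words weigh 0."""
    return sum(WEIGHTS.get(word, 0.0) for word in clean(text))


article = "De economie: onzekerheid, Onzekerheid!"
print(term_frequencies(article))
print(score(article))
```

Because every word is lowercased before it is looked up, the stop word list and the weight table need lowercase keys as well.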