digitalmethodsinitiative / 4cat

The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.

Tokeniser exclusion list ignores last word in list #275

Closed robbydigital closed 1 year ago

robbydigital commented 2 years ago

I'm filtering some commonly used words out of a corpus with the Tokenise processor and it only seems to be partially successful. For example, in one month there are 37,325 instances of one word. When I add the word to the reject list there are still 6307 instances of it. So it's catching most, but not all. I'm having the same issue with some common swear words that I'm trying to filter out - most are gone, but some remain. Is there a reason for this?

Thanks for any insight!

robbydigital commented 2 years ago

Ok, so when I tokenised, my list of exclusions included "word", a few common curse words, and, as the last item in the list, "word's" (a common formulation of this word in my dataset). When I checked the tokenised strings against the dataset I noticed the formulation "word's" had been included for some reason. I tried again with "word's" as the second item in the list of exclusions and for whatever reason it worked. I don't really know why. But if anyone else has this problem, give that a go, I guess...

stijn-uva commented 2 years ago

Thanks for the investigation @robbydigital :) This sounds like something we should actually double-check, so I'm reopening this to keep it on our to-do list!

dale-wahl commented 2 years ago

@robbydigital, is the dataset available publicly? I do not see anything immediately obvious in the code that could account for that bug.

robbydigital commented 2 years ago

@dale-wahl yes, it's public - it's this one: https://4cat.oilab.nl/results/2cca8dd122068fa0f4179f040eba1a01/. Hopefully that link should work. You should be able to see that I tried the tokeniser several times with various exclusion lists.

When my exclusion list was "npc, fuck, fucking, shit, npc's" I was still getting about 6307 "npc" tokens. I looked through the JSON file of tokenised strings and compared it against the full dataset, and it appeared that "npc's" and "npcs" were both being tokenised as "npc" in the 2018-10 JSON file that I was checking the full dataset against.

When I revised the exclusion list to "npc, npc's, npcs, npc?, npc!, npc's!, npcs!, npcs?, fuck, fucking, shit" it solved the problem for me. I couldn't find any "npc" tokens in the 2018-10 JSON file.

Perhaps it's not actually a bug but an issue with excluding acronyms, although that doesn't explain why so many "npc's" were retained when I listed that on the exclusion list initially.

dale-wahl commented 1 year ago

So there are some interesting things going on here, but I think the order of operations is the likely cause of the behavior you experienced. The processor roughly does the following (see the sketch after the list):

  1. Use the chosen tokenizer (tweet or word) to break a document into tokens/words
  2. Check if the exact token/word is in the reject list
  3. Stem the token/word
  4. Lemmatise the token/word
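
To make this concrete, here's a rough sketch of that order using NLTK. This is not 4CAT's actual code; the specific choices (word_tokenize, PorterStemmer, WordNetLemmatizer, and the example reject list) are assumptions for illustration only, and the printed output may vary slightly between NLTK versions.

```python
# Rough sketch of the processing order described above -- NOT 4CAT's actual code.
# Assumes nltk is installed and the "punkt" and "wordnet" data have been downloaded.
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

reject = {"npc", "fuck", "fucking", "shit"}        # exclusion list: exact matches only
stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

def tokenise(document):
    tokens = []
    for token in word_tokenize(document.lower()):  # 1. break the document into tokens
        if token in reject:                        # 2. exact match against the reject list
            continue
        token = stemmer.stem(token)                # 3. stem
        token = lemmatiser.lemmatize(token)        # 4. lemmatise
        tokens.append(token)
    return tokens

print(tokenise("the npc and the npcs"))
# -> ['the', 'and', 'the', 'npc']
# "npc" is caught by the reject list at step 2, but "npcs" is not an exact
# match, slips through, and is only reduced to "npc" at step 3.
```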

We reject words before they are stemmed/lemmatised so that they match exactly what you intend. This way, if you really don't want to hear about "farmers", we do not also reject "farm", "farms", "farming", etc. However, if you reject the word "npc", the word "npcs" will not match, and the stemmer will then turn "npcs" into "npc" and count that word.

Looking at your datasets, this had nothing to do with ignoring the last word; it was simply that "npcs" was not rejected and was then changed by the lemmatiser to "npc" in your case. I hope that makes sense.

As a side note for others who may have a similar experience, depending on which tokenizer you choose, breaking words apart in the first step is handled differently.

nltk's word_tokenize has its own rules on breaking up apostrophes, which can sometimes seem odd. For example, it purposefully breaks a word like "we'll" into "we" and "'ll" (which represent "we" and "will"). You can look at some of the differences between word_tokenize and the TweetTokenizer here. It seems like the TweetTokenizer isn't breaking apostrophes in the same way.
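
As a quick illustration (assuming NLTK is installed with the punkt data; exact output can vary by NLTK version):

```python
# Compare how the two tokenizers mentioned above handle apostrophes.
from nltk.tokenize import word_tokenize, TweetTokenizer

text = "we'll reject the npc's dialogue"

print(word_tokenize(text))
# roughly: ['we', "'ll", 'reject', 'the', 'npc', "'s", 'dialogue']

print(TweetTokenizer().tokenize(text))
# roughly: ["we'll", 'reject', 'the', "npc's", 'dialogue']
```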

I'm not exactly sure whether either tokenizer would break "npc's" into "npc" and "'s" or keep it as "npc's", but neither of those is an exact match for "npcs", which would leave some remaining "npc" stemmed tokens.

dale-wahl commented 1 year ago

I'm going to close this issue @robbydigital since I do not think there is an action for us to take, but do let us know if you experience anything else odd that we should take a look at or seems like a bug. Thanks for reporting!