chbrown / liwc-python

Linguistic Inquiry and Word Count (LIWC) analyzer
MIT License

How to show only specific categories? #9

Closed · myrainbowandsky closed 4 years ago

myrainbowandsky commented 4 years ago
c_counts = Counter(category for token in Corpus['text'][1] for category in parse((token)))

I got:

Counter({'social (Social)': 74,
         'verb (Verbs)': 97,
         'drives (Drives)': 49,
         'reward (Reward)': 9,
         'focuspresent (Present Focus)': 66,
         'function (Function Words)': 334,
         'conj (Conjunctions)': 43,
         'adj (Adjectives)': 41,
         'affect (Affect)': 47,
         'posemo (Positive Emotions)': 42,
         'work (Work)': 44,
         'article (Articles)': 70,
         'pronoun (Pronouns)': 45,
         'ipron (Impersonal Pronouns)': 28,
         'relativ (Relativity)': 81,
         'motion (Motion)': 4,
         'time (Time)': 26,
         'prep (Prepositions)': 123,
         'percept (Perceptual Processes)': 28,
         'hear (Hear)': 22,
         'auxverb (Auxiliary Verbs)': 43,
         'compare (Comparisons)': 31,
         'quant (Quantifiers)': 69,
         'cogproc (Cognitive Processes)': 63,
         'tentat (Tentative)': 21,
         'adverb (Adverbs)': 23,
         'space (Space)': 54,
         'power (Power)': 30,
         'interrog (Interrogatives)': 14,
         'certain (Certainty)': 5,
         'ppron (Personal Pronouns)': 17,
         'they (They)': 13,
         'focuspast (Past Focus)': 24,
         'number (Numbers)': 6,
         'money (Money)': 16,
         'achieve (Achievement)': 17,
         'differ (Differentiation)': 21,
         'see (See)': 4,
         'insight (Insight)': 11,
         'discrep (Discrepancies)': 9,
         'focusfuture (Future Focus)': 3,
         'bio (Biological Processes)': 4,
         'health (Health)': 4,
         'i (I)': 1,
         'affiliation (Affiliation)': 3,
         'leisure (Leisure)': 2,
         'cause (Causal)': 3,
         'home (Home)': 3,
         'negemo (Negative Emotions)': 5,
         'sad (Sad)': 2,
         'risk (Risk)': 3,
         'shehe (SheHe)': 3,
         'male (Male)': 2,
         'negate (Negations)': 6,
         'female (Female)': 1,
         'anx (Anx)': 1})

What if I just want to show, say,

'focuspast (Past Focus)': 24, 
'focusfuture (Future Focus)': 3,
'cogproc (Cognitive Processes)': 63,

I have a stupid method using

name_list=[
    'focuspast (Past Focus)', 'focusfuture (Future Focus)', 'cogproc (Cognitive Processes)',
]
for k, v in c_counts.items():
    while k in name_list:
        print(k, v)
        break

Is there a smarter approach?

chbrown commented 4 years ago

Hi @myrainbowandsky, sorry for the delayed response. This is really up to you, and your "stupid" method is pretty similar to what I might do in your position — there's nothing inherently wrong with filtering down to certain categories out of the full results.

That said, if you're looking for elegance, there are a couple fixes I'd recommend:

# extra parentheses around "token" when calling "parse(token)" are unnecessary
c_counts = Counter(category for token in Corpus['text'][1] for category in parse(token))

# use a set for faster performance — Python can check whether an item is in a set in constant
# time, vs. asking if `some_string in some_list`, in which case Python looks at each item in
# `some_list` and checks whether `some_list[i] == some_string`
name_list = {
    'focuspast (Past Focus)', 
    'focusfuture (Future Focus)', 
    'cogproc (Cognitive Processes)',
}

for k, v in c_counts.items():
    # a simple "if" will do the same thing as your "while ... break"
    if k in name_list:
        print(k, v)

You could also create a new count dictionary, rather than filtering while you're printing:

selected_c_counts = {k: v for k, v in c_counts.items() if k in name_list}
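
One wrinkle with that comprehension: categories that never matched won't appear at all. If you'd rather list every selected category even when its count is 0, you can index the Counter directly, since a Counter returns 0 for missing keys:

selected_c_counts = {k: c_counts[k] for k in name_list}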

Finally — and this is a bit more involved — if you really want to optimize performance, you could dig into the source code and filter out unwanted categories while parsing the lexicon, which would reduce the size of the trie used to look up matches for each token. But that's probably a lot more work than you need to solve your problem :)
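
If you did want to try that last route, one way to do it without touching the library's internals is to rewrite the .dic file itself and then load the reduced copy with load_token_parser as usual. Here's a rough sketch; the file paths are placeholders, and it assumes the standard .dic layout (a header of id/name pairs between two '%' lines, then tab-separated word entries) while ignoring the conditional entries some older dictionaries use:

import liwc

def load_filtered_parser(dic_path, keep_names, filtered_path):
    # Write a copy of a LIWC .dic file that keeps only the categories named
    # in keep_names, then load it with liwc.load_token_parser as usual.
    with open(dic_path) as f:
        lines = f.read().splitlines()

    # the category header sits between two '%' marker lines
    markers = [i for i, line in enumerate(lines) if line.strip() == '%']
    header = lines[markers[0] + 1:markers[1]]
    body = lines[markers[1] + 1:]

    # keep only the numeric ids of the categories we care about
    keep_ids, kept_header = set(), []
    for line in header:
        if not line.strip():
            continue
        cat_id, cat_name = line.split('\t', 1)
        if cat_name.strip() in keep_names:
            keep_ids.add(cat_id)
            kept_header.append(line)

    # rewrite each word entry with only the kept category ids,
    # dropping words that no longer belong to any kept category
    kept_body = []
    for line in body:
        word, *ids = line.split('\t')
        ids = [i for i in ids if i in keep_ids]
        if ids:
            kept_body.append('\t'.join([word] + ids))

    with open(filtered_path, 'w') as f:
        f.write('%\n' + '\n'.join(kept_header) + '\n%\n' + '\n'.join(kept_body) + '\n')

    return liwc.load_token_parser(filtered_path)

# the paths are hypothetical; the names must match your dictionary's header
# exactly (they're the same strings you see as keys in c_counts)
parse, category_names = load_filtered_parser(
    'LIWC2015.dic',
    keep_names={
        'focuspast (Past Focus)',
        'focusfuture (Future Focus)',
        'cogproc (Cognitive Processes)',
    },
    filtered_path='LIWC2015_filtered.dic',
)

Whether the smaller trie actually buys you noticeable speed will depend on your corpus size; for a one-off analysis, the dict-comprehension filter above is almost certainly enough.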