xehu opened this issue 3 months ago
Per email communications with Jamie Pennebaker and Ryan Boyd, the LIWC team is OK with us having the encrypted/compressed version of the old LIWC dictionary as part of the Team Communication Toolkit. However, users with access to more recent versions of the LIWC dictionary may be interested in "plugging in" their own local copy of a more up-to-date dictionary and using that instead of using our cached 2007 version.
The objective of "Bring Your Own LIWC," or BYO-LIWC, is to make this possible. Broadly, it requires three steps:
First, we must modify the FeatureBuilder to accept a parameter (`custom_liwc_dictionary`) containing the path to the user's local copy of the dictionary (in the `.dic` format shared by Ryan).
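As a minimal sketch (the simplified constructor and validation details here are illustrative, not the FeatureBuilder's actual signature), this could look like:

```python
import os

class FeatureBuilder:
    def __init__(self, chat_data, custom_liwc_dictionary=None):
        # Keep the path only if it plausibly points to a .dic file;
        # otherwise warn and fall back to the cached 2007 features alone.
        self.custom_liwc_dictionary = None
        if custom_liwc_dictionary is not None:
            if os.path.isfile(custom_liwc_dictionary) and custom_liwc_dictionary.endswith(".dic"):
                self.custom_liwc_dictionary = custom_liwc_dictionary
            else:
                print("WARNING: custom_liwc_dictionary should be a path to a .dic file. "
                      "Skipping custom LIWC features...")
```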
Once we are able to get a parameter for the path to the dictionary, we need to parse the contents.
The `.dic` file has two parts: the header, which maps numbers to category names, and the body, which maps words in the lexicon to different categories.
The header is a portion of text data between two '%' symbols. It looks like this:

```
%
1   category1
2   category2
3   category3
4   category4
...
%
```
This means that the number "1" corresponds to the category named "category1."
In the body, the leftmost item is a word or word stem (e.g., `run*` is a stem that captures "run," "running," etc.). Next to each stem is a sequence of numbers, which tells us the category (or categories) that the given word belongs to:
```
word1    20 30 31 50 51
word2*   30 31 50 51
word3    20 30 31 50 51 90
```
For example, in this case, word1 belongs to multiple categories (20, 30, 31, 50, and 51).
Ryan has provided in his email some possible starting scripts for parsing this format: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L81
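For illustration, a minimal parser for this two-part structure might look like the sketch below (the helper name `parse_dic_file` and the encoding choice are assumptions on my part; Ryan's script handles many more edge cases):

```python
def parse_dic_file(path):
    """Parse a .dic file into (category_id -> name) and (word -> [category names])."""
    with open(path, encoding="utf-8-sig") as f:
        text = f.read()

    # The header sits between the first two '%' delimiter lines
    parts = text.split("%")
    if len(parts) < 3:
        raise ValueError("not a valid .dic file: missing '%' header delimiters")
    header, body = parts[1], parts[2]

    # Header: each line maps a numeric ID to a category name (e.g., "1   category1")
    categories = {}
    for line in header.strip().splitlines():
        if not line.strip():
            continue
        cat_id, cat_name = line.split(None, 1)
        categories[cat_id] = cat_name.strip()

    # Body: each line is a word/stem followed by its category IDs
    # (e.g., "word2*   30 31 50 51")
    lexicon = {}
    for line in body.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        lexicon[tokens[0]] = [categories[c] for c in tokens[1:] if c in categories]

    return categories, lexicon
```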
In this step, we should add error handling in case the provided file is not in the expected `.dic` format. If this is the case, we should present a warning and continue building the features without the custom LIWC features.
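Using the hypothetical `parse_dic_file` helper sketched above, the warn-and-continue behavior could be as simple as:

```python
custom_lexicon = None
try:
    _, custom_lexicon = parse_dic_file(custom_liwc_dictionary)
except (ValueError, UnicodeDecodeError, OSError) as e:
    print(f"WARNING: could not parse custom LIWC dictionary ({e}). "
          "Continuing without custom LIWC features...")
```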
Next, we need to convert the LIWC dictionary format to regular expressions that we can easily incorporate into the way we compute features. Currently, an example of how we process lexicons is in `utils/check_embeddings`: https://github.com/Watts-Lab/team_comm_tools/blob/main/src/team_comm_tools/utils/check_embeddings.py
```python
import os
import re

# Read in the lexicons (helper function for generating the pickle file)
def read_in_lexicons(directory, lexicons_dict):
    # `directory` is a pathlib.Path, so directory/filename joins paths
    for filename in os.listdir(directory):
        with open(directory/filename, encoding="mac_roman") as lexicons:
            if filename.startswith("."):
                continue
            lines = []
            for lexicon in lexicons:
                # get rid of parentheses
                lexicon = lexicon.strip()
                lexicon = lexicon.replace('(', '')
                lexicon = lexicon.replace(')', '')
                if '*' not in lexicon:
                    lines.append(r"\b" + lexicon.replace("\n", "") + r"\b")
                else:
                    # get rid of any cases of multiple repeat -- e.g., '**'
                    lexicon = lexicon.replace('\**', '\*')
                    # build the final lexicon
                    lines.append(r"\b" + lexicon.replace("\n", "").replace("*", "") + r"\S*\b")
            clean_name = re.sub('.txt', '', filename)
            lexicons_dict[clean_name] = "|".join(lines)
```
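A parsed custom dictionary could be converted to the same `{category: regex}` shape as `lexicons_dict` by applying the same `\b` / `\S*` convention per category. Here is a sketch building on the hypothetical parser output above (note that I've added `re.escape` for safety, which the current helper does not use):

```python
import re
from collections import defaultdict

def liwc_dict_to_regexes(lexicon):
    """Convert {word_or_stem: [category names]} into {category name: regex}."""
    category_patterns = defaultdict(list)
    for word, cat_names in lexicon.items():
        if '*' not in word:
            pattern = r"\b" + re.escape(word) + r"\b"
        else:
            # A trailing '*' marks a stem: match any continuation of the word
            pattern = r"\b" + re.escape(word.replace('*', '')) + r"\S*\b"
        for cat in cat_names:
            category_patterns[cat].append(pattern)
    return {cat: "|".join(pats) for cat, pats in category_patterns.items()}
```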
Currently, an example of how we generate the LIWC features is as follows (https://github.com/Watts-Lab/team_comm_tools/blob/main/src/team_comm_tools/features/lexical_features_v2.py):
```python
import os
import pickle

import pandas as pd

# get_liwc_rate is a rate-computation helper defined elsewhere in this module

def liwc_features(chat_df: pd.DataFrame, message_col) -> pd.DataFrame:
    """
    This function takes in the chat-level input dataframe and computes lexical features
    (rates at which the message contains words from a given lexicon, such as LIWC).

    Args:
        chat_df (pd.DataFrame): This is a pandas dataframe of the chat-level features. Should contain a 'message' column.
        message_col (str): This is a string with the name of the column containing the message / text.

    Returns:
        pd.DataFrame: Dataframe of the lexical features stacked as columns.
    """
    # Load the preprocessed lexical regular expressions
    try:
        current_dir = os.path.dirname(__file__)
        lexicon_pkl_file_path = os.path.join(current_dir, './assets/lexicons_dict.pkl')
        lexicon_pkl_file_path = os.path.abspath(lexicon_pkl_file_path)
        with open(lexicon_pkl_file_path, "rb") as lexicons_pickle_file:
            lexicons_dict = pickle.load(lexicons_pickle_file)

        # Return the lexical features stacked as columns
        return pd.concat(
            # Finding the # of occurrences of lexicons of each type for all the messages.
            [pd.DataFrame(chat_df[message_col + "_original"].apply(lambda chat: get_liwc_rate(regex, chat)))
                .rename({message_col + "_original": lexicon_type + "_lexical_per_100"}, axis=1)
             for lexicon_type, regex in lexicons_dict.items()],
            axis=1
        )
    except:
        print("WARNING: Lexicons not found. Skipping feature...")
```
The `lexicons_dict.pkl` file contains lexicons other than just LIWC, and my thought is that comparing the 2007 version to the 2015 version might actually be valuable. So, rather than simply replacing the 2007 version, users who pass in a valid dictionary file would get an additional set of features generated.
What this could look like is a modification to `liwc_features` above, so that it takes the dictionary as an argument; then, we could call it once on the "regular" (2007) dictionary and once on the "new" (user-provided) dictionary. We could then update `calculate_chat_level_features` so that it makes the second call only if the user provided a valid dictionary; otherwise, we maintain the status quo.
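A sketch of that modification (assuming the module's existing `pandas as pd` import and `get_liwc_rate` helper; `cached_lexicons_dict` and `custom_liwc_regexes` are illustrative names for the regex dictionaries loaded upstream):

```python
def liwc_features(chat_df, message_col, lexicons_dict, suffix="_lexical_per_100"):
    # Same computation as above, but the lexicon dict is now an argument,
    # so the function can serve both the 2007 and user-provided dictionaries
    return pd.concat(
        [pd.DataFrame(chat_df[message_col + "_original"]
                      .apply(lambda chat: get_liwc_rate(regex, chat)))
            .rename({message_col + "_original": lexicon_type + suffix}, axis=1)
         for lexicon_type, regex in lexicons_dict.items()],
        axis=1
    )

# Inside calculate_chat_level_features (illustrative):
features = liwc_features(chat_df, message_col, cached_lexicons_dict)
if custom_liwc_regexes is not None:
    # A distinct suffix keeps the 2007 and custom feature columns side by side
    custom = liwc_features(chat_df, message_col, custom_liwc_regexes,
                           suffix="_custom_lexical_per_100")
    features = pd.concat([features, custom], axis=1)
```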
When developing this feature, we should try it with the 2015 version of the dictionary that Ryan provided, as well as invalid versions (to check that error checking functions as expected).
We'll also want to update the documentation to explain how users should expect to use this feature, e.g., here: https://conversational-featurizer.readthedocs.io/en/latest/basics.html
I borrowed Ryan's code for loading .dic files: https://github.com/Watts-Lab/team_comm_tools/blob/yuxuan/bring_your_own_liwc/src/team_comm_tools/utils/check_embeddings.py#L179
In the FeatureBuilder, if the user provides a valid path that ends in `.dic`, we'll try to load the dictionary. We can add more error checking in the future. https://github.com/Watts-Lab/team_comm_tools/blob/yuxuan/bring_your_own_liwc/src/team_comm_tools/feature_builder.py#L132
I followed the same method as in `read_in_lexicons()`. I updated it a bit because `lexicon = lexicon.replace('\**', '\*')` will throw an invalid syntax error.
Per the instructions above, if the custom LIWC dictionary is provided, we'll generate a second set of lexicon features. I can further update this after #306 is resolved.
We include a corpus of words in our features called LIWC, which is under a license that allows it to be distributed for free only for academic purposes. The current version of these features in our repository is the 2007 version of the lexicon. However, users may wish to "bring" another version of the dictionary and ask us to generate up-to-date features for them.
Proposed Feature - Bring Your Own LIWC: LIWC is a lexicon, which means it is literally a list of words. We simply ask the user to point us to a local directory where the word list is stored, and we will calculate the metrics for them.