Watts-Lab / team_comm_tools

An open-source Python library that turns multiparty conversational data into social-science-backed features.
https://teamcommtools.seas.upenn.edu/
MIT License

Bring Your Own LIWC #281

Open xehu opened 3 months ago

xehu commented 3 months ago

Our features include a corpus of words called LIWC, which is under a license that allows free distribution only for academic purposes. The version of this lexicon currently in our repository is the 2007 version. However, users may wish to "bring" another version of the dictionary and ask us to generate up-to-date features from it.

Proposed Feature - Bring your own LIWC: LIWC is a lexicon, which means it is literally a list of words. We simply ask the user to point us to a local directory where the word list is stored, and we will calculate the metrics for them.

xehu commented 2 months ago

Implementation Specification: "Bring Your Own LIWC"

Per email communications with Jamie Pennebaker and Ryan Boyd, the LIWC team is OK with us having the encrypted/compressed version of the old LIWC dictionary as part of the Team Communication Toolkit. However, users with access to more recent versions of the LIWC dictionary may be interested in "plugging in" their own local copy of a more up-to-date dictionary and using that instead of using our cached 2007 version.

The objective of "Bring Your Own LIWC," or BYO-LIWC, is to make this possible. Broadly, it requires three steps:

Step 1: Read in the Dictionary

Step 1a: Modify FeatureBuilder to Accept Custom LIWC Dictionary

First, we must modify the FeatureBuilder to accept a parameter called custom_liwc_dictionary.
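As a rough sketch (not the final signature; the real constructor takes many more arguments, and custom_liwc_dictionary is simply the parameter name proposed above), this might look like:

from pathlib import Path

class FeatureBuilder:
    # Sketch only: custom_liwc_dictionary is a path to the user's local .dic
    # file; None means we fall back to the cached 2007 lexicon alone.
    def __init__(self, custom_liwc_dictionary: str = None, **kwargs):
        self.custom_liwc_dictionary = None
        if custom_liwc_dictionary is not None:
            path = Path(custom_liwc_dictionary)
            if path.suffix == ".dic" and path.exists():
                self.custom_liwc_dictionary = path
            else:
                print("WARNING: custom_liwc_dictionary does not point to a .dic file; ignoring it.")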

Step 1b: Read in .dic File

Once we are able to get a parameter for the path to the dictionary, we need to parse the contents.

The .dic file has two parts: the header, which maps numbers to "category names," and the body, which maps words in the lexicon to different categories.

The Header

The header is a portion of text data between two '%' symbols. It looks like this...

%
1   category1
2   category2
3   category3
4   category4
...
%

This means that the number "1" corresponds to "category1," and so on.

The Body

In the body, the leftmost item is a word or word stem (e.g., run* is a stem that captures "run," "running," etc.). Next to each stem is a sequence of numbers, which tells us the category (or categories) that the given word belongs to:

word1   20  30  31  50  51
word2*  30  31  50  51
word3   20  30  31  50  51  90

For example, in this case, word1 belongs to multiple categories (20, 30, 31, 50, and 51).

In his email, Ryan provided a possible starting script for parsing this format: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L81
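For concreteness, a minimal parsing sketch under the two-part format described above (load_liwc_dic is a hypothetical helper name, not the library's API) could look like this:

def load_liwc_dic(path):
    """Parse a LIWC .dic file into (categories, lexicon) dictionaries."""
    categories = {}  # category number -> category name
    lexicon = {}     # word or stem -> list of category names
    with open(path, encoding="utf-8-sig") as f:
        lines = [line.strip() for line in f if line.strip()]
    # The header sits between the first two '%' lines.
    header_end = lines.index("%", 1)
    for line in lines[1:header_end]:
        number, name = line.split(maxsplit=1)
        categories[number] = name
    # The body maps each word or stem to one or more category numbers.
    for line in lines[header_end + 1:]:
        word, *category_numbers = line.split()
        lexicon[word] = [categories[c] for c in category_numbers]
    return categories, lexicon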

Step 2: Confirm that the Expected Dictionary Matches the Format

In this step, we should add error handling in case the file the user points us to does not parse as a valid .dic dictionary in the expected format.

If this is the case, present a warning and continue building the features without the custom LIWC features.
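As a minimal illustration of this warn-and-continue behavior (reusing the hypothetical load_liwc_dic helper from the sketch above):

try:
    categories, lexicon = load_liwc_dic(custom_liwc_dictionary)
except Exception as e:
    # Warn and fall back: build the features without the custom LIWC lexicon.
    print(f"WARNING: could not parse the custom LIWC dictionary ({e}). "
          "Continuing without the custom LIWC features.")
    categories, lexicon = None, None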

Step 3: Convert the LIWC dictionary to a regular expression

Next, we need to convert the LIWC dictionary format to a regular expression that we can easily incorporate into the way that we compute features. Currently, an example of how we process lexicons is in: https://github.com/Watts-Lab/team_comm_tools/blob/main/src/team_comm_tools/utils/check_embeddings.py (utils/check_embeddings)

import os
import re

# Read in the lexicons (helper function for generating the pickle file)
def read_in_lexicons(directory, lexicons_dict):
    for filename in os.listdir(directory):
        with open(directory/filename, encoding = "mac_roman") as lexicons:
            if filename.startswith("."):
                continue
            lines = []
            for lexicon in lexicons:
                # get rid of parentheses
                lexicon = lexicon.strip()
                lexicon = lexicon.replace('(', '')
                lexicon = lexicon.replace(')', '')
                if '*' not in lexicon:
                    lines.append(r"\b" + lexicon.replace("\n", "") + r"\b")
                else:
                    # get rid of any cases of multiple repeat -- e.g., '**'
                    lexicon = lexicon.replace('\**', '\*')

                    # build the final lexicon
                    lines.append(r"\b" + lexicon.replace("\n", "").replace("*", "") + r"\S*\b")
        clean_name = re.sub('.txt', '', filename)
        lexicons_dict[clean_name] = "|".join(lines)
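Following the same approach, the parsed custom dictionary could be converted into one regular expression per category; a rough sketch (custom_liwc_to_regexes is a hypothetical name, mirroring read_in_lexicons above):

import re

def custom_liwc_to_regexes(lexicon):
    """Turn {word or stem: [category, ...]} into {category: regex}."""
    per_category = {}
    for word, categories in lexicon.items():
        word = word.replace('(', '').replace(')', '')
        if word.endswith('*'):
            # Stems (e.g., run*) match any continuation: run, running, runner...
            pattern = r"\b" + re.escape(word.rstrip('*')) + r"\S*\b"
        else:
            pattern = r"\b" + re.escape(word) + r"\b"
        for category in categories:
            per_category.setdefault(category, []).append(pattern)
    # Join each category's patterns with '|', as in lexicons_dict above.
    return {category: "|".join(patterns) for category, patterns in per_category.items()}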

Step 4: Generate the features

Currently, an example of how we generate the LIWC features is as follows (https://github.com/Watts-Lab/team_comm_tools/blob/main/src/team_comm_tools/features/lexical_features_v2.py):

import os
import pickle

import pandas as pd

def liwc_features(chat_df: pd.DataFrame, message_col) -> pd.DataFrame:
    """
        This function takes in the chat-level input dataframe and computes lexical features
        (rates at which the message contains words from a given lexicon, such as LIWC).

    Args:
        chat_df (pd.DataFrame): This is a pandas dataframe of the chat level features. Should contain 'message' column.
        message_col (str): This is a string with the name of the column containing the message / text.

    Returns:
        pd.DataFrame: Dataframe of the lexical features stacked as columns.
    """
    # Load the preprocessed lexical regular expressions
    try:
        current_dir = os.path.dirname(__file__)
        lexicon_pkl_file_path = os.path.join(current_dir, './assets/lexicons_dict.pkl')
        lexicon_pkl_file_path = os.path.abspath(lexicon_pkl_file_path)
        with open(lexicon_pkl_file_path, "rb") as lexicons_pickle_file:
            lexicons_dict = pickle.load(lexicons_pickle_file)

        # Return the lexical features stacked as columns
        return pd.concat(
            # Finding the # of occurrences of lexicons of each type for all the messages.
            [pd.DataFrame(chat_df[message_col + "_original"].apply(lambda chat: get_liwc_rate(regex, chat)))\
                                            .rename({message_col + "_original": lexicon_type + "_lexical_per_100"}, axis=1)\
                for lexicon_type, regex in lexicons_dict.items()], 
            axis=1
        )
    except:
        print("WARNING: Lexicons not found. Skipping feature...")

The lexicons_dict.pkl file contains lexicons other than just LIWC, and my thought is that comparing the 2007 version to the 2015 version might actually be valuable. So, rather than simply replacing the 2007 version, those who pass in a valid dictionary file would just get a new set of additional features generated.

What this could look like is a modification to liwc_features above, so that it takes the dictionary as an argument; then, we could call it once on the "regular" (2007) dictionary, and once on the "new" (user-provided) dictionary. We could then modify calculate_chat_level_features so that it makes the second call only if the user provided a valid dictionary; otherwise, we maintain the status quo.
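A minimal sketch of that control flow, assuming liwc_features is refactored to take the lexicon-to-regex dictionary as an argument (the variable names are illustrative; custom_liwc_regexes would come from something like the conversion sketch in Step 3, and is None when no valid dictionary was provided):

# Sketch only: assumes liwc_features(chat_df, message_col, lexicons) accepts
# the lexicon-to-regex dictionary as a third argument.
feature_frames = [liwc_features(chat_df, message_col, lexicons_dict)]  # cached 2007 lexicons
if custom_liwc_regexes is not None:
    # The second call adds a parallel set of columns from the user-provided dictionary;
    # in practice, these columns would need a distinguishing suffix.
    feature_frames.append(liwc_features(chat_df, message_col, custom_liwc_regexes))
chat_df = pd.concat([chat_df] + feature_frames, axis=1)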

Step 5: Clean up, Test, and Document

When developing this feature, we should try it with the 2015 version of the dictionary that Ryan provided, as well as invalid versions (to check that error checking functions as expected).

We'll also want to update the documentation to explain how users should expect to use this feature, e.g., here: https://conversational-featurizer.readthedocs.io/en/latest/basics.html

sundy1994 commented 1 month ago

Read in .dic File

I borrowed Ryan's code for loading .dic files: https://github.com/Watts-Lab/team_comm_tools/blob/yuxuan/bring_your_own_liwc/src/team_comm_tools/utils/check_embeddings.py#L179

In the FeatureBuilder, if the user provides a valid path that ends in .dic, we'll try to load the dictionary. We can add more error checking in the future. https://github.com/Watts-Lab/team_comm_tools/blob/yuxuan/bring_your_own_liwc/src/team_comm_tools/feature_builder.py#L132

Convert to RegEx

I followed the same method as in read_in_lexicons(). I updated it a bit because lexicon = lexicon.replace('\**', '\*') throws an invalid syntax error.
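For illustration, one way to avoid the invalid escape sequence (an assumption, not necessarily the exact change in the branch) is to switch to raw strings:

# Assumed fix: raw strings avoid the '\*' invalid escape sequence while keeping
# the original replacement behavior; the actual change in the branch may differ.
lexicon = lexicon.replace(r'\**', r'\*')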

Generate the features

Per the instructions above, if the custom LIWC dict is provided, we'll generate a second set of lexicon features. I can further update this after #306 is resolved.