Example of repo with config file: https://github.com/xehu/tpm-horse-race-modeling
Curate a minimum list of columns required for a user to run the featurizer

- [ ] `speaker_nickname`, `message`, and `conversation_num`

The original features require columns to have these specific names, whereas a potential user's data might not contain these exact columns.
Proposed Solution:
Curate a minimum list of columns required for a user to run the featurizer
See related issue: https://github.com/Watts-Lab/team-process-map/issues/140
@PriyaDCosta -- I think the broader point behind this is that we have to make a design decision; how "strict" do we want to be in terms of demanding the user set up their dataframe in a way that works for our system, and how "flexible" do we want to be?
At one end of the extreme, we can simply say, "OK, this is our formatting requirement. We ask you to rename your columns before passing your data to us, sorry."
At the other end of the extreme, we can say, "We just need something that refers to a speaker ID, something that refers to a conversation ID, and something that refers to the message inside. We do not care what you name them, as long as you tell us the names."
Figuring out where we want to be on this spectrum --- what do we want to ask the user to configure? What is ALLOWED to be configured? --- is a design decision that we need to have a clear philosophy on.
A possible resource for thinking about modularizing the featurizer: https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview
I think BERTopic has fantastic resources/documentation on this, and this could be a good model for us; the idea would be that we have strong defaults, but users should be able to override them (i.e., they can use a different way of vectorizing if they want, or a different way of aggregating if they want...)
The open question is: how do we want to design the interface for that? How do we connect it smoothly? Where do we let users put in their preferences? Config file, or something else?
Interface design thoughts:
For the config file, the diagram lists dataset path, output path, and feature list, but I think these could be more easily set up with constructor arguments. For example, we already instantiate FeatureBuilder with the input and output paths. We could also modify FeatureBuilder parameters to include feature selections.
For more detailed and less frequently changed settings, like preprocessing details or feature-specific options, there can be an option for users to provide a Config file (which can get passed into a `load_config(self, filepath)` function that we would need to create in the FeatureBuilder class). If a Config file is not provided, the default could be to turn everything on. I think this setup provides straightforward instantiation of FeatureBuilder with the necessary inputs while allowing deep customization without cluttering the primary interface.
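For example (a sketch only --- the `features` parameter here is hypothetical, not the current interface):

```python
fb = FeatureBuilder(
    input_path="data/my_conversations.csv",
    output_path="output/features.csv",
    features=["feature1", "feature2"],  # feature selection via constructor argument
)
```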
The Config file could look something like:
```yaml
features:
  include: ["feature1", "feature2"]
  exclude: ["feature3", "feature4"]
aggregation:
  methods: ["mean", "std"]
  columns: ["column1", "column2"]
preprocessing:
  remove_uppercase: true
  remove_punctuation: false
  custom_preprocess: "path/to/custom_script.py"
embeddings:
  force_regeneration: true
```
Users could then load their Config file after instantiating the FeatureBuilder, like `fb.load_config(path)`, and before calling `featurize()` on it.
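Putting it together, usage might look something like this (a sketch, assuming the hypothetical `load_config` method described above):

```python
fb = FeatureBuilder(input_path="data/my_conversations.csv",
                    output_path="output/features.csv")
fb.load_config("config.yaml")  # optional: deep customization
fb.featurize()                 # falls back to defaults if no config was loaded
```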
`featurize.py`

I'm still not sure I understand why users, after downloading the package, would need to run the `featurize.py` file in the package. What tests would it run, and why would users need it? I thought users would create their own separate code file and import and instantiate FeatureBuilder there. `featurize.py` as a standalone script from the package wouldn't have information about the user's input/output paths, feature selections, etc., so I am also a bit confused about how it would run / what purpose it would serve.
If there is a need for users to be able to run a script from our package after installing it, a popular approach is to use entry points. That way, users can execute the script straight from the command line.
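For instance (a sketch with hypothetical package and module names):

```python
# setup.py
from setuptools import setup

setup(
    name="team-process-map",
    # ... other metadata ...
    entry_points={
        "console_scripts": [
            # exposes a `featurize` command that calls main() in featurize.py
            "featurize=team_process_map.featurize:main",
        ],
    },
)
```

After installation, users could then simply run `featurize` from the command line.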
> I'm still not sure I understand why users, after downloading the package, would need to run the `featurize.py` file in the package. What tests would it run, and why would users need it? I thought users would create their own separate code file and import and instantiate FeatureBuilder there. `featurize.py` as a standalone script from the package wouldn't have information about the user's input/output paths, feature selections, etc., so I am also a bit confused about how it would run / what purpose it would serve.
@zhouhelena I love these thoughts!! To answer the question quoted above: I think that `featurize.py` currently serves two purposes, which likely need to be separated:

1. `featurize.py` functions as a file that really should be named `run_tests.py`: it declares a FeatureBuilder on the test dataset and confirms that the assertions are correct. We want to retain this in the packaged version.
2. `featurize.py` serves as an example of how to call the FeatureBuilder. But users aren't going to run `featurize.py` per se; they're going to want to read the code and copy it into whatever they're writing. In this case, I actually think we'll likely want to replace the script with something like a demo notebook that shows people what the interface and configuration options are.

I really like the idea of making the config optional, so that people don't have to pass anything in if they like the defaults, but the more advanced users can use the configs to customize what they get.
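A minimal sketch of that "optional config" pattern, assuming the config format proposed above (the default values and method name here are hypothetical):

```python
import yaml  # PyYAML

DEFAULT_CONFIG = {
    "features": {"include": "all", "exclude": []},
    "aggregation": {"methods": ["mean", "max", "min", "std"]},
    "preprocessing": {"remove_uppercase": True, "remove_punctuation": True},
    "embeddings": {"force_regeneration": False},
}

class FeatureBuilder:
    def __init__(self, input_path, output_path):
        self.input_path = input_path
        self.output_path = output_path
        # Defaults apply unless the user explicitly overrides them.
        self.config = {k: dict(v) for k, v in DEFAULT_CONFIG.items()}

    def load_config(self, filepath):
        """Merge user settings over the defaults; omitted keys keep their defaults."""
        with open(filepath) as f:
            user_config = yaml.safe_load(f) or {}
        for section, settings in user_config.items():
            self.config.setdefault(section, {}).update(settings)
```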
Adding image describing changes from 04/19/24 discussion:
Cleaning up Technical Debt
Many of the featurizer options in our current codebase had been specific to one of our original datasets or motivated by a specific project --- and these should be cleaned up: turned into a more generalizable format (i.e., giving people a very clear and explicit choice/customization) or removed entirely (i.e., being clear about what formats we do / do not accept).
Here is a (non-comprehensive) list of examples:
- There is hard-coded logic in `preprocess.py` where we group by batch and round number. But these names are very specific to the juries dataset! One can imagine that people can have any name for identifying their columns (see the sketch after this list). There is another example where we hard-code specific parameters for the multi-task dataset (in which the names are "stageId" and "roundId"); this is the `create_cumulative_rows` function.
- The `check_embeddings` file checks whether embeddings exist for a dataset, and generates them if they do not yet exist. However, if they already exist, there is no way to force regenerating them (unless you delete the files). This design assumed that datasets do not change; however, in reality, they do change! We add new rows to test datasets all the time, so we should make it possible to specify when to regenerate embeddings.
- Increasingly, `featurize.py` is turning into more of a testing file --- it's just one script that is used to call the FeatureBuilder. Users will be calling the FeatureBuilder in iPython notebooks, or in their own scripts / setups. We may consider renaming the file something like `run_tests.py` and putting it into the testing folder as the primary script that runs our tests. We no longer need all the (currently commented-out) calls to other datasets; those belong in the respective projects using each dataset, and not in the FeatureBuilder, which we intend to separately package up.
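As a sketch of the first cleanup (the function and parameter names here are hypothetical, not the current API), the grouping keys could be supplied by the user instead of hard-coded:

```python
import pandas as pd

def assign_conversation_ids(df: pd.DataFrame, grouping_keys: list) -> pd.DataFrame:
    """Label each row with a conversation_num, using whatever columns
    the user says jointly identify a conversation."""
    df = df.copy()
    df["conversation_num"] = df.groupby(grouping_keys, sort=False).ngroup()
    return df

# The juries dataset would pass its own column names...
# df = assign_conversation_ids(df, ["batch_num", "round_num"])
# ...while the multi-task dataset would pass different ones.
# df = assign_conversation_ids(df, ["stageId", "roundId"])
```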
Open Discussion: What's the User Experience of Using our Featurizer?
More broadly, these examples point to a larger problem: what do we want the user experience of the featurizer to look like? Let's use this issue to come up with a list of what we want to get cleaned up and how we want to design things.
Proposed Solution: Add a YAML/Config File.
What if we asked for the user's preferences via a YAML/Config file, then built a featurizer to the user's specifications? This might create a clean interface that allows the user to state all of their preferred options in a single location, and lets us abstract away some of the decisions that are otherwise hardcoded or reflect technical debt.
The flow would look something like this:
Another way of thinking about this
We can think about the overall "flow" of our featurizer as having the following steps:

1. Preprocess the chat data
2. Generate embeddings
3. Compute features
4. Aggregate the features
At each step of this flow, we currently have specific design decisions / settings --- for example, preprocessing with lowercase/punctuation removed by default; using SBERT embeddings (rather than any other type of embedding); keeping all features (rather than some specific subset); and using mean, max, min, std to aggregate. What if, at a bare minimum, we just think through ways that we can let the user have more control over this --- to, in other words, control whether they want to use the defaults, or use something else?
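Concretely, the user could accept the defaults for every step and override only the ones they care about (a sketch; the parameter name is hypothetical):

```python
# Keep the default preprocessing, embeddings, and feature set,
# but swap in a different set of aggregation statistics:
fb = FeatureBuilder(
    input_path="data/my_conversations.csv",
    output_path="output/features.csv",
    aggregation_methods=["median", "sum"],  # overrides the mean/max/min/std default
)
```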
What we expect from the user
We would expect the user to give us something like the following:
- A dataframe containing, at a minimum, the required columns: `speaker_nickname`, `message`, and `conversation_num` (see the sketch below)
- The ability to install our package (via `pip install`) and easily set up dependencies
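For instance, a minimal input dataframe might look like this (illustrative values only):

```python
import pandas as pd

# The bare minimum the featurizer needs: who spoke, what they said,
# and which conversation each message belongs to.
df = pd.DataFrame({
    "conversation_num": [0, 0, 1],
    "speaker_nickname": ["alice", "bob", "alice"],
    "message": ["hi team!", "hello!", "let's get started."],
})
```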
What the user should expect from us