Watts-Lab / team_comm_tools

An open-source Python library that turns multiparty conversational data into social-science backed features.
https://teamcommtools.seas.upenn.edu/
MIT License

[Discussion] Technical Debt and Design Decisions #198

Closed: xehu closed this issue 7 months ago

xehu commented 7 months ago

Cleaning up Technical Debt

Many of the featurizer options in our current codebase are specific to one of our original datasets or were motivated by a specific project, and these should be cleaned up: either turned into a more generalizable format (i.e., giving people a very clear and explicit choice/customization) or removed entirely (i.e., being clear about what formats we do and do not accept).

Here is a (non-comprehensive) list of examples:

    # If data is grouped by batch/round, add a conversation num
    if {'batch_num', 'round_num'}.issubset(df.columns):
        df['conversation_num'] = df.groupby(['batch_num', 'round_num']).ngroup()
        df = df[df.columns.tolist()[-1:] + df.columns.tolist()[0:-1]] # make the new column first

But these column names are specific to the juries dataset! Users may use any names to identify their grouping columns. There is another example where we hard-code parameters specific to the multi-task dataset (in which the names are "stageId" and "roundId"): the create_cumulative_rows function.
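For instance, here is a minimal sketch of how this could be generalized, assuming a hypothetical grouping_keys parameter that lets users name their own grouping columns:

    import pandas as pd

    def assign_conversation_num(df: pd.DataFrame, grouping_keys: list) -> pd.DataFrame:
        """Derive conversation_num from user-specified grouping columns,
        rather than the hard-coded ['batch_num', 'round_num'] pair."""
        missing = set(grouping_keys) - set(df.columns)
        if missing:
            raise ValueError(f"Grouping columns not found in data: {missing}")
        df = df.copy()
        df['conversation_num'] = df.groupby(grouping_keys).ngroup()
        # Move the new column to the front, mirroring the original behavior
        cols = ['conversation_num'] + [c for c in df.columns if c != 'conversation_num']
        return df[cols]

    # e.g., for the juries dataset:
    # df = assign_conversation_num(df, grouping_keys=['batch_num', 'round_num'])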

Open Discussion: What's the User Experience of Using our Featurizer?

More broadly, these examples point to a larger problem: what do we want the user experience of the featurizer to look like? Let's use this issue to come up with a list of what we want to get cleaned up and how we want to design things.

Proposed Solution: Add a YAML/Config File.

What if we asked for the user's preferences via a YAML/Config file, then built a featurizer to the user's specifications? This might create a clean interface to allow the user to state all of their preferred options in a single, clean location, and for us to abstract away some of the decisions that are otherwise hardcoded or reflect technical debt.
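As a rough sketch (assuming PyYAML; all option names below are hypothetical, not settled), the featurizer could merge a user's YAML file over a set of library defaults:

    import yaml  # PyYAML

    # Hypothetical defaults; the actual option names are an open design question.
    DEFAULTS = {
        "preprocessing": {"lowercase": True, "remove_punctuation": True},
        "embeddings": {"model": "sbert"},
        "aggregation": {"methods": ["mean", "max", "min", "std"]},
    }

    def load_featurizer_config(path: str) -> dict:
        """Shallow-merge a user-supplied YAML config over the library defaults."""
        with open(path) as f:
            user_config = yaml.safe_load(f) or {}
        return {k: {**v, **user_config.get(k, {})} for k, v in DEFAULTS.items()}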

The flow would look something like this: [diagram: TPM Design Flow]

Another way of thinking about this

We can think about the overall "flow" of our featurizer as having the following steps: [diagram: featurizer abstractions]

At each step of this flow, we currently have specific design decisions / settings --- for example, preprocessing with lowercase/punctuation removed by default; using SBERT embeddings (rather than any other type of embedding); keeping all features (rather than some specific subset); and using mean, max, min, and std to aggregate. What if, at a bare minimum, we think through ways to give the user more control over this --- in other words, to let them choose whether to use the defaults or something else?

What we expect from the user

We would expect the user to give us something like the following:
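As a hypothetical illustration (the exact fields are an open design question), the inputs might be a dataset plus a handful of preferences:

    # Hypothetical user-supplied inputs; all names here are illustrative.
    user_inputs = {
        "input_path": "data/my_conversations.csv",   # one row per chat message
        "output_path": "output/features.csv",
        "column_names": {                            # where to find the required columns
            "conversation_id": "thread_id",
            "speaker_id": "author",
            "message": "text",
        },
        "features": ["feature1", "feature2"],        # optional subset; default: all
    }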

What the user should expect from us

xehu commented 7 months ago

Example of repo with config file: https://github.com/xehu/tpm-horse-race-modeling

PriyaDCosta commented 7 months ago

  • [ ] speaker_nickname, message, and conversation_num: The original features require columns to have these specific names, whereas a potential user's data might not contain these exact columns.

Proposed Solution:

Curate a minimum list of columns required for a user to run the featurizer.

See related issue: https://github.com/Watts-Lab/team-process-map/issues/140

xehu commented 7 months ago

> Proposed Solution: Curate a minimum list of columns required for a user to run the featurizer.

@PriyaDCosta -- I think the broader point behind this is that we have to make a design decision; how "strict" do we want to be in terms of demanding the user set up their dataframe in a way that works for our system, and how "flexible" do we want to be?

At one end of the extreme, we can simply say, "OK, this is our formatting requirement. We ask you to rename your columns before passing your data to us, sorry."

At the other end of the extreme, we can say, "We just need something that refers to a speaker ID, something that refers to a conversation ID, and something that refers to the message inside. We do not care what you name them, as long as you tell us the names."

Figuring out where we want to be on this spectrum --- what do we want to ask the user to configure? What is ALLOWED to be configured? --- is a design decision that we need to have a clear philosophy on.
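To make the flexible end concrete, a sketch might look like this (assuming a hypothetical col_mapping argument; the internal names match our current required columns):

    import pandas as pd

    # Internal canonical names the featurizer currently expects
    REQUIRED = ("conversation_num", "speaker_nickname", "message")

    def standardize_columns(df: pd.DataFrame, col_mapping: dict) -> pd.DataFrame:
        """Rename the user's columns to the featurizer's internal names.

        col_mapping maps internal names to the user's names, e.g.
        {"conversation_num": "thread_id", "speaker_nickname": "author",
         "message": "text"}.
        """
        rename_map = {user_col: internal for internal, user_col in col_mapping.items()}
        df = df.rename(columns=rename_map)
        missing = [c for c in REQUIRED if c not in df.columns]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        return df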

xehu commented 7 months ago

A possible resource for thinking about modularizing the featurizer: https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview

I think BERTopic has fantastic resources/documentation on this, and this could be a good model for us; the idea would be that we have strong defaults, but users should be able to override them (i.e., they can use a different way of vectorizing if they want, or a different way of aggregating if they want...)

The open question is: how do we want to design the interface for that? How do we connect it smoothly? Where do we let users put in their preferences? Config file, or something else?
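One possible answer, modeled on BERTopic's pattern of constructor injection (all names below are hypothetical sketches, not our current API):

    def default_preprocessor(text: str) -> str:
        # Current default behavior: lowercase and strip punctuation
        return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())

    class FeatureBuilder:
        """Sketch of a modular interface: strong defaults, swappable parts."""
        def __init__(self, input_path, output_path,
                     preprocessor=default_preprocessor,  # swap in any callable
                     vectorizer=None,                    # None -> SBERT default
                     aggregators=("mean", "max", "min", "std")):
            self.input_path = input_path
            self.output_path = output_path
            self.preprocessor = preprocessor
            self.vectorizer = vectorizer
            self.aggregators = list(aggregators)

    # A user overriding just the aggregation step:
    # fb = FeatureBuilder("in.csv", "out.csv", aggregators=["mean", "median"])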

zhouhelena commented 7 months ago

Interface design thoughts:

For the config file, the diagram lists dataset path, output path, and feature list, but I think these could be more easily set up with constructor arguments. For example, we already instantiate FeatureBuilder with the input and output paths. We could also modify FeatureBuilder parameters to include feature selections.

For more detailed and less frequently changed settings, like preprocessing details or feature-specific options, there can be an option for users to provide a Config file (which would get passed into a load_config(self, filepath) method that we would need to create in the FeatureBuilder class). If a Config file is not provided, the default could be to turn everything on. I think this setup provides straightforward instantiation of FeatureBuilder with the necessary inputs while allowing deep customization without cluttering the primary interface.

The Config file could look something like:

    features:
      include: ["feature1", "feature2"]
      exclude: ["feature3", "feature4"]
      aggregation:
        methods: ["mean", "std"]
        columns: ["column1", "column2"]

    preprocessing:
      remove_uppercase: true
      remove_punctuation: false
      custom_preprocess: "path/to/custom_script.py"

    embeddings:
      force_regeneration: true

Users could then load their Config file after instantiating the FeatureBuilder like fb.load_config(path) and before calling featurize() on it.
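For instance (import path and parameter names here are illustrative, and load_config is the method proposed above):

    from team_comm_tools import FeatureBuilder  # illustrative import path

    fb = FeatureBuilder(
        input_file_path="data/my_conversations.csv",   # hypothetical parameter names
        output_file_path="output/features.csv",
    )
    fb.load_config("featurizer_config.yaml")  # optional; defaults apply otherwise
    fb.featurize()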

I'm still not sure I understand why users, after downloading the package, would need to run the featurize.py file in the package. What tests would it run, and why would users need it? I thought users would create their own separate code file and import and instantiate FeatureBuilder there. featurize.py as a standalone script from the package wouldn't have information about the user's input/output paths, feature selections, etc., so I am also a bit confused about how it would run and what purpose it would serve.

If there is a need for users to run a script from our package after installing it, I found that a popular approach is to use entry points. This way, users can execute the script directly from the command line.
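For example, with setuptools (the script, module, and function names below are hypothetical):

    # setup.py (excerpt): expose a console script via an entry point.
    from setuptools import setup

    setup(
        name="team_comm_tools",
        # ... other metadata ...
        entry_points={
            "console_scripts": [
                # Running `tct-featurize` on the command line would call
                # the (hypothetical) main() in team_comm_tools/featurize.py
                "tct-featurize=team_comm_tools.featurize:main",
            ],
        },
    )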

xehu commented 7 months ago

> I'm still not sure I understand why users, after downloading the package, would need to run the featurize.py file in the package. What tests would it run, and why would users need it? I thought users would create their own separate code file and import and instantiate FeatureBuilder there. featurize.py as a standalone script from the package wouldn't have information about the user's input/output paths, feature selections, etc., so I am also a bit confused about how it would run and what purpose it would serve.

@zhouhelena I love these thoughts!! To answer the question quoted above: I think that featurize.py currently serves two purposes, which likely need to be separated:

  1. It runs the tests. When people push to our repository, I want them to have an easy way to check that the features they are building work, and confirm that nothing in the pipeline breaks. Currently, featurize.py functions as a file that really should be named run_tests.py: it declares a FeatureBuilder on the test dataset and confirms that the assertions are correct. We want to retain this in the packaged version.
  2. It is a sample for how users are supposed to use the interface. When users download our package for the first time, they won't know how to declare a FeatureBuilder or what they need to pass in. The fact that there is an existing script with a sample that they can copy/paste is nice. Users are not going to run featurize.py per se; they're going to want to read the code and copy it into whatever they're writing. In this case, I actually think we'll likely want to replace the script with something like a demo notebook that shows people what the interface and configuration options are.

I really like the idea of making the config optional, so that people don't have to pass anything in if they like the defaults, but the more advanced users can use the configs to customize what they get.

xehu commented 7 months ago

Adding image describing changes from 04/19/24 discussion:

[diagram: tpm_packaging_steps]