Example of repo with config file: https://github.com/xehu/tpm-horse-race-modeling
Curate a minimum list of columns required for a user to run the featurizer

- [ ] `speaker_nickname`, `message`, and `conversation_num`

The original features require columns to have these specific names, whereas a potential user's data might not contain these exact columns.
Proposed Solution:
Curate a minimum list of columns required for a user to run the featurizer
See related issue: https://github.com/Watts-Lab/team-process-map/issues/140
@PriyaDCosta -- I think the broader point behind this is that we have to make a design decision; how "strict" do we want to be in terms of demanding the user set up their dataframe in a way that works for our system, and how "flexible" do we want to be?
At one end of the extreme, we can simply say, "OK, this is our formatting requirement. We ask you to rename your columns before passing your data to us, sorry."
At the other end of the extreme, we can say, "We just need something that refers to a speaker ID, something that refers to a conversation ID, and something that refers to the message inside. We do not care what you name them, as long as you tell us the names."
Figuring out where we want to be on this spectrum --- what do we want to ask the user to configure? What is ALLOWED to be configured? --- is a design decision that we need to have a clear philosophy on.
A possible resource for thinking about modularizing the featurizer: https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview
I think BERTopic has fantastic resources/documentation on this, and this could be a good model for us; the idea would be that we have strong defaults, but users should be able to override them (i.e., they can use a different way of vectorizing if they want, or a different way of aggregating if they want...)
The open question is: how do we want to design the interface for that? How do we connect it smoothly? Where do we let users put in their preferences? Config file, or something else?
Interface design thoughts:
For the config file, the diagram lists dataset path, output path, and feature list, but I think these could be more easily set up with constructor arguments. For example, we already instantiate FeatureBuilder with the input and output paths. We could also modify FeatureBuilder parameters to include feature selections.
For more detailed and less frequently changed settings, like preprocessing details or feature-specific options, there can be an option for users to provide a Config file (which can get passed into a `load_config(self, filepath)` function that we would need to create in the FeatureBuilder class). If a Config file is not provided, the default could be to turn everything on. I think this setup provides straightforward instantiation of FeatureBuilder with the necessary inputs while allowing deep customization without cluttering the primary interface.
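For example (a sketch only --- the `features` parameter here is hypothetical, not the current interface):

```python
fb = FeatureBuilder(
    input_path="data/my_conversations.csv",
    output_path="output/features.csv",
    features=["feature1", "feature2"],  # feature selection via constructor argument
)
```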
The Config file could look something like:
```yaml
features:
  include: ["feature1", "feature2"]
  exclude: ["feature3", "feature4"]
aggregation:
  methods: ["mean", "std"]
  columns: ["column1", "column2"]
preprocessing:
  remove_uppercase: true
  remove_punctuation: false
  custom_preprocess: "path/to/custom_script.py"
embeddings:
  force_regeneration: true
```
Users could then load their Config file after instantiating the FeatureBuilder, like `fb.load_config(path)`, and before calling `featurize()` on it.
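Putting it together, usage might look something like this (a sketch, assuming the hypothetical `load_config` method described above):

```python
fb = FeatureBuilder(input_path="data/my_conversations.csv",
                    output_path="output/features.csv")
fb.load_config("config.yaml")  # optional: deep customization
fb.featurize()                 # falls back to defaults if no config was loaded
```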
`featurize.py`

I'm still not sure I understand why users, after downloading the package, would need to run the `featurize.py` file in the package. What tests would it run, and why would users need it? I thought users would create their own separate code file and import and instantiate FeatureBuilder there. `featurize.py` as a standalone script from the package wouldn't have information about the user's input/output paths, feature selections, etc., so I am also a bit confused about how it would run / what purpose it would serve.
If there is a need for users to be able to run a script from our package after installing it, a popular approach is to use entry points. That way, users can execute the script straight from the command line.
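For instance (a sketch with hypothetical package and module names):

```python
# setup.py
from setuptools import setup

setup(
    name="team-process-map",
    # ... other metadata ...
    entry_points={
        "console_scripts": [
            # exposes a `featurize` command that calls main() in featurize.py
            "featurize=team_process_map.featurize:main",
        ],
    },
)
```

After installation, users could then simply run `featurize` from the command line.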
> I'm still not sure I understand why users, after downloading the package, would need to run the `featurize.py` file in the package. What tests would it run, and why would users need it? I thought users would create their own separate code file and import and instantiate FeatureBuilder there. `featurize.py` as a standalone script from the package wouldn't have information about the user's input/output paths, feature selections, etc., so I am also a bit confused about how it would run / what purpose it would serve.
@zhouhelena I love these thoughts!! To answer the question quoted above: I think that `featurize.py` currently serves two purposes, which likely need to be separated:

1. `featurize.py` functions as a file that really should be named `run_tests.py`: it declares a FeatureBuilder on the test dataset and confirms that the assertions are correct. We want to retain this in the packaged version.
2. `featurize.py` serves as an example of how to call the FeatureBuilder. But users aren't going to run `featurize.py` per se; they're going to want to read the code and copy it into whatever they're writing. In this case, I actually think we'll likely want to replace the script with something like a demo notebook that shows people what the interface and configuration options are.

I really like the idea of making the config optional, so that people don't have to pass anything in if they like the defaults, but the more advanced users can use the configs to customize what they get.
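A minimal sketch of that "optional config" pattern, assuming the config format proposed above (the default values and method name here are hypothetical):

```python
import yaml  # PyYAML

DEFAULT_CONFIG = {
    "features": {"include": "all", "exclude": []},
    "aggregation": {"methods": ["mean", "max", "min", "std"]},
    "preprocessing": {"remove_uppercase": True, "remove_punctuation": True},
    "embeddings": {"force_regeneration": False},
}

class FeatureBuilder:
    def __init__(self, input_path, output_path):
        self.input_path = input_path
        self.output_path = output_path
        # Defaults apply unless the user explicitly overrides them.
        self.config = {k: dict(v) for k, v in DEFAULT_CONFIG.items()}

    def load_config(self, filepath):
        """Merge user settings over the defaults; omitted keys keep their defaults."""
        with open(filepath) as f:
            user_config = yaml.safe_load(f) or {}
        for section, settings in user_config.items():
            self.config.setdefault(section, {}).update(settings)
```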
Adding image describing changes from 04/19/24 discussion:
Cleaning up Technical Debt
Many of the featurizer options in our current codebase had been specific to one of our original datasets or motivated by a specific project --- and these should be cleaned up: turned into a more generalizable format (i.e., giving people a very clear and explicit choice/customization) or removed entirely (i.e., being clear about what formats we do / do not accept).
Here is a (non-comprehensive) list of examples:
- There is hard-coded logic in `preprocess.py` where we group by batch and round number. But these names are very specific to the juries dataset! One can imagine that people can have any name for identifying their columns (see the sketch after this list). There is another example where we hard-code specific parameters for the multi-task dataset (in which the names are "stageId" and "roundId"); this is the `create_cumulative_rows` function.
- The `check_embeddings` file checks whether embeddings exist for a dataset, and generates them if they do not yet exist. However, if they already exist, there is no way to force regenerating them (unless you delete the files). This design assumed that datasets do not change; however, in reality, they do change! We add new rows to test datasets all the time, so we should make it possible to specify when to regenerate embeddings.
- Increasingly, `featurize.py` is turning into more of a testing file --- it's just one script that is used to call the FeatureBuilder. Users will be calling the FeatureBuilder in iPython notebooks, or in their own scripts / setups. We may consider renaming the file something like `run_tests.py` and putting it into the testing folder as the primary script that runs our tests. We no longer need all the (currently commented-out) calls to other datasets; those belong in the respective projects using each dataset, and not in the FeatureBuilder, which we intend to separately package up.
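As a sketch of the first cleanup (the function and parameter names here are hypothetical, not the current API), the grouping keys could be supplied by the user instead of hard-coded:

```python
import pandas as pd

def assign_conversation_ids(df: pd.DataFrame, grouping_keys: list) -> pd.DataFrame:
    """Label each row with a conversation_num, using whatever columns
    the user says jointly identify a conversation."""
    df = df.copy()
    df["conversation_num"] = df.groupby(grouping_keys, sort=False).ngroup()
    return df

# The juries dataset would pass its own column names...
# df = assign_conversation_ids(df, ["batch_num", "round_num"])
# ...while the multi-task dataset would pass different ones.
# df = assign_conversation_ids(df, ["stageId", "roundId"])
```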
Open Discussion: What's the User Experience of Using our Featurizer?
More broadly, these examples point to a larger problem: what do we want the user experience of the featurizer to look like? Let's use this issue to come up with a list of what we want to get cleaned up and how we want to design things.
Proposed Solution: Add a YAML/Config File.
What if we asked for the user's preferences via a YAML/Config file, then built a featurizer to the user's specifications? This might create a clean interface that allows the user to state all of their preferred options in a single location, and lets us abstract away some of the decisions that are otherwise hardcoded or reflect technical debt.
The flow would look something like this:
Another way of thinking about this
We can think about the overall "flow" of our featurizer as having the following steps:

1. Preprocess the chat data
2. Generate embeddings
3. Compute features
4. Aggregate the features
At each step of this flow, we currently have specific design decisions / settings --- for example, preprocessing with lowercase/punctuation removed by default; using SBERT embeddings (rather than any other type of embedding); keeping all features (rather than some specific subset); and using mean, max, min, std to aggregate. What if, at a bare minimum, we just think through ways that we can let the user have more control over this --- to, in other words, control whether they want to use the defaults, or use something else?
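Concretely, the user could accept the defaults for every step and override only the ones they care about (a sketch; the parameter name is hypothetical):

```python
# Keep the default preprocessing, embeddings, and feature set,
# but swap in a different set of aggregation statistics:
fb = FeatureBuilder(
    input_path="data/my_conversations.csv",
    output_path="output/features.csv",
    aggregation_methods=["median", "sum"],  # overrides the mean/max/min/std default
)
```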
What we expect from the user
We would expect the user to give us something like the following:
- A dataframe containing, at a minimum, the required columns: `speaker_nickname`, `message`, and `conversation_num` (see the sketch below)
- The ability to install our package (via `pip install`) and easily set up dependencies
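For instance, a minimal input dataframe might look like this (illustrative values only):

```python
import pandas as pd

# The bare minimum the featurizer needs: who spoke, what they said,
# and which conversation each message belongs to.
df = pd.DataFrame({
    "conversation_num": [0, 0, 1],
    "speaker_nickname": ["alice", "bob", "alice"],
    "message": ["hi team!", "hello!", "let's get started."],
})
```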
What the user should expect from us