Watts-Lab / team-process-map

MIT License

Allow User to Customize Preprocessing #204

Open xehu opened 2 months ago

xehu commented 2 months ago

By default, we preprocess text columns to (1) turn words into lowercase and (2) remove punctuation. We also have some features that require special preprocessing; for example, question detection requires that punctuation not be removed, and some of the reddit features (e.g., detection of all-caps words; detection of parentheses and emojis) require retaining both casing and punctuation.

This is currently handled through the way that functions are called in calculate_chat_level_features.py (https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/calculate_chat_level_features.py). For example, here are features that use the "default" preprocessing:

self.chat_data["function_words"] = self.chat_data["message"].apply(lambda x: get_function_words_in_message(x, function_word_reference = self.function_words))
self.chat_data["content_words"] = self.chat_data["message"].apply(lambda x: get_content_words_in_message(x, function_word_reference = self.function_words))
self.chat_data['message'].apply(get_politeness_strategies).apply(pd.Series)

Here are examples of functions that require custom preprocessing; message_lower_with_punc retains punctuation, and message_original retains both punctuation and capitalization:

self.chat_data["num_all_caps"] = self.chat_data["message_original"].apply(count_all_caps)
self.chat_data["num_links"] = self.chat_data["message_lower_with_punc"].apply(count_links)
self.chat_data["num_reddit_users"] = self.chat_data["message_lower_with_punc"].apply(count_user_references)
self.chat_data["num_emphasis"] = self.chat_data["message_lower_with_punc"].apply(count_emphasis)
self.chat_data["num_bullet_points"] = self.chat_data["message_lower_with_punc"].apply(count_bullet_points)
self.chat_data["num_numbered_points"] = self.chat_data["message_lower_with_punc"].apply(count_numbering)
self.chat_data["num_line_breaks"] = self.chat_data["message_lower_with_punc"].apply(count_line_breaks)
self.chat_data["num_quotes"] = self.chat_data["message_lower_with_punc"].apply(count_quotes)
self.chat_data["num_block_quote_responses"] = self.chat_data["message_lower_with_punc"].apply(count_responding_to_someone)
self.chat_data["num_ellipses"] = self.chat_data["message_lower_with_punc"].apply(count_ellipses)
self.chat_data["num_parentheses"] = self.chat_data["message_lower_with_punc"].apply(count_parentheses)
self.chat_data["num_emoji"] = self.chat_data["message_lower_with_punc"].apply(count_emojis)

However, more advanced users may need preprocessing beyond handling casing and punctuation. For example, in Priya's analysis of Reddit comments, she designed a custom preprocessing step that removes quoted content and hyperlinks to external information --- it is important to her analysis to focus on what a given user is saying, rather than on quoted or external content. Thus, we should make it possible for users to pass in a customized preprocessing function. However, we need to be careful about how custom preprocessing interacts with the existing preprocessing, and we should check that (a) the preprocessing function is valid (it yields the same number of rows, and every value is a string) and (b) we gracefully manage dependencies for features that require different types of preprocessing (e.g., if the user customizes preprocessing, it must not strip the punctuation, capitalization, etc. that certain features need).
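To make checks (a) and (b) concrete, here is a minimal sketch of what validating a user-supplied preprocessor could look like, together with a custom step in the spirit of Priya's analysis (stripping quoted lines and hyperlinks). The function names are hypothetical, not part of the current codebase:

```python
import re
import pandas as pd

def validate_preprocessor(preprocess, messages: pd.Series) -> pd.Series:
    """Apply a user-supplied preprocessing function and sanity-check its output:
    (a) same number of rows as the input; (b) every value is a string."""
    result = messages.apply(preprocess)
    if len(result) != len(messages):
        raise ValueError("Custom preprocessing changed the number of rows.")
    if not result.map(lambda v: isinstance(v, str)).all():
        raise TypeError("Custom preprocessing must return a string per message.")
    return result

def strip_quotes_and_links(message: str) -> str:
    """Example custom step: drop quoted lines ('>' prefix) and hyperlinks,
    keeping only the user's own words."""
    no_quotes = "\n".join(
        line for line in message.split("\n") if not line.startswith(">")
    )
    return re.sub(r"https?://\S+", "", no_quotes)

msgs = pd.Series(["> quoted reply\nI agree!", "see https://example.com now"])
clean = validate_preprocessor(strip_quotes_and_links, msgs)
```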

Getting Started

1. Audit which features require which types of preprocessing.

Look at how each feature is called, and log the type of preprocessing it requires --- does it need casing? Punctuation? Both? Neither? Record it in the feature dependencies: https://github.com/Watts-Lab/team-process-map/issues/209
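The audit's output could be recorded as a simple registry mapping each feature to the preprocessed column it reads. The entries below are illustrative (drawn from the snippets above); the real audit would fill this in from calculate_chat_level_features.py:

```python
# Hypothetical registry: feature name -> preprocessed column it depends on.
FEATURE_PREPROCESSING = {
    "function_words": "message",                # default: lowercase, no punctuation
    "content_words": "message",
    "num_all_caps": "message_original",         # needs casing AND punctuation
    "num_links": "message_lower_with_punc",     # needs punctuation only
    "num_emoji": "message_lower_with_punc",
}

def required_columns(features):
    """Return the set of preprocessed columns a feature selection depends on."""
    return {FEATURE_PREPROCESSING[f] for f in features}
```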

2. Create flags in the FeatureBuilder Constructor for how we want to handle preprocessing globally.

Once we log this information, we can develop structured representations / flags for users to customize their preprocessing. For example, we might create something like this:

preprocessing:
  remove_uppercase: true
  remove_punctuation: false
  custom_preprocess: "path/to/custom_script.py"

... where remove_uppercase and remove_punctuation are flags that set how preprocessing is done globally, with exceptions only for the features that have specific dependencies on alternative ways of preprocessing. (NOTE: Here, we may have a design decision for how we handle such exceptions more gracefully!)
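As a sketch of how those flags might look in the constructor (flag names taken from the example above; the exact signature is a design decision, not the current API):

```python
import pandas as pd

class FeatureBuilder:
    """Sketch: global preprocessing flags in the constructor."""

    def __init__(self, chat_data: pd.DataFrame,
                 remove_uppercase: bool = True,
                 remove_punctuation: bool = True):
        self.chat_data = chat_data.copy()
        # Always retain the raw and punctuation-preserving columns, since some
        # features (e.g., num_all_caps, num_links) depend on them regardless
        # of the global flags.
        self.chat_data["message_original"] = self.chat_data["message"]
        self.chat_data["message_lower_with_punc"] = self.chat_data["message"].str.lower()
        # Apply the global flags only to the default "message" column.
        text = self.chat_data["message"]
        if remove_uppercase:
            text = text.str.lower()
        if remove_punctuation:
            text = text.str.replace(r"[^\w\s]", "", regex=True)
        self.chat_data["message"] = text
```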

3. Ensure changes in the constructor flow through the rest of the logic.

4. (Advanced Feature) Add the option for users to pass in their custom preprocessor.

Once we get to this point, we need to add checks to make sure the preprocessor works as expected and doesn't interact with the existing / default preprocessing in weird ways (see above).
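One way to manage the dependency problem is to apply the custom preprocessor only to the default "message" column, while always rebuilding the special-purpose columns from the untouched raw text. A minimal sketch (the function name is hypothetical):

```python
import pandas as pd

def apply_custom_preprocess(chat_data: pd.DataFrame, custom_preprocess) -> pd.DataFrame:
    """Run the user's preprocessor on the default column only, rebuilding the
    special-purpose columns from the raw text so that features depending on
    casing/punctuation are unaffected by the custom step."""
    out = chat_data.copy()
    # Special columns always come from the original, untouched text.
    out["message_original"] = chat_data["message"]
    out["message_lower_with_punc"] = chat_data["message"].str.lower()
    # The custom step replaces only the default preprocessing pipeline.
    processed = chat_data["message"].apply(custom_preprocess)
    if not processed.map(lambda v: isinstance(v, str)).all():
        raise TypeError("Custom preprocessor must return a string per message.")
    out["message"] = processed
    return out
```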