This Issue has two key goals:

Keep Track of Feature Dependencies: As we make it possible for people to pick and choose different features, we should also track the "dependency graph" --- for example, some features (e.g., hedging, information exchange) depend on some of the other features (e.g., lexical word counts). When we allow people to select which features they want, we should ensure that their selection doesn't break any underlying dependency logic. Issue https://github.com/Watts-Lab/team-process-map/issues/209 proposes saving structured data around what the dependencies are.
Keep track of additional processes/procedures for features: Some features, like sentiment detection, require pre-training word vectors; others, like named entity recognition, may require the user to pass in additional files. There should therefore be checks to ensure that we apply the necessary procedures to generate the requested features --- while also speeding up the process by eliminating unnecessary processing if the user doesn't request a certain feature.
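As a rough illustration of the first goal, here is a minimal sketch of checking a user's selection against a dependency graph. The `FEATURE_DEPENDENCIES` map, the feature names in it, and the `resolve_features` helper are all hypothetical; the structured data proposed in issue 209 may look quite different.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: feature name -> features it needs computed first.
# These names are illustrative, not the project's actual identifiers.
FEATURE_DEPENDENCIES = {
    "lexical_word_counts": set(),
    "hedging": {"lexical_word_counts"},
    "info_exchange": {"lexical_word_counts"},
    "word_mimicry": set(),
}


def resolve_features(requested):
    """Return the requested features plus their transitive dependencies,
    ordered so that every feature comes after the features it depends on."""
    unknown = set(requested) - set(FEATURE_DEPENDENCIES)
    if unknown:
        raise ValueError(f"Unknown features: {sorted(unknown)}")

    # Walk the graph to collect everything the selection needs.
    needed = set()
    stack = list(requested)
    while stack:
        feature = stack.pop()
        if feature not in needed:
            needed.add(feature)
            stack.extend(FEATURE_DEPENDENCIES[feature])

    # Topologically sort only the needed part of the graph.
    subgraph = {feature: FEATURE_DEPENDENCIES[feature] for feature in needed}
    return list(TopologicalSorter(subgraph).static_order())
```

With this map, asking only for `"hedging"` also pulls in `"lexical_word_counts"` and places it first, so the selection cannot silently break the dependency logic.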

Getting Started

Before starting on this issue, make sure that we have addressed https://github.com/Watts-Lab/team-process-map/issues/209 and https://github.com/Watts-Lab/team-process-map/issues/202. In other words, we should already have infrastructure that tracks feature dependencies and a way of ingesting the list of features that users do and do not want.
We calculate the chat-level features in utils/calculate_chat_level_features.py (https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/calculate_chat_level_features.py). In this file, leverage the user-selected feature list and the dependencies to generate the features as efficiently as possible. Right now, since we generate all features by default, we simply call each feature function one at a time. Can we do more in this file to track which dependencies are needed, and call only the features the user wants?

    def calculate_chat_level_features(self) -> pd.DataFrame:
        """
        This is the main driver function for this class.

        RETURNS:
            (pd.DataFrame): The chat level dataset given to this class during initialization along with
                            new columns for each chat level feature.
        """

        # Concat sentiment BERT markers (done through preprocessing)
        self.concat_bert_features()

        # Text-Based Basic Features
        self.text_based_features()

        # "Basic" Info Exchange Feature -- z-scores of content minus first pronouns
        self.info_exchange()

        # lexical features
        self.lexical_features()

        # Other lexical features
        self.other_lexical_features()

        # Word Mimicry
        self.calculate_word_mimicry()

        # Hedge Features
        self.calculate_hedge_features()

        # TextBlob Sentiment features
        self.calculate_textblob_sentiment()

....
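One possible shape for "call only the features the user wants" is a small dispatch table in the driver function. The sketch below is a stand-in, not the real class: the `requested` argument and the feature names on the left of each pair are assumptions, and the stub methods only record that they were called. The method names are borrowed from the excerpt above.

```python
class ChatLevelFeaturesSketch:
    """Stand-in for the real calculator; each stub just records its own name."""

    def __init__(self, requested):
        # Hypothetical: the set of feature names the user asked for.
        self.requested = set(requested)
        self.calls = []

    # Stubs named after methods in the excerpt above.
    def info_exchange(self):
        self.calls.append("info_exchange")

    def lexical_features(self):
        self.calls.append("lexical_features")

    def calculate_hedge_features(self):
        self.calls.append("calculate_hedge_features")

    def calculate_textblob_sentiment(self):
        self.calls.append("calculate_textblob_sentiment")

    def calculate_chat_level_features(self):
        # Ordered (feature name, method) pairs. The names are hypothetical;
        # the order keeps prerequisites ahead of the features that use them.
        steps = [
            ("lexical", self.lexical_features),
            ("info_exchange", self.info_exchange),
            ("hedge", self.calculate_hedge_features),
            ("textblob_sentiment", self.calculate_textblob_sentiment),
        ]
        for name, method in steps:
            if name in self.requested:
                method()
        return self.calls
```

This assumes the caller has already expanded `requested` to include any dependencies, so the driver itself only needs a membership check per step.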

@xehu This looks quite straightforward. I was reading Helena's branch on task 3, and I think she already addressed this issue. Basically, she labeled each feature as chat level or conversation level and feeds them accordingly into the feature builder. In terms of dependencies, are we talking about sub-features required for a feature, or required packages to be installed?

Yup, I think Helena’s branch, if merged, essentially addresses this task! Right now her branch touches on this AND the next task (allowing users to choose features; it has the start of giving people arguments they can pass in for features they want to include or exclude).

By dependencies, I mean that some features require specific preprocessing steps (for example, word vectors) that need to be run before the feature is computed; so, in that sense, it’s the sub-features required for a feature (requirements.txt handles the packages we need installed, and in my mind, it’s a little less important to track exactly which feature uses exactly which package). We need to know what those steps are so that we can skip a preprocessing step when the user doesn’t ask for any of the features that depend on it.

If you look at the channel with Helena, she’s shared a doc (and I’ve added some content) where we’ve basically decided to focus on the two big preprocessing steps (word vectors and sentiment), and to run all the lexical features by default (since it’s cheap and easy to do). However, we are thinking that we’ll default to NOT running the word vectors (generating self.vect_data) because it takes a long time to initially preprocess, and we can save some time if the user doesn’t need the feature. This means we’ll want to track which features depend on these steps and run the preprocessing steps (and the features) depending on those choices.

(In reply to yuxuanzh's comment above, written Jul 14, 2024, at 11:54 PM.)
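To make the preprocessing decision in this reply concrete, here is a small sketch of planning which expensive steps to run. The step names, the feature names, and the `plan_run` helper are hypothetical placeholders; the thread does not spell out which features depend on word vectors or sentiment preprocessing.

```python
# Hypothetical map: expensive preprocessing step -> features that rely on it.
# The feature names are placeholders, not the project's real feature list.
PREPROCESSING_DEPENDENTS = {
    "word_vectors": {"example_vector_feature"},
    "sentiment": {"example_sentiment_feature"},
}

# Cheap features that run by default, as suggested for the lexical features.
ALWAYS_ON = {"lexical"}


def plan_run(requested_features=()):
    """Return (preprocessing steps to run, features to compute).

    A step is included only if at least one requested feature depends on it,
    so with no extra requests the word-vector step is skipped by default.
    """
    requested = set(requested_features)
    steps = {
        step
        for step, dependents in PREPROCESSING_DEPENDENTS.items()
        if dependents & requested
    }
    return steps, ALWAYS_ON | requested
```

For example, `plan_run()` schedules no preprocessing and only the always-on lexical features, while `plan_run(["example_vector_feature"])` adds the `"word_vectors"` step.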


Watts-Lab / team_comm_tools

Check for Feature Requirements #203
