Watts-Lab / team-process-map


Flexibly Check Required Columns #140

Open xehu opened 11 months ago

xehu commented 11 months ago

Basic Requirements

The system generally requires three key columns in order to process a conversation:

  1. A conversation ID, or unique identifier of a conversation;
  2. A message column, containing the text of each utterance;
  3. A user/speaker column, containing the identifier of the person making the utterance.

Optionally, the user can also provide a timestamp column, which can be flexibly formatted as a DateTime object, string, integer (Unix time), or number (e.g., the number of seconds elapsed since the beginning of a conversation).
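For illustration, a minimal input might look like the following (the values here are invented, and the column names simply match the system's current defaults):

import pandas as pd

# A minimal conversation dataset containing the three required columns,
# plus the optional timestamp (here, seconds since the conversation began):
df = pd.DataFrame({
    "conversation_num": [0, 0, 0, 1, 1],
    "speaker_nickname": ["alice", "bob", "alice", "carol", "dan"],
    "message": ["hi all", "hey!", "ready to start?", "hello", "hi"],
    "timestamp": [0, 5, 9, 0, 3],
})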

Advanced Requirements

In more advanced cases, users may have multiple conversation identifiers. For example, consider a case where the same team "meets" multiple times and has multiple discussions. There may be one team ID that identifies the same team, and another session ID that identifies each distinct discussion.

Users may therefore want to group by different columns. In some analyses, users may wish to analyze all conversations by the same team as a single "unit" (grouping only by the team ID). In other analyses, users may wish to examine conversations at the unit of a single "meeting" (grouping by the team ID AND the session ID). In some cases, users want to consider all the sessions "up until the present" (grouping by both team ID and session ID, but treating the chats in each conversation as cumulative).

All of these options should be included as additional supported parameters; we may expect something like the following:
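(Sketch only: the parameter values below are illustrative, although conversation_id and cumulative_grouping already exist in the current constructor, shown later in this Issue.)

# Hypothetical interface sketch; parameter names are open for discussion.

# Analyze each (team, session) pair as its own conversation:
fb = FeatureBuilder(..., conversation_id=["team_id", "session_id"])

# Analyze all conversations by the same team as a single unit:
fb = FeatureBuilder(..., conversation_id=["team_id"])

# Group by team and session, but make the chats in each conversation cumulative:
fb = FeatureBuilder(..., conversation_id=["team_id", "session_id"], cumulative_grouping=True)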

Getting Started

Here are a couple of pointers for getting started on this Issue:

1. Identify places in the code that use specific column names.

The first step to addressing this Issue is to document the places in the code where we have hard-coded specific column names. For example, the assert_key_columns_present function in preprocess.py (https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/preprocess.py) specifically checks for certain column names:

def assert_key_columns_present(df):
    # remove all special characters from df
    df.columns = df.columns.str.replace('[^A-Za-z0-9_]', '', regex=True)

    # Assert that key columns are present
    if {'conversation_num', 'message', 'speaker_nickname'}.issubset(df.columns):
        print("Confirmed that data has `conversation_num`, `message`, and `speaker_nickname` columns!")
        # ensure no NA's in essential columns
        df['message'] = df['message'].fillna('')
        df['conversation_num'] = df['conversation_num'].fillna(0)
        df['speaker_nickname'] = df['speaker_nickname'].fillna(0)
    else:
        print("One of `conversation_num`, `message`, or `speaker_nickname` is missing! Raising error...")
        print("Columns available: ")
        print(df.columns)
        raise KeyError

Other parts of the script also apply similar logic to specific grouping keys (e.g., batch_num and round_num); for example, in preprocess_conversation_columns:

def preprocess_conversation_columns(df, conversation_id = None, cumulative_grouping = False, within_task = False):
    # remove all special characters from df
    df.columns = df.columns.str.replace('[^A-Za-z0-9_]', '', regex=True)

    # If data is grouped by batch/round, add a conversation num
    if {'batch_num', 'round_num'}.issubset(df.columns):
        df['conversation_num'] = df.groupby(['batch_num', 'round_num']).ngroup()
        df = df[df.columns.tolist()[-1:] + df.columns.tolist()[0:-1]] # make the new column first
    if ({'gameId', 'roundId', 'stageId'}.issubset(df.columns) and conversation_id in {'gameId', 'roundId', 'stageId'}):
        if(cumulative_grouping):
            df = create_cumulative_rows(df, conversation_id, within_task)
            df['conversation_num'] = df['cumulative_Id'] # set it to be the cumulative grouping
        else:
            df['conversation_num'] = df[conversation_id] # set it to the desired grouping

    return(df)

Moreover, as we calculate chat-level features, we are looking for the specific column message: https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/calculate_chat_level_features.py

self.chat_data['dale_chall_score'] = self.chat_data['message'].apply(lambda x: dale_chall_helper(x, easy_words = self.easy_dale_chall_words))

And when the features are summarized, we are looking for conversation_num as the specific grouping key: https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/calculate_conversation_level_features.py

        self.conv_data = pd.merge(
            left=self.conv_data,
            right=get_turn(self.chat_data.copy()),
            on=["conversation_num"],
            how="inner"
        )

This even appears in many features, such as word_mimicry.py; here, we use the column conversation_num to measure the extent to which people speak similarly within a given conversation (https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/features/word_mimicry.py):

def mimic_words(df, on_column):
  word_mimic = [[]]
  for i in range(1, len(df)):
    if df.loc[i, "conversation_num"] == df.loc[i-1, "conversation_num"]: # only do this if they're in the same conversation
      word_mimic.append([x for x in df.loc[i, on_column] if x in df.loc[(i-1),on_column]])
    else:
      word_mimic.append([])
  return word_mimic

Here, in turn_taking_features.py, we specifically look for the column speaker_nickname when grouping turns by a speaker's ID: https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/features/turn_taking_features.py

def count_turns(input_data):
    temp_turn_count = 1
    consolidated_actions = []

    start_index = input_data.index[0] + 1
    end_index = len(input_data) + input_data.index[0]

    for turn_index in range(start_index, end_index):

        if input_data["speaker_nickname"][turn_index] == input_data["speaker_nickname"][turn_index - 1]:
            temp_turn_count += 1
        else:
            consolidated_actions.append((input_data["speaker_nickname"][turn_index - 1], temp_turn_count))
            temp_turn_count = 1

        if turn_index == max(range(start_index, end_index)):
            consolidated_actions.append((input_data["speaker_nickname"][turn_index], temp_turn_count))

    assert sum(x[1] for x in consolidated_actions) == len(input_data)

    df_consolidated_actions = pd.DataFrame(columns=["speaker_nickname", "turn_count"], data=consolidated_actions)

    return df_consolidated_actions

2. Clean up the FeatureBuilder Constructor.

This is the current FeatureBuilder constructor in feature_builder.py (https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/feature_builder.py):

class FeatureBuilder:
    def __init__(
            self, 
            input_file_path: str, 
            vector_directory: str,
            output_file_path_chat_level: str, 
            output_file_path_user_level: str,
            output_file_path_conv_level: str,
            analyze_first_pct: list = [1.0], 
            turns: bool=True,
            conversation_id = None,
            cumulative_grouping = False, 
            within_task = False
        ):

We already have an option in which we pass in conversation_id, and the cumulative_grouping and within_task options are initial steps towards implementing the advanced requirement (of supporting different grouping options). However, they are largely designed with a specific dataset in mind (one whose keys were stageId, roundId, and gameId), and even though we pass conversation_id in as a parameter, we still end up using hardcoded strings (like "message" and "conversation_num") throughout the codebase.

Thus, while we have something of a start here, addressing this Issue will require designing thoughtful parameters that capture more generalizable use cases.
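One possible direction (a sketch, not a final design; speaker_id_col and message_col are placeholder names, while conversation_id_col and timestamp_cols are candidate names discussed further below) is a constructor that accepts the key column names directly:

class FeatureBuilder:
    def __init__(
            self,
            input_file_path: str,
            vector_directory: str,
            output_file_path_chat_level: str,
            output_file_path_user_level: str,
            output_file_path_conv_level: str,
            analyze_first_pct: list = [1.0],
            turns: bool = True,
            # Candidate new parameters: the names of the key columns in the user's data.
            conversation_id_col: str = "conversation_num",
            speaker_id_col: str = "speaker_nickname",
            message_col: str = "message",
            timestamp_cols = "timestamp",  # a string, or a [start, end] pair (see below)
            cumulative_grouping: bool = False,
            within_task: bool = False
        ):
        ...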

3. Follow the logic through the rest of the system.

Once we update the parameters in the constructor, the next step is to follow the logic of the FeatureBuilder (e.g., preprocessing, calculating the vectors) and ensure that the changes flow through the rest of the system. For example, rather than checking for specific column names in preprocess.py, we would now check that the parameters are non-empty and point to valid columns (containing the conversation identifier, speaker identifier, and messages).
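For instance, a sketch of a generalized assert_key_columns_present might look like this (the parameter names are placeholders, assuming the FeatureBuilder threads the user-specified names through):

def assert_key_columns_present(df, conversation_id_col, speaker_id_col, message_col):
    # Check that every user-specified key column actually exists in the data
    required = {conversation_id_col, speaker_id_col, message_col}
    missing = required - set(df.columns)
    if missing:
        raise KeyError(
            f"Missing required columns: {missing}. Columns available: {list(df.columns)}"
        )
    # Ensure no NA's in essential columns, as the current version does
    df[message_col] = df[message_col].fillna('')
    df[conversation_id_col] = df[conversation_id_col].fillna(0)
    df[speaker_id_col] = df[speaker_id_col].fillna(0)
    return df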

xehu commented 2 weeks ago

Noting that we actually also have a hard-coded check in get_temporal_features that checks whether we have a timestamp column, or timestamp_start and timestamp_end columns:

if {'timestamp'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff(self.chat_data, "timestamp")
elif {'timestamp_start', 'timestamp_end'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff_startend(self.chat_data)
if {'timestamp'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff(self.chat_data, "timestamp")
elif {'timestamp_start', 'timestamp_end'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff_startend(self.chat_data)

We will need to build this into our check of the timestamp column, and identify whether the user is passing in a single timestamp column, or two columns (one for when an utterance starts, and one for when it ends).
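A rough sketch of what the flexible version of that check could look like (assuming a hypothetical timestamp_cols parameter holding either a single column name or a [start, end] pair):

# Sketch: branch on how many timestamp columns the user specified.
if isinstance(timestamp_cols, str):
    # a single timestamp column
    self.chat_data["time_diff"] = get_time_diff(self.chat_data, timestamp_cols)
elif len(timestamp_cols) == 2:
    # a [start, end] pair; get_time_diff_startend would need the names passed through
    start_col, end_col = timestamp_cols
    self.chat_data["time_diff"] = get_time_diff_startend(self.chat_data)
else:
    raise ValueError("Expected one timestamp column, or a [start, end] pair.")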

sundy1994 commented 1 week ago

@xehu I was adding flexible columns and found that in several files, such as https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/calculate_conversation_level_features.py , we append 'conversation_num' when initializing: self.input_columns.append('conversation_num'). However, to enable customized column names I added self.conversation_id_col = conversation_id_col, which also defaults to 'conversation_num'. Is there a specific reason to manually add this column? Should I remove this line?

xehu commented 1 week ago

@sundy1994 The answer to this question comes from the source of self.input_columns. Basically, there is an assumption that any other columns that come with the chat-level dataframe may be additional inputs or dependent variables that come from the user, but not from us. For example, imagine that I have data with the following columns (an illustrative list; the asterisked names are the system's current defaults):

  • conversation_num*
  • speaker_nickname*
  • message*
  • timestamp*
  • team_score (a user-provided dependent variable)
  • condition (a user-provided experimental condition)

Only the four columns marked with an asterisk (*) would be consumed and summarized by our system. We basically want to save the names of the additional columns that came in with the "original" data and ensure that, in later stages, we do not summarize them as part of the conversation-level features --- because our system didn't generate them, and they are likely unrelated to the conversation!

So, this explains lines 93-94 in feature_builder.py:

        # Input columns are the columns that come in the raw chat data
        self.input_columns = self.chat_data.columns

And this also explains the following line:

        # Denotes the columns that can be summarized from the chat level, onto the conversation level.
        self.input_columns = list(input_columns)
        self.input_columns.append('conversation_num')
        self.columns_to_summarize = [column for column in self.chat_data.columns \
                                     if (column not in self.input_columns) and pd.api.types.is_numeric_dtype(self.chat_data[column])]
        self.summable_columns = ["num_words", "num_chars", "num_messages"]

Basically, these lines of code are saying: (1) everything that arrived with the raw chat data, plus the generated conversation_num, counts as an "input" column; (2) only numeric columns that are not inputs (i.e., columns our featurizer generated) get summarized up to the conversation level; and (3) among those, num_words, num_chars, and num_messages are the ones whose values can be meaningfully summed.

xehu commented 1 week ago

So, to answer your question more succinctly: if we eventually retain the feature in which it is possible to pass in multiple grouping keys and have the system autogenerate a grouping key, then we DO need to retain this line of code.

sundy1994 commented 1 week ago

Thanks @xehu for the explanation! I see why we need it now.

My confusion was that we have been assuming "conversation_num" exists in self.chat_data; this is now conversation_id_col in my code, but it defaults to "conversation_num" too. As you mentioned, input_columns comes from self.chat_data.columns, so it seems there will be two "conversation_num"s in input_columns after appending. So I guess doing this would make more sense:

if 'conversation_num' not in self.input_columns: 
     self.input_columns.append('conversation_num')

What's more, in preprocess_conversation_columns() from preprocess.py, the column "conversation_num" is apparently added from grouping. I understand this happens when conversation_id is in {'gameId', 'roundId', 'stageId'} or when {'batch_num', 'round_num'} are present, but it looks like "conversation_num" is really a special column that we are generating in both scenarios. Lines like df['conversation_num'] = df[conversation_id] make me doubt whether I should default conversation_id_col to "conversation_num".

We generate new columns all the time, such as "message_lower_with_punc", which is very natural. But I think these special names shouldn't be accepted if we want users to customize some column names, at least for names that are similar. For now I just want to clarify whether those "conversation_num"s are the same thing, and whether it will conflict with the real "conversation_id".

sundy1994 commented 1 week ago

Noting that we actually also have a hard-coded check in get_temporal_features that checks whether we have a timestamp column, or timestamp_start and timestamp_end columns:

if {'timestamp'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff(self.chat_data, "timestamp")
elif {'timestamp_start', 'timestamp_end'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff_startend(self.chat_data)
if {'timestamp'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff(self.chat_data, "timestamp")
elif {'timestamp_start', 'timestamp_end'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff_startend(self.chat_data)

We will need to build this into our check of the timestamp column, and identify whether the user is passing in a single timestamp column, or two columns (one for when an utterance starts, and one for when it ends).

Per this issue, I have 2 possible solutions in mind.

  1. Define 2 variables: timestamp_col: str = 'timestamp' and startend: bool = False. The default values represent a single timestamp. If startend = True, the 2 timestamp columns are assumed to be f'{timestamp_col}_start' and f'{timestamp_col}_end', so by default: 'timestamp_start' and 'timestamp_end'. This doesn't allow users to define the 2 timestamp names separately, but I assume no one will give unrelated start/end column names.
  2. Define 1 variable: timestamp_cols: str = 'timestamp'. If 2 timestamps are needed, users can specify timestamp_cols='timestamp_start, timestamp_end', and we can do timestamp_cols.split(',') to check the number of columns. Alternatively, it can be defined as a list. The downside is the scenario where more than 2 columns are given, which can be prevented by raising an error. Moreover, the order must be stated in the documentation.

I personally prefer the first solution. Does either of these make sense to you?

xehu commented 1 week ago

So I guess doing this would make more sense: if 'conversation_num' not in self.input_columns: self.input_columns.append('conversation_num')

Yup, adding the 'if' statement makes sense to me; I think it's cleaner to add it to the list only if we had to generate it ourselves.

I understand it's when conversation_id in {'gameId', 'roundId', 'stageId'} or {'batch_num', 'round_num'}

Technically, and more broadly, it's when conversation_id is not one column but a list of columns, and you have to group by multiple things. The two use cases from my previous code happen to be the ones you described, but the problem is slightly more general. The resulting column is conversation_num, which represents the group index when you do a pandas groupby and pass in the list of columns you're grouping by.
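As a quick illustration of that index (made-up data):

import pandas as pd

df = pd.DataFrame({
    "batch_num": [0, 0, 0, 1],
    "round_num": [1, 1, 2, 1],
})
# Each unique (batch_num, round_num) combination receives one integer label:
df["conversation_num"] = df.groupby(["batch_num", "round_num"]).ngroup()
print(df["conversation_num"].tolist())  # [0, 0, 1, 2]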

We generate new columns all the time, such as "message_lower_with_punc", which is very natural. But I think these special names shouldn't be accepted if we want users to customize some column names, at least for names that are similar.

I think this is an excellent point. The nice thing about the dataframe is that --- I believe --- whatever the user passes in, even if it has an extra column called "message_lower_with_punc," we will overwrite it, so it doesn't affect the functionality of the system (but it may disappoint the user). Perhaps we can add an easy check for this? Just look to see whether any of the special column names we generate already exist in the dataframe the user passed in, and simply print a warning that says something like: "WARNING: the original dataframe contains columns that have identical names as some columns generated by the featurizer. These names will be overwritten in the output data."
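Something like this rough sketch (RESERVED_COLUMNS is a hypothetical set that we would have to maintain alongside the featurizer):

# Sketch: warn when the user's data already contains names the featurizer generates.
RESERVED_COLUMNS = {"message_lower_with_punc", "time_diff"}  # illustrative, not exhaustive

overlapping = RESERVED_COLUMNS & set(df.columns)
if overlapping:
    print(
        "WARNING: the original dataframe contains columns that have identical names "
        f"as some columns generated by the featurizer ({sorted(overlapping)}). "
        "These names will be overwritten in the output data."
    )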

Another way of preventing this from happening (or, rather, making it less likely) is to prepend our generated columns with something like "SYS." so that our columns become "SYS.message_lower_with_punc," denoting them as something generated by our system. And if the user just so happens to have a column with such a strange name, we could provide them with the warning and move on.

A final way, which I prefer a little less, is to be more aggressive --- we drop all other columns except the four we need, or require the user to provide data with EXACTLY the four input columns. I actually find this slightly less convenient as a social scientist; in general, I like my independent and dependent variables to be in the same dataframe, and often the additional columns are something like dependent variables associated with the groups. It would add extra hassle to force the user to merge the DVs back in. Right now, the workflow allows the user to generate features for their communication data, and then explore relationships with other variables of interest directly in the output dataframe. I like that aspect of the design.

Lines like df['conversation_num'] = df[conversation_id] makes me doubt if I should default conversation_id_col to "conversation_num".

For now I just want to clarify if those "conversation_num"s are the same thing, and if it'll conflict with the real "conversation_id".

Seriously, thanks for making me think more deeply about this --- I appreciate your attention to detail! The short answer is that, in every instance of 'conversation_num' that we use in the code, it is referring to the unique identifier of a conversation. The meaning of the column is exactly the same.

With regard to the line that you are referring to, it's part of preprocess_conversation_columns, and this is the full function:

def preprocess_conversation_columns(df, conversation_id = None, cumulative_grouping = False, within_task = False):
    """
    Preprocesses conversation data by removing special characters from column names and assigning a conversation number.

    :param df: The DataFrame containing conversation data.
    :type df: pd.DataFrame
    :param conversation_id: The column name to use for assigning conversation numbers. Must be one of {'gameId', 'roundId', 'stageId'}.
    :type conversation_id: str, optional
    :param cumulative_grouping: Whether to group data cumulatively based on the conversation_id.
    :type cumulative_grouping: bool, optional
    :param within_task: Used only if cumulative_grouping is True, to specify if grouping is within the task.
    :type within_task: bool, optional
    :return: The preprocessed DataFrame with a conversation number column.
    :rtype: pd.DataFrame
    """

    # remove all special characters from df
    df.columns = df.columns.str.replace('[^A-Za-z0-9_]', '', regex=True)

    # If data is grouped by batch/round, add a conversation num
    if {'batch_num', 'round_num'}.issubset(df.columns):
        df['conversation_num'] = df.groupby(['batch_num', 'round_num']).ngroup()
        df = df[df.columns.tolist()[-1:] + df.columns.tolist()[0:-1]] # make the new column first
    if ({'gameId', 'roundId', 'stageId'}.issubset(df.columns) and conversation_id in {'gameId', 'roundId', 'stageId'}):
        if(cumulative_grouping):
            df = create_cumulative_rows(df, conversation_id, within_task)
            df['conversation_num'] = df['cumulative_Id'] # set it to be the cumulative grouping
        else:
            df['conversation_num'] = df[conversation_id] # set it to the desired grouping

    return(df)

The second half of the function essentially handles a special case. Imagine that you have multiple grouping variables in your data; for example, the columns {'gameId', 'roundId', 'stageId'} or {'batch_num', 'round_num'}. In this case, you do not have a single conversation identifier, perhaps because the same group had multiple conversations (e.g., the same "batch" of people spoke together in multiple "rounds," or the same group of people in a game played several "rounds," each of which had several "stages").

In such a case, you have three possible options:

  1. Choose one item in the list as the conversation identifier and ignore the others. This is the purpose of the line df['conversation_num'] = df[conversation_id]; we are saying that we just treat one of the columns as our main identifier, which has always been named conversation_num, and we move on. This line of code would be rendered obsolete by the new ability to pass in a custom name for the conversation ID; it was previously necessary because the only conversation ID we were looking for was conversation_num.
  2. Group by the identifiers. Basically, we are saying: give me each unique combination of the list of grouping variables, and treat that combination as an identifier. For example, we can do something like: df.groupby(['batch_num', 'round_num']).ngroup().
  3. Group by the identifiers, but do so in a more cumulative way. This is the behavior that, as discussed, we may need to deprecate, because it is so specific to the original data that I wrote this function for. Essentially, the code assumes that within each gameId, several roundIds occur, and within each roundId, several stageIds occur, and it all happens sequentially. Therefore, we can do what I called a "cumulative grouping," where basically we group within the nested level, but include all chats that occurred up to a given point in time; for example, if you are in Round 1, Stage 2, your cumulative grouping would also include a duplication of chats from Round 1, Stage 1.

Hopefully this explanation makes it clear why those lines exist and provides a good way forward. To summarize: every "conversation_num" in the code refers to the same thing (the unique identifier of a conversation); option 1 becomes obsolete once users can name the conversation ID column themselves; option 2 is the behavior we should retain for lists of grouping keys; and option 3 is the dataset-specific behavior that we may want to deprecate.

xehu commented 1 week ago

Per this issue, I have 2 possible solutions in mind.

define 2 variables: timestamp_col: str = 'timestamp' and startend: bool = False. The default values represent single timestamp. If startend = True, the 2 timestamp cols will be created as f'{timestamp_col}_start and f'{timestamp_col}_end, so by default: 'timestamp_start' and 'timestamp_end'. It doesn't allow users to define 2 timestamps separately, but I assume no one will give unrelated start/end column names. define 1 variable: timestamp_cols: str ='timestamp'. If 2 timestamps are needed, users can specify timestamp_cols='timestamp_start, timestamp_end', and we can do timestamp_cols.split(',') to check the number of columns. Alternatively, it can be defined as a list. The downside is the scenario that more than 2 columns are found, which can be prevented by raising errors. Moreover, the order must be stated in the documentation.

Yes, all this completely makes sense! I actually slightly prefer the second option, where the user either passes in a single string (or a list with 1 item?) or a list with exactly 2 items, one for the start string name and one for the end string name. I think it is a little more robust and makes fewer assumptions. What if the user decides to give their timestamps crazy names? or uses other naming patterns such as:

  • time_START, time_END
  • timestamp_beginning, timestamp_ending
  • started_speaking, finished_speaking

.... etc. So I feel like I'd rather just have the user tell us what to look for. But my strong opinions are always open to persuasion --- and I get that raising an error + documentation will be annoying. My take on it is this: in my experience, I ALWAYS have to look up whether functions like re.sub take their arguments as (pattern, repl) or (repl, pattern); this happens, and it's not the end of the world, and it's better to make fewer assumptions.

sundy1994 commented 1 week ago

Yes, all this completely makes sense! I actually slightly prefer the second option, where the user either passes in a single string (or a list with 1 item?) or a list with exactly 2 items, one for the start string name and one for the end string name. I think it is a little more robust and makes fewer assumptions. What if the user decides to give their timestamps crazy names? or uses other naming patterns such as:

  • time_START, time_END
  • timestamp_beginning, timestamp_ending
  • started_speaking, finished_speaking

.... etc. So I feel like I'd rather just have the user tell us what to look for. But my strong opinions are always open to persuasion --- and I get that raising an error + documentation will be annoying. My take on it is this: in my experience, I ALWAYS have to look up whether functions like re.sub take their arguments as (pattern, repl) or (repl, pattern); this happens, and it's not the end of the world, and it's better to make fewer assumptions.

Sounds reasonable to me! I'll stick to one variable, timestamp_cols, which can be either a string or a list.
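Here's a minimal sketch of the normalization I have in mind (normalize_timestamp_cols is a hypothetical helper):

def normalize_timestamp_cols(timestamp_cols):
    # Accept a single column name, or a [start, end] pair; reject anything longer.
    if isinstance(timestamp_cols, str):
        return [timestamp_cols]
    cols = list(timestamp_cols)
    if len(cols) not in (1, 2):
        raise ValueError(
            f"timestamp_cols must be one column name or a [start, end] pair; got {cols}"
        )
    return cols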

One quick question: why do we calculate time_diff twice in get_temporal_features()?

xehu commented 1 week ago

Haha great catch, I think that's actually just a bug, making a PR to fix that now!