Noting that we actually also have a hard-coded check in get_temporal_features
that checks whether we have timestamp_start and timestamp_end columns:
if {'timestamp'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff(self.chat_data,"timestamp")
elif {'timestamp_start', 'timestamp_end'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff_startend(self.chat_data)
if {'timestamp'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff(self.chat_data,"timestamp")
elif {'timestamp_start', 'timestamp_end'}.issubset(self.chat_data.columns):
    self.chat_data["time_diff"] = get_time_diff_startend(self.chat_data)
We will need to build this into our check of the timestamp column, and identify whether a user is passing in a single timestamp, or two columns (one for when the timestamp starts, and one for when it ends).
@xehu I was adding flexible columns and found that in several files, such as https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/calculate_conversation_level_features.py, we append 'conversation_num' when initializing: self.input_columns.append('conversation_num'). However, to enable customized column names I added self.conversation_id_col = conversation_id_col, which also defaults to 'conversation_num'. Is there a specific reason to manually add this column? Should I remove this line?
@sundy1994 The answer to this question comes from where self.input_columns originates. Basically, there is an assumption that any other columns that come with the chat-level dataframe may be additional inputs or dependent variables that come from the user, not us. For example, imagine that I have data with the following columns:
Only the four columns marked with an asterisk (*) would be consumed and summarized by our system. We basically want to save the names of the additional columns that came in with the "original" data and ensure that, in later stages, we do not summarize them as part of the conversation-level features --- because our system didn't generate them, and they are likely unrelated to the conversation!
So, this explains lines 93-94 in feature_builder.py:
# Input columns are the columns that come in the raw chat data
self.input_columns = self.chat_data.columns
And this also explains the following lines:
# Denotes the columns that can be summarized from the chat level, onto the conversation level.
self.input_columns = list(input_columns)
self.input_columns.append('conversation_num')
self.columns_to_summarize = [column for column in self.chat_data.columns \
if (column not in self.input_columns) and pd.api.types.is_numeric_dtype(self.chat_data[column])]
self.summable_columns = ["num_words", "num_chars", "num_messages"]
Basically, these lines of code are saying: any numeric column that was not among the original inputs must have been generated by our system, and is therefore eligible to be summarized at the conversation level.
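For instance, here is a hypothetical illustration of that logic (made-up column names, not repo code; the numeric-dtype check is omitted for brevity):

# Suppose the raw chat data arrived with one extra user-supplied DV column:
input_columns = ['conversation_num', 'message', 'speaker_nickname', 'dv_score']

# After featurization, the chat-level dataframe might contain:
chat_data_columns = ['conversation_num', 'message', 'speaker_nickname',
                     'dv_score', 'num_words', 'num_chars']

# Only columns we generated are eligible for conversation-level summaries:
columns_to_summarize = [c for c in chat_data_columns if c not in input_columns]
# -> ['num_words', 'num_chars']; the user's 'dv_score' is left alone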
So, to answer your question more succinctly: if we eventually retain the feature in which it is possible to pass in multiple grouping keys and have the system autogenerate a grouping key, then we DO need to retain this line of code.
Thanks @xehu for the explanation! I see why we need it now.
My confusion was that we have been assuming "conversation_num" exists in self.chat_data, which is now conversation_id_col in my code, but it defaults to "conversation_num" too. As you mentioned, input_columns comes from self.chat_data.columns, so it seems there'll be 2 "conversation_num"s in input_columns after appending. So I guess doing this would make more sense:
if 'conversation_num' not in self.input_columns:
    self.input_columns.append('conversation_num')
What's more, in preprocess_conversation_columns() from preprocess.py, apparently the column "conversation_num" is added from grouping. I understand this happens when conversation_id is in {'gameId', 'roundId', 'stageId'} or when {'batch_num', 'round_num'} are present, but it looks like "conversation_num" is really a special column that we are generating in both scenarios. Lines like df['conversation_num'] = df[conversation_id] make me doubt whether I should default conversation_id_col to "conversation_num".
We generate new columns all the time, such as "message_lower_with_punc", which is very natural. But I think these special names shouldn't be accepted if we want users to customize some column names, at least those that are similar. For now I just want to clarify whether those "conversation_num"s are the same thing, and whether they'll conflict with the real "conversation_id".
Per this issue, I have 2 possible solutions in mind.
1. Define 2 variables: timestamp_col: str = 'timestamp' and startend: bool = False. The default values represent a single timestamp. If startend = True, the 2 timestamp columns will be created as f'{timestamp_col}_start' and f'{timestamp_col}_end', so by default: 'timestamp_start' and 'timestamp_end'. This doesn't allow users to define the 2 timestamps separately, but I assume no one will give unrelated start/end column names.
2. Define 1 variable: timestamp_cols: str = 'timestamp'. If 2 timestamps are needed, users can specify timestamp_cols='timestamp_start, timestamp_end', and we can do timestamp_cols.split(',') to check the number of columns. Alternatively, it can be defined as a list. The downside is the scenario where more than 2 columns are given, which can be prevented by raising errors. Moreover, the order must be stated in the documentation.
I prefer the first solution personally. Does either of these make sense to you?
So I guess doing this would make more sense: if 'conversation_num' not in self.input_columns: self.input_columns.append('conversation_num')
Yup, adding the 'if' statement makes sense to me; I think it's cleaner to add it to the list only if we had to generate it ourselves.
I understand it's when conversation_id in {'gameId', 'roundId', 'stageId'} or {'batch_num', 'round_num'}
Technically, and more broadly, it's when the conversation_id is not one column but a list of columns, and you have to group by multiple things. The two use cases from my previous code happen to be the ones you described, but the problem is slightly more general. The resulting column is conversation_num, which represents the group index you get when you do a pandas groupby (passing in the list of columns you're grouping by) and call .ngroup().
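To illustrate with a toy example (not repo code):

import pandas as pd

df = pd.DataFrame({
    'batch_num': [0, 0, 1, 1],
    'round_num': [1, 2, 1, 1],
})
# ngroup() assigns one integer per unique (batch_num, round_num) combination
df['conversation_num'] = df.groupby(['batch_num', 'round_num']).ngroup()
print(df['conversation_num'].tolist())  # [0, 1, 2, 2]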
We generate new columns all the time, such as "message_lower_with_punc", which is very natural. But I think these special names shouldn't be accepted if we want users to customize some column names, at least those that are similar.
I think this is an excellent point. The nice thing about the dataframe is that --- I believe --- whatever the user is passing in, even if they pass in something that has an extra column called "message_lower_with_punc," we will overwrite it, so it doesn't affect the functionality of the system (but it may disappoint the user). Perhaps we can add an easy check for this? Just look to see if the special column names we generate exist in the dataframe the user passed in, and simply print a warning that says something like: "WARNING: the original dataframe contains columns that have identical names as some columns generated by the featurizer. These names will be overwritten in the output data."
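A minimal sketch of such a check (the function name and the reserved-name set are hypothetical; the real list would come from wherever the featurizer defines its generated columns):

GENERATED_COLUMNS = {'conversation_num', 'message_lower_with_punc', 'time_diff'}

def warn_on_column_collisions(df):
    # warn if the user's data already uses names the featurizer will generate
    overlap = GENERATED_COLUMNS.intersection(df.columns)
    if overlap:
        print("WARNING: the original dataframe contains columns that have "
              "identical names as some columns generated by the featurizer "
              f"({sorted(overlap)}). These names will be overwritten in the "
              "output data.")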
Another way of preventing this from happening (or, rather, making it less likely) is to prepend our generated columns with something like "SYS." so that our columns become "SYS.message_lower_with_punc," denoting them as something generated by our system. And if the user just so happens to have a column with such a strange name, we could provide them with the warning and move on.
A final way, which I prefer a little less, is to be more aggressive --- we drop all other columns except the four we need, or force the user to provide data with EXACTLY the four input columns. I actually find this slightly less convenient as a social scientist; in general, I like my independent and dependent variables to be in the same dataframe, and often the additional columns are something like dependent variables associated with the groups. It would add additional hassle to force the user to merge the DVs back in. Right now, the workflow allows the user to generate features for their communication data and then explore relationships with other variables of interest directly in the output dataframe. I like that aspect of the design.
Lines like df['conversation_num'] = df[conversation_id] make me doubt whether I should default conversation_id_col to "conversation_num".
For now I just want to clarify whether those "conversation_num"s are the same thing, and whether they'll conflict with the real "conversation_id".
Seriously, thanks for making me think more deeply about this --- I appreciate your attention to detail! The short answer is that, in every instance of 'conversation_num' that we use in the code, it is referring to the unique identifier of a conversation. The meaning of the column is exactly the same.
With regard to the line that you are referring to, it's part of preprocess_conversation_columns, and this is the full function:
def preprocess_conversation_columns(df, conversation_id = None, cumulative_grouping = False, within_task = False):
    """
    Preprocesses conversation data by removing special characters from column names and assigning a conversation number.

    :param df: The DataFrame containing conversation data.
    :type df: pd.DataFrame
    :param conversation_id: The column name to use for assigning conversation numbers. Must be one of {'gameId', 'roundId', 'stageId'}.
    :type conversation_id: str, optional
    :param cumulative_grouping: Whether to group data cumulatively based on the conversation_id.
    :type cumulative_grouping: bool, optional
    :param within_task: Used only if cumulative_grouping is True, to specify if grouping is within the task.
    :type within_task: bool, optional
    :return: The preprocessed DataFrame with a conversation number column.
    :rtype: pd.DataFrame
    """
    # remove all special characters from df
    df.columns = df.columns.str.replace('[^A-Za-z0-9_]', '', regex=True)
    # If data is grouped by batch/round, add a conversation num
    if {'batch_num', 'round_num'}.issubset(df.columns):
        df['conversation_num'] = df.groupby(['batch_num', 'round_num']).ngroup()
        df = df[df.columns.tolist()[-1:] + df.columns.tolist()[0:-1]] # make the new column first
    if ({'gameId', 'roundId', 'stageId'}.issubset(df.columns) and conversation_id in {'gameId', 'roundId', 'stageId'}):
        if (cumulative_grouping):
            df = create_cumulative_rows(df, conversation_id, within_task)
            df['conversation_num'] = df['cumulative_Id'] # set it to be the cumulative grouping
        else:
            df['conversation_num'] = df[conversation_id] # set it to the desired grouping
    return(df)
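For example, with hypothetical data that contains the gameId/roundId/stageId columns, one could call:

df = preprocess_conversation_columns(df, conversation_id='stageId')
# df now carries a 'conversation_num' column equal to df['stageId']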
The second half of the function essentially handles a special case. Imagine that you have multiple grouping variables in your data; for example, three columns ({'gameId', 'roundId', 'stageId'}) or two columns ({'batch_num', 'round_num'}). In this case, you do not have a single conversation identifier, perhaps because the same group had multiple conversations (e.g., the same "batch" of people spoke together in multiple "rounds," or the same group of people in a game played several "rounds," each of which had several "stages").
In such a case, you have 3 possible options:
1. Fall back to one of the columns as the main identifier: df['conversation_num'] = df[conversation_id]. Here, we are saying that we just treat one of the columns as our main identifier, which has always been named conversation_num, and we move on. This line of code would be rendered obsolete by the new ability to pass in a custom name for the conversation ID; it was previously necessary because the only conversation ID we were looking for was conversation_num.
2. Group by the full list of columns and autogenerate an identifier: df.groupby(['batch_num', 'round_num']).ngroup().
3. Use a "cumulative grouping." Within each gameId, several roundIds occur, and within each roundId, several stageIds occur, and it all happens sequentially. Therefore, we can do what I called a "cumulative grouping," where basically we group within the nested level, but include all chats that occurred up to a given point in time; for example, if you are in Round 1, Stage 2, your cumulative grouping would also include a duplication of chats from Round 1, Stage 1 (see the sketch after this list).
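Here is a toy sketch of what cumulative grouping produces (illustrative only; the repo's create_cumulative_rows may differ in detail):

import pandas as pd

chats = pd.DataFrame({
    'roundId': [1, 1, 1],
    'stageId': [1, 1, 2],
    'message': ['hi', 'hello', 'ready?'],
})

cumulative = []
for (r, s), _ in chats.groupby(['roundId', 'stageId']):
    # each (round, stage) grouping also duplicates all chats from
    # earlier stages of the same round
    block = chats[(chats['roundId'] == r) & (chats['stageId'] <= s)].copy()
    block['cumulative_Id'] = f'round{r}_stage{s}'
    cumulative.append(block)
cumulative = pd.concat(cumulative, ignore_index=True)
# 'hi' and 'hello' (Round 1, Stage 1) appear twice: once for the Stage 1
# grouping, and again duplicated into the Round 1, Stage 2 grouping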
Hopefully this explanation makes it clear why those lines exist and provides a good way forward. To summarize:
- If the user passes in a list of grouping keys, we should autogenerate a conversation_num that groups by the list (basically, Option 2 above). Then we would manually set the conversation_id to the generated variable.
- Yes, conversation_num is always the conversation identifier. I think all of the confusion can be attributed to the fact that I wanted to handle the case (in two specific datasets) where there was more than one conversational identifier. The new behavior (allowing the user to set any column as the identifier; perhaps adding back the option to group by a list?) should resolve this.
Yes, all this completely makes sense! I actually slightly prefer the second option, where the user either passes in a single string (or a list with 1 item?) or a list with exactly 2 items, one for the start string name and one for the end string name. I think it is a little more robust and makes fewer assumptions. What if the user decides to give their timestamps crazy names? or uses other naming patterns such as:
- time_START, time_END
- timestamp_beginning, timestamp_ending
- started_speaking, finished_speaking
.... etc. So I feel like I'd rather just have the user tell us what to look for. But my strong opinions are always open to persuasion --- and I get that raising an error + documentation will be annoying. My take on it is this: in my experience, I ALWAYS have to look up whether certain functions take their arguments as re.sub(pattern, repl, ...) or (repl, pattern, ...); this happens, and it's not the end of the world, and it's better to make fewer assumptions.
Sounds reasonable to me! I'll stick to one variable, timestamp_cols, which can be either a string or a list.
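A minimal sketch of that interface (function name and error message are illustrative, not final):

def resolve_timestamp_cols(timestamp_cols):
    """Return a list of one (single timestamp) or two (start, end) column names."""
    if isinstance(timestamp_cols, str):
        return [timestamp_cols]    # e.g. 'timestamp'
    if isinstance(timestamp_cols, list) and len(timestamp_cols) in (1, 2):
        return timestamp_cols      # e.g. ['started_speaking', 'finished_speaking']
    raise ValueError("timestamp_cols must be a single column name or a list "
                     "of [start_column, end_column].")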
One quick question: why do we calculate time_diff twice in get_temporal_features()?
Haha great catch, I think that's actually just a bug, making a PR to fix that now!
Basic Requirements
The system generally requires three key columns in order to process a conversation:
- a conversation ID, or unique identifier of a conversation;
- a message column, containing the text of each utterance;
- a user/speaker column, containing the identifier of the person making the utterance.
Optionally, the user can also provide a timestamp column, which can be flexibly formatted as a DateTime object, string, integer (Unix Time), or number (e.g., number of seconds elapsed since the beginning of a conversation).
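For instance, these are the kinds of timestamp values this implies accepting (parsing shown with pandas; the repo's actual handling may differ):

import pandas as pd

pd.to_datetime("2023-05-01 12:00:00")  # string -> DateTime
pd.to_datetime(1682942400, unit="s")   # integer Unix time -> DateTime
elapsed = 42.5                         # plain numbers can stand for seconds
                                       # elapsed since the conversation began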
Advanced Requirements
In more advanced cases, users may have multiple conversation identifiers. For example, consider a case where the same team "meets" multiple times and has multiple discussions. There may be one team ID that identifies the same team, and another session ID that identifies each distinct discussion.
Users may therefore want to group by different columns. In some analyses, users may wish to analyze all conversations by the same team as a single "unit" (grouping only by the team ID). In other analyses, users may wish to examine conversations at the unit of a single "meeting" (grouping by the team ID AND the session ID). In some cases, users want to consider all the sessions "up until the present" (grouping the conversation such that we group by both team ID and session ID, but all the chats in a conversation are cumulative).
All of these options should be included as additional supported parameters; we may expect something like the following:
- grouping_keys (list): a list of conversation keys that we are grouping the conversation by (and this could have multiple names --- session ID, round ID, stage ID, and others are all possibilities)
- cumulative_groupings (boolean): a flag for whether the grouping is cumulative or not.
Getting Started
Here are a couple pointers for getting started on this Issue:
1. Identify places in the code that use specific column names.
The first step to addressing this Issue is to document places in the code in which we have hard-coded specific column names. For example, the assert_key_columns_present function in preprocess.py (https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/preprocess.py) specifically checks for certain column names.
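A sketch of the kind of check involved (the exact implementation in the repo may differ; the three names are the hard-coded columns discussed throughout this thread):

def assert_key_columns_present(df):
    # hard-coded column names: the core of what this Issue aims to generalize
    for col in ('conversation_num', 'message', 'speaker_nickname'):
        assert col in df.columns, f"Missing required column: {col}"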
Other parts of the script also apply similar logic to specific grouping keys (e.g., batch_num and round_num); for example, in preprocess_conversation_columns (quoted in full earlier in this thread).
Moreover, as we calculate chat level features, we are looking for the specific column, message: https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/calculate_chat_level_features.py
And when the features are summarized, we are looking for conversation_num as the specific grouping key: https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/calculate_conversation_level_features.py
This even appears in many features. In word_mimicry.py, we use the column conversation_num to measure the extent to which people speak similarly within a given conversation (https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/features/word_mimicry.py). Similarly, in turn_taking_features.py, we specifically look for the column speaker_nickname when grouping turns by a speaker's ID: https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/features/turn_taking_features.py
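The recurring pattern looks roughly like this (illustrative toy data, not the repo's exact code):

import pandas as pd

df = pd.DataFrame({
    'conversation_num': [0, 0, 1],
    'speaker_nickname': ['a', 'b', 'a'],
    'message':          ['hi there', 'hi!', 'bye'],
})

# word_mimicry-style: operate within each conversation, keyed by the
# hard-coded name 'conversation_num'
messages_per_convo = df.groupby('conversation_num')['message'].apply(list)

# turn_taking-style: count utterances per speaker via the hard-coded
# name 'speaker_nickname'
turns_per_speaker = df.groupby(['conversation_num', 'speaker_nickname']).size()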
2. Clean up the FeatureBuilder Constructor.
This is the current FeatureBuilder Constructor in feature_builder.py (https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/feature_builder.py). We already have an option in which we are passing in conversation_id, and the cumulative_grouping and within_task options are initial steps towards implementing the advanced requirement (of being able to have different options with grouping). However, they are largely designed with a specific dataset in mind (and that dataset had specific keys, like stageId, roundId, and gameId), and even though we pass conversation_id in as a parameter, we end up using the hardcoded string "message" throughout the codebase.
Thus, while we have something of a start here, addressing this Issue will require designing thoughtful parameters that capture more generalizable use cases.
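One possible shape for such parameters, combining the ideas in this thread (entirely illustrative; the parameter names are not final):

class FeatureBuilder:
    def __init__(
        self,
        input_df,                                    # raw chat-level data
        conversation_id_col: str = 'conversation_num',
        speaker_id_col: str = 'speaker_nickname',
        message_col: str = 'message',
        timestamp_cols = 'timestamp',                # str, or [start_col, end_col]
        grouping_keys: list = None,                  # e.g. ['gameId', 'roundId', 'stageId']
        cumulative_grouping: bool = False,
        within_task: bool = False,
    ):
        ...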
3. Follow the logic through the rest of the system.
Once we update the parameters in the constructor, what's next would be following the logic of the FeatureBuilder (e.g., preprocessing, calculating the vectors) and ensuring that the changes flow through the rest of the system. For example, rather than checking for specific column names in preprocess.py, we'd now check that the parameters are non-empty and point to valid columns (containing the conversation identifier, speaker identifier, and messages).
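For example, the earlier hard-coded assertion could become something like the following (a sketch; parameter names are hypothetical):

def assert_key_columns_present(df, conversation_id_col, speaker_id_col, message_col):
    # verify the user-supplied parameters are non-empty and map to real columns
    for param, col in [('conversation_id_col', conversation_id_col),
                       ('speaker_id_col', speaker_id_col),
                       ('message_col', message_col)]:
        if not col or col not in df.columns:
            raise ValueError(f"{param}={col!r} is empty or not a column in the data.")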