Watts-Lab / team_comm_tools

An open-source Python library that turns multiparty conversational data into social-science backed features.
https://teamcommtools.seas.upenn.edu/
MIT License

Use Time-Based Chunking to Calculate Features #167

Open xehu opened 1 year ago

xehu commented 1 year ago

Currently, the FeatureBuilder has some code that lets us analyze only the first X% of chats, where X is defined relative to the number of chats in the conversation:

Code initializing the FeatureBuilder class:

    def __init__(
            self, 
            input_file_path: str, 
            output_file_path_chat_level: str, 
            output_file_path_user_level: str,
            output_file_path_conv_level: str,
            analyze_first_pct: float=1.0
        ) -> None:

Then, in get_first_pct_of_chat(), we truncate conversations to the first X%:

    def get_first_pct_of_chat(self) -> None:
        """
            This function truncates each conversation to the first X% of rows.
        """
        chat_grouped = self.chat_data.groupby('conversation_num')
        num_rows_to_retain = pd.DataFrame(np.ceil(chat_grouped.size() * self.first_pct)).reset_index()
        chat_truncated = pd.DataFrame()
        for conversation_num, num_rows in num_rows_to_retain.itertuples(index=False):
            chat_truncated = pd.concat([chat_truncated,chat_grouped.get_group(conversation_num).head(int(num_rows))], ignore_index = True)

        self.chat_data = chat_truncated

This issue proposes truncating by time in the conversation rather than by the percentage of chats. That is, similar to how we have now implemented time-based chunking (see: https://github.com/Watts-Lab/team-process-map/issues/139), we would use the timestamps to keep the first X% of a conversation, time-wise.
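As a rough illustration of the proposal, here is a minimal sketch of a time-based counterpart to get_first_pct_of_chat(). It is not code from the repository: the function name get_first_pct_of_chat_by_time, the standalone (non-method) form, and the 'timestamp' column name are all assumptions, and it assumes the timestamps can be parsed by pd.to_datetime.

```python
import pandas as pd

def get_first_pct_of_chat_by_time(
        chat_data: pd.DataFrame,
        first_pct: float,
        conversation_col: str = "conversation_num",  # column name taken from the existing code
        timestamp_col: str = "timestamp",            # hypothetical column name
    ) -> pd.DataFrame:
    """
    Keep, for each conversation, only the rows whose timestamp falls within
    the first `first_pct` (0.0 to 1.0) of that conversation's elapsed time.
    """
    # Assumption: timestamps are in a format pd.to_datetime can parse
    timestamps = pd.to_datetime(chat_data[timestamp_col])
    grouped = timestamps.groupby(chat_data[conversation_col])

    # Per-row start and end time of the conversation the row belongs to
    start = grouped.transform("min")
    end = grouped.transform("max")

    # Cutoff = start + first_pct of the conversation's total duration
    cutoff = start + (end - start) * first_pct

    return chat_data[timestamps <= cutoff].reset_index(drop=True)
```

Unlike the row-based version, a long pause late in a conversation changes which chats are retained: a conversation with messages at minutes 0, 1, 2, and 10 keeps three messages at first_pct=0.5 here, versus two when truncating by row count.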

jonkush commented 2 months ago

Chunking logically based on time is really important for some of the stuff I have done, and with LIWC I have had to insert special characters to split on, which is a pain. This would be a great feature! A separate way of thinking about this (which might be easier to implement) would be to let users define chunks of text to be analyzed separately by including a new column for time period, populated with either numbers (1, 2, 3) or labels (practice, performance 1, performance 2, etc.). For example, I have text from lab experiments, and I may want all the conversation from the practice period to be analyzed separately from the performance period. Those periods may not last the same amount of time for each group, so some label might be necessary, hence the extra column.
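
To make the label-column idea concrete, here is a small sketch of how user-defined periods could be split out with pandas. It only illustrates the suggestion and is not an existing feature of the library: the function name split_by_period and the 'time_period' column name are hypothetical.

```python
import pandas as pd

def split_by_period(
        chat_data: pd.DataFrame,
        conversation_col: str = "conversation_num",  # column name taken from the existing code
        period_col: str = "time_period",             # hypothetical user-supplied column
    ) -> dict:
    """
    Split the chat data into one DataFrame per (conversation, period) pair,
    so that each user-labeled period can be analyzed separately.
    """
    chunks = {}
    # sort=False keeps groups in the order they first appear in the data
    for (conversation, period), chunk in chat_data.groupby(
            [conversation_col, period_col], sort=False):
        chunks[(conversation, period)] = chunk.reset_index(drop=True)
    return chunks
```

Because the periods are labels rather than fixed durations, a group whose practice period ran long would still have its chats assigned to the right chunk.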