Open xehu opened 1 year ago
Chunking logically based on time is really important for some of the work I have done. Using LIWC, I have had to insert special characters to split on, which is a pain. This would be a great feature! Another way of thinking about this (which might be easier to implement) would be to let users define chunks of text to be analyzed separately by including a new column for time period, populated with either numbers (1, 2, 3) or labels (practice, performance 1, performance 2, etc.). For example, I have text from lab experiments, and I may want all the conversation from the practice period to be analyzed separately from the performance period. Those periods may not last the same amount of time for each group, so some label might be necessary; hence the extra column.
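To make the label-column idea concrete, here is a minimal sketch in pandas. It is not existing FeatureBuilder code: the function name `split_by_period` and the column names `conversation_num` and `period` are hypothetical, chosen only for illustration.

```python
import pandas as pd


def split_by_period(chat_df, period_col="period", conversation_id_col="conversation_num"):
    """Split a chat DataFrame into one chunk per (conversation, period) pair.

    Both column names are assumptions for this sketch. `period_col` may hold
    numbers (1, 2, 3) or labels ("practice", "performance 1", ...).
    """
    # groupby with a list of keys yields (conversation, period) tuples as keys.
    grouped = chat_df.groupby([conversation_id_col, period_col], sort=False)
    return {key: chunk for key, chunk in grouped}


chats = pd.DataFrame(
    {
        "conversation_num": [1, 1, 1, 2],
        "period": ["practice", "practice", "performance 1", "practice"],
        "message": ["hi", "ready?", "go", "hello"],
    }
)
chunks = split_by_period(chats)
```

Each chunk could then be analyzed on its own, so periods of different lengths across groups would not need to be aligned by elapsed time.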
Currently, we have some code in the featurebuilder that allows us to analyze only the first X% of chats, where X is defined relative to the number of chats in the conversation:
Code initializing the FeatureBuilder class:
Then, in `get_first_pct_of_chat()`, we truncate conversations to the first X%.

This issue proposes doing so using the time in the conversation, rather than the percentage of chats. That is, similar to how we have now implemented chunking based on time (see: https://github.com/Watts-Lab/team-process-map/issues/139), we would look at the first X% (time-wise) of a conversation, using the timestamps.
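As a rough sketch of what the time-based version could look like (not the existing implementation): the function name `get_first_pct_of_time` and the column names `conversation_num` and `timestamp` are assumptions, and timestamps are assumed to be parseable by `pd.to_datetime`.

```python
import pandas as pd


def get_first_pct_of_time(chat_df, pct, conversation_id_col="conversation_num",
                          timestamp_col="timestamp"):
    """Keep messages sent within the first `pct` (0 to 1) of each conversation's duration.

    Hypothetical helper; the column names are assumptions for this sketch.
    """
    if not 0 <= pct <= 1:
        raise ValueError("pct must be between 0 and 1")

    timestamps = pd.to_datetime(chat_df[timestamp_col])
    per_conversation = timestamps.groupby(chat_df[conversation_id_col])

    # First and last timestamp of each conversation, broadcast back to every row.
    start = per_conversation.transform("min")
    end = per_conversation.transform("max")

    # Cutoff is `pct` of the way through the conversation's elapsed time.
    cutoff = start + (end - start) * pct
    return chat_df[timestamps <= cutoff]


chats = pd.DataFrame(
    {
        "conversation_num": ["A", "A", "A", "B", "B"],
        "timestamp": [
            "2023-01-01 00:00:00",
            "2023-01-01 00:01:00",
            "2023-01-01 00:10:00",
            "2023-01-01 00:00:00",
            "2023-01-01 00:04:00",
        ],
        "message": ["a1", "a2", "a3", "b1", "b2"],
    }
)
first_half = get_first_pct_of_time(chats, 0.5)
```

In this toy data, conversation A lasts 10 minutes, so the 50% cutoff falls at minute 5 and keeps two of its three messages, whereas a count-based 50% would cut at a different point. The cutoff here is inclusive; whether it should be is a design choice left open.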