Watts-Lab / team_comm_tools

An open-source Python library that turns multiparty conversational data into social-science backed features.
https://teamcommtools.seas.upenn.edu/
MIT License

🕵🏻‍♀️Team Process Mapping Weekly Updates #5

Closed xehu closed 9 months ago

xehu commented 2 years ago

Week of June 29

Lots of work is currently in progress on all fronts! You all saw a pretty up-to-date summary of our work at the AdBoard meeting, but here is a quick rundown of what we are working on:

  1. New Features: Discursive diversity has been merged in, a new BERT-backed emotion feature is about to be merged, and topic-over-time features are in progress.
  2. Modeling Updates: Our baseline random forest models are now doing much better than before (R^2 = 0.19), and Yashveer has been working on a custom neural network architecture to see if we can boost performance and better integrate our domain knowledge. We plan to use the layered structure of the NN to represent what we know from behavioral theory about the relationships between features, so this approach seems promising (a rough sketch of the idea follows below).
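As a quick illustration of what we mean by encoding feature relationships in the network structure, here is a minimal sketch (not our actual implementation; the feature groupings and layer sizes are placeholders): each theorized construct gets its own small sub-network, and the construct-level representations are then combined for the final prediction.

```python
# Minimal sketch of a "theory-structured" neural network (hypothetical groupings).
import torch
import torch.nn as nn

# Hypothetical mapping from behavioral constructs to feature columns.
FEATURE_GROUPS = {
    "positivity": ["positive_words", "positivity_zscore"],
    "pace":       ["num_messages", "avg_gap_between_messages"],
    "politeness": ["gratitude", "hedges", "please"],
}

class TheoryStructuredNet(nn.Module):
    def __init__(self, feature_groups, hidden_per_group=4):
        super().__init__()
        self.groups = feature_groups
        # One small sub-network per construct, so related features share a layer.
        self.group_nets = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(len(cols), hidden_per_group), nn.ReLU())
            for name, cols in feature_groups.items()
        })
        # Combine construct-level representations into a single performance score.
        self.head = nn.Linear(hidden_per_group * len(feature_groups), 1)

    def forward(self, inputs):
        # `inputs` is a dict: construct name -> tensor of shape (batch, n_features_in_group)
        reps = [self.group_nets[name](inputs[name]) for name in self.groups]
        return self.head(torch.cat(reps, dim=1)).squeeze(-1)

model = TheoryStructuredNet(FEATURE_GROUPS)
```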

Week of June 15

We've had a really productive week, and the three biggest modeling challenges that we are currently tackling are:

  1. Modeling features over time;
  2. Modeling topics of conversation (and how they vary over time);
  3. Modeling collections of features, by yoking conversational measurements to the underlying behavioral constructs they represent.

We're currently hard at work on all these fronts, but I happen to be writing this update a bit last-minute (since I'm also working on --- but not yet finished with --- the slides for our AdBoard meeting!).

Thus, to give you a taste of our progress, I want to show Shruti's AMAZING work on Discursive Diversity. She has been exploring breaking conversations down into meaningful chunks and seeing how discursive diversity changes across each "period" in the conversation. Below are some plots of Discursive Diversity for the Juries dataset, playing around with dividing the conversation into 2, 3, and 4 chunks:

[Plots: Discursive Diversity over time for the Juries dataset, with 2, 3, and 4 chunks]

As well as similar plots for the CSOP (II) dataset:

[Plots: Discursive Diversity over time for the CSOP (II) dataset]

These plots are interesting because they show the evolution of the discursive diversity feature over time (thus allowing us to measure the modulation of the feature --- a key predictor theorized in the original paper). We plan to apply a similar analysis to other features, and to examine how the topic of discussion evolves over the course of the conversation.

One area in which this analysis might be applicable is determining whether we need to "cut short" the conversation inputs --- for example, as @JamesPHoughton noted last week, positivity may be such a predictive feature because people are congratulating each other at the end (e.g., 'great work, team!') leading to label leakage on our prediction of interest (performance). By modeling features over time, we hope to find the features that are truly predictive of performance before the interaction is over.

Week of June 8

This update covers the last two weeks, as sadly I missed the last lab meeting due to a family emergency. However, I've been really grateful to have a wonderful and supportive team in Yashveer, Priya, Nikhil, and Shruti, and we have a truly exciting (and quite long) round of updates!

(Update 1 of 3) Modeling

On the modeling front, we've recently merged our full "baseline model" into the main branch, with a ton of credit to Yashveer for his incredible work on this part of the project.

This model accounts for:

Note that there are some pretty major upgrades since the last update:

The Multi-Task Joint Model

Showing our best-performing model: Random Forest

Metrics:

| | R2 | MAE | MSE | RMSE |
| --- | --- | --- | --- | --- |
| Train | 0.5139 | 0.4561 | 0.4959 | 0.7042 |
| Validation | 0.1056 | 0.7238 | 0.8814 | 0.9388 |
| Test | 0.114 | 0.6406 | 0.7266 | 0.8524 |
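(For anyone curious how these numbers get produced, here is a minimal sketch of the kind of evaluation we run; the synthetic data and column names below are placeholders rather than our real pipeline.)

```python
# Minimal sketch of fitting the baseline Random Forest and reporting metrics.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)

def fake_split(n):  # stand-in for our real conversation-level feature matrices
    X = pd.DataFrame(rng.normal(size=(n, 5)),
                     columns=[f"feature_{i}" for i in range(5)])
    y = X["feature_0"] * 0.5 + rng.normal(scale=0.5, size=n)
    return X, y

X_train, y_train = fake_split(300)
X_val,   y_val   = fake_split(80)
X_test,  y_test  = fake_split(80)

def report(name, y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    print(f"{name}: R2={r2_score(y_true, y_pred):.4f} "
          f"MAE={mean_absolute_error(y_true, y_pred):.4f} "
          f"MSE={mse:.4f} RMSE={np.sqrt(mse):.4f}")

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

report("Train", y_train, rf.predict(X_train))
report("Validation", y_val, rf.predict(X_val))
report("Test", y_test, rf.predict(X_test))
```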

[Screenshots: SHAP feature-importance plots for the Random Forest model]

Interpretation What types of communication predict team success? Turns out, it's important, above all, to be positive --- the average use of positive language is currently our model's top predictor. In a close second is being terse (a feature that's come up in previous versions of our model, too).

Among the highlights:

(Tangent...) How do we interpret those confusing SHAP values in the picture? At the last meeting, I got some questions about how to interpret the red and blue dots. I've done more reading on this, including the original NeurIPS paper where the SHAP approach was introduced. The tl;dr is that the SHAP value is calculated on a datapoint-by-datapoint basis: for each point in the dataset, we calculate each feature's contribution to the final prediction for that point.

I found this blog post and the author's talk helpful, too; in the diagram below, taken from the blog post, the x-axis shows the SHAP value (whether the feature pushes the prediction for that point up or down), and the coloring of the point shows the value of the feature itself (whether the feature value is high or low for that point).

So, for example, in the plot below, being a female in a lower pclass (e.g., 1st or 2nd, as opposed to 3rd), who paid a higher fare and had a younger age, makes a person more likely to survive the Titanic. Each one of these points would represent a single passenger.

[Screenshot: SHAP summary plot for the Titanic dataset, from the blog post]
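For reference, a plot like this can be generated in just a few lines; here's a sketch, assuming a fitted tree-based model `rf` (like the one in the earlier sketch) and a feature DataFrame `X`:

```python
# Sketch: SHAP summary ("beeswarm") plot for a fitted tree model.
# Assumes `rf` is a fitted RandomForestRegressor and `X` is the feature DataFrame.
import shap

explainer = shap.TreeExplainer(rf)       # fast, exact SHAP values for tree models
shap_values = explainer.shap_values(X)   # one row of contributions per datapoint

# x-axis: SHAP value (push on the prediction); color: the feature's own value.
shap.summary_plot(shap_values, X)
```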

Predicting and testing out-of-sample: a CSOP Case Study

Another fun analysis is to "zoom in" specifically on CSOP --- because I have two separate CSOP datasets. One is from Abdullah's paper with Mohammed Alsobay, and another is from Abdullah's paper with Nak Won Rim. This means that we can train a model on one CSOP (the one with Mohammed), and predict out-of-sample on a "natural" test set (the one with Nak Won Rim).

Showing our best-performing model: Random Forest

Metrics:

| | R2 | MAE | MSE | RMSE |
| --- | --- | --- | --- | --- |
| Train | 0.3905 | 0.5246 | 0.5881 | 0.7669 |
| Validation | 0.2249 | 0.7276 | 1.0108 | 1.0054 |
| Test | 0.1628 | 0.6971 | 0.8354 | 0.914 |

[Screenshots: SHAP feature-importance plots for the CSOP out-of-sample model]

Here, we get similarly interpretable features: it's good to be positive, and it's good to talk less. It's also specifically good to say more substantive things (use fewer "stopwords").

Notably, we're getting R^2 = 0.16 out-of-sample on a natural test set from an entirely different run of the experiment! I also fit a different version of the model in which I mix the data from the two experiments; there, the out-of-sample test R^2 goes up to 0.25.
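Here's a sketch of what the "train on one CSOP, test on the other" setup looks like in code; the dataset variable names are placeholders:

```python
# Sketch: out-of-sample evaluation across the two CSOP datasets.
# csop_alsobay and csop_rim are hypothetical conversation-level feature DataFrames,
# each with a standardized "performance" column.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

feature_cols = [c for c in csop_alsobay.columns if c != "performance"]

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(csop_alsobay[feature_cols], csop_alsobay["performance"])

# "Natural" test set: an entirely separate run of the experiment.
preds = rf.predict(csop_rim[feature_cols])
print("Out-of-sample R^2:", r2_score(csop_rim["performance"], preds))
```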

And some general explorations

Also, while on a 15-hour flight to China, I played around with trying to visualize the data in other ways --- for example, seeing how the PCA of the conversations maps to the PCA of the Task Map.

They actually look remarkably similar: [Screenshots: PCA projections of the conversation features and the Task Map features]

I'm still mulling over how exactly to quantify how strongly the task features and conversation features relate to each other. But I thought this was cool!
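For what it's worth, the side-by-side projections themselves are simple to produce; here's a sketch (the two feature matrices are placeholders for our conversation features and Task Map features):

```python
# Sketch: compare 2-D PCA projections of conversation features vs. Task Map features.
# conv_features and task_features are hypothetical DataFrames with one row per task.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def project(df):
    # Standardize, then project down to 2 dimensions.
    return PCA(n_components=2).fit_transform(StandardScaler().fit_transform(df))

conv_2d = project(conv_features)
task_2d = project(task_features)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(conv_2d[:, 0], conv_2d[:, 1])
axes[0].set_title("Conversation features (PCA)")
axes[1].scatter(task_2d[:, 0], task_2d[:, 1])
axes[1].set_title("Task Map features (PCA)")
plt.show()
```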

(Update 2 of 3) New Conversational Features

This week, we merged in two sets of features: from Shruti, we've incorporated a set of politeness features from Cristian Danescu-Niculescu-Mizil's ConvoKit package; and from Priya, we've incorporated features accounting for time (e.g., the pace of responding to each other). After incorporating ConvoKit, the out-of-sample R^2 for the CSOP dataset went up by 0.02 --- even though the features don't appear in the "top" ones, they're moving the needle.

We're now officially working on deep features, including using S-BERT to create sentence embeddings of all the messages (a job that just finished running yesterday). More on these features in the weeks to come!
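For context, here's roughly what the embedding step looks like with the sentence-transformers package (the specific model name is just an example, not necessarily the one we ran):

```python
# Sketch: S-BERT sentence embeddings for every chat message.
from sentence_transformers import SentenceTransformer

messages = ["great work, team!", "what should we bid on this round?"]  # stand-in data

model = SentenceTransformer("all-MiniLM-L6-v2")   # example model choice
embeddings = model.encode(messages, show_progress_bar=True)  # shape: (n_messages, dim)
print(embeddings.shape)
```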

(Update 3 of 3) The "Mapping" Part

[Screenshot: snippet of Nikhil's feature-to-construct mapping]

Nikhil's role on the team has been to take stock of our many features and literatures and group them into sensible behavioral categories, thus creating a structured representation of the processes we've surveyed from the literature and of the many ways we capture these processes using computational features. Above is a small snippet of his work.

One way we plan on incorporating this work is to "bundle" interpretable or related features together in our models. Thus, instead of treating all features as totally independent, we might group together all features that relate to "politeness," or "positivity," and so on --- allowing us to more closely align with the organizational behavior literature (as well as account for all the correlations between features). This will be an area in which we expand for the next few weeks!
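One simple way to operationalize the bundling idea is sketched below; the groupings are hypothetical examples, not Nikhil's actual categories:

```python
# Sketch: collapse correlated features into construct-level "bundles"
# by averaging the z-scored features within each (hypothetical) group.
import pandas as pd

BUNDLES = {
    "positivity": ["positive_words", "positivity_bert"],
    "politeness": ["gratitude", "hedges", "please"],
}

def bundle_features(df: pd.DataFrame, bundles: dict) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    for construct, cols in bundles.items():
        z = (df[cols] - df[cols].mean()) / df[cols].std()  # put features on one scale
        out[construct] = z.mean(axis=1)                    # one score per construct
    return out

# bundled = bundle_features(conversation_features, BUNDLES)
```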

(Epilogue) On how to turn this whole thing into a paper

I think this is where I'd love to hear from all of you (@duncanjwatts, and everyone else!!) --- because now we're seeing some preliminary results, and the pieces of a paper are coming together. I'm also looking ahead to:

... in which there are some chances to put this work out there (or commitment devices, really, for me to assemble the pieces). I'd love to hear your thoughts!

FIN

That's it for updates --- this was a long one, but hopefully you can tell that it's been productive here at TPM.

Week of May 25

This week, we onboarded our new team member, Shruti, who is now getting started with implementing her first feature! Shruti will be helping integrate some of the conversation features that come out of ConvoKit (Cristian DNM's package) into our pipeline. We've been talking about integrating this package for a while now, but thus far, it's taken more time and engineering than expected to build up the entire featurization and analysis pipeline, as well as build in some of the initial lexical features. But now, we're excited to move into more complex/advanced inference/models.

After Yashveer helped build out the model playground, I spent most of this week with my hands in the weeds, making edits and playing around with building models. Notably, the pipeline now ingests a collection of dataframes across multiple tasks, standardizes all dependent variables (within-task), and then fits models using both the conversation features and task features. (For now, we're just using categorical dummy variables per task, but we'll soon import the Task Map and use those features instead!) Additionally, the pipeline lets the user specify different types of models (currently we support two: XGBoost and LASSO).
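Here's a rough sketch of what the pipeline does at this stage; the column names and helpers are simplified stand-ins rather than the actual code:

```python
# Sketch: stack multiple task datasets, z-score the DV within each task,
# add task dummies, and fit either XGBoost or LASSO.
import pandas as pd
from sklearn.linear_model import Lasso
from xgboost import XGBRegressor

def prepare(dfs_by_task: dict) -> pd.DataFrame:
    frames = []
    for task, df in dfs_by_task.items():
        df = df.copy()
        df["task"] = task
        df["dv"] = (df["dv"] - df["dv"].mean()) / df["dv"].std()  # within-task z-score
        frames.append(df)
    data = pd.concat(frames, ignore_index=True)
    return pd.get_dummies(data, columns=["task"])  # categorical task dummies (for now)

def fit_model(data: pd.DataFrame, model_type: str = "xgboost"):
    X = data.drop(columns=["dv"])
    y = data["dv"]
    model = XGBRegressor(n_estimators=300) if model_type == "xgboost" else Lasso(alpha=0.01)
    return model.fit(X, y)
```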

Here is an XGBoost model, fit on 5 datasets: (1) Juries (predicting % agreement); (2) and (3) Two CSOP datasets (predicting efficiency); (4) Estimation (predicting post-discussion error percent, relative to true value); and (5) Divergent Association Task (predicting the score):

[Image: XGBoost model results]

And here is a LASSO model, fit on the same data:

[Images: LASSO model results]

R^2 for these models is better for XGBoost than for LASSO, by about 10-fold: 0.0463 (XGBoost) versus 0.0045 (LASSO).

While the models return different top features, the interpretations are somewhat similar --- across tasks, it pays to talk less (a consistent finding in our data that we have discussed before, perhaps because time spent talking trades off with time spent actually solving the task, as in CSOP, and perhaps due to other issues, such as information overload, social influence, group polarization, etc.). Additionally, it pays to be positive --- having more positivity/positive words is a fairly strong indicator of success across tasks. Finally, both models show that asking too many questions (perhaps an indication of confusion?) is negatively associated with performance.

There's a ton more work to do with regard to better accounting for and integrating task variables --- currently, including them as predictors does nothing more than account for the different performance distributions for each task --- but I think this is a step in the right direction, especially relative to last week.

Week of May 18

Initial Model Explorations This week, Yashveer completed an initial 'model playground' where we can take the generated features and output some out-of-the-box models (in this case, we feed all the features into XGBoost and use the Shapley values to visualize which ones are most important). We are modeling the problem as a regression; that is, we're trying to predict the target performance metric as a continuous variable. Additionally, we z-score all performance metrics.

I've been playing around inside this playground (pun intended), and here are 3 outputs, for (1) the Juries dataset; (2) the original CSOP dataset; and (3) a newer CSOP dataset:

[Images: model outputs for the three datasets]

A couple notes at first glance:

From a review by Marlow et al. (2018) --- and thanks to Nikhil for finding it! ---

A high volume of communication will inevitably impart some useful information, but it may also include irrelevant information that may distract from the more important details. In line with the literature on information overload (e.g., Edmunds & Morris, 2000), we suggest that a high frequency of communication may contain distracting, irrelevant information that may interfere with the ability of individuals to set priorities appropriately. Further, based on cognitive load theory (Van Merrienboer & Sweller, 2005), a large volume of communication may lead to difficulties in accurately remembering and comprehending more relevant, previously received information.

I think that being able to connect the outputs of the models to theories like this is definitely a step in the right direction for this project.

There are also lots of technical challenges/next steps to figure out:

So, there's lots to chew on, and lots more to do.

Feature Updates This week, we added in 4 new features, found a bug with one of our lexicon features, and are continuing the process of building. Same old, same old (except it's definitely exciting that we continue to grow our feature pool!)

Some stats...

I also had a really productive chat with Will about ways that we'll want to engineer the pipeline a little differently to account for the deliberation data. Part of it has to do with pre-processing (audio versus text features), and part of it has to do with different features of interest (e.g., modeling things like polarization and political orientation, and also outputting features at the user level, which we don't currently do --- I documented this in https://github.com/Watts-Lab/team-process-map/issues/113).

Recruiting Updates We were able to finalize our recruiting process and determine a final candidate to join us for the summer! @shapeseas has once again been such a huge help. Onboarding to follow!

GPT Updates No updates on this, this week. Thanks to Timothy Dorr for reaching out to chat about it! I've been so busy with code reviews for feature updates and playing around with the model that this fell off my radar again. :( But hopefully I will get back to this soon!

Week of May 11

As summer kicks off, we are making progress on all fronts:

Much of this week has also been spent setting up a few big tasks for the days/weeks ahead...

Week of April 20

Taking inspiration from @linneagandhi, I also put together a schedule for the summer, in conjunction with my team: https://docs.google.com/spreadsheets/d/1xcuFaq9176czL9wD5MHon_7jRGYiJACjENfg35Ru1wk/edit#gid=0

There are 5 main activities that will take place over the coming summer:

  1. Literature Review
  2. Feature Building
  3. Model Building
  4. Infrastructure
  5. Writing/Presenting.

By the official end of semester/start of summer, we hope to have all outstanding features wrapped. Priya has continued to hit a few NLP-related bugs, so we are hoping to close those out soon. Yashveer is continuing the EDA process, and we will --- per our discussion last week --- define a standardized requirement for choosing a dependent variable, and then choose a DV for each dataset and get to modeling by week of May 8. We will spend most of May iterating on the model (on the Model Building side) and building out deeper features --- e.g., exploring GPT and other deep learning applications beyond the lexical features we have focused on thus far --- on the Feature Building side. Lit Review will be tailored to support the Feature Building Efforts.

Starting in June, we will transition to infrastructure cleanup/changes --- we will need to audit what requirements need to change after we build out our models and iterate a bit. And by late June/early July, we'll start outlining, presenting (we have a poster at IC2S2), and will finish a write-up by the 2nd year paper deadline in September.

Week of April 13

As we head towards the end of the semester, our primary goal is to try to clean up any outstanding features and start standardizing our datasets in a way that allows us to build a single model to answer the question, "what different conversation features make a team successful across different tasks?"

One challenge along the way is that we do not have a standard dependent variable across the different tasks. "Success" in a jury deliberation means something very different than "success" in CSOP. And even within each game, there are often multiple objectives that can potentially be the dependent variable.

Accordingly, Yashveer has been working on visualizing the featurized conversations and comparing how different clusters of conversations differ across the various outcome variables. Below is a visual of just the juries dataset, projected into 2 dimensions using t-SNE (right-hand side). On the left-hand side is a heatmap of how the different clusters performed on the 4 different dependent variables of interest in the juries dataset: majority percentage, number flipped, flipped percentage, and number of votes.

I'd love to get your feedback on how we should go about selecting the right dependent variable for each of the datasets, so that we can get a unified perspective on what "success" looks like across so many different tasks.

[Image: t-SNE projection of the juries dataset (right) and heatmap of cluster performance on the four DVs (left)]
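(For reference, here is a sketch of how a figure like this can be produced; the feature and DV column names are placeholders.)

```python
# Sketch: t-SNE projection of featurized jury conversations, plus a heatmap
# of mean outcomes per cluster. `juries` and `feature_cols` are placeholders.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

dv_cols = ["majority_pct", "num_flipped", "flipped_pct", "num_votes"]
X = StandardScaler().fit_transform(juries[feature_cols])   # conversation features

coords = TSNE(n_components=2, random_state=42).fit_transform(X)
juries["cluster"] = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

fig, (ax_heat, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))
sns.heatmap(juries.groupby("cluster")[dv_cols].mean(), annot=True, ax=ax_heat)
ax_scatter.scatter(coords[:, 0], coords[:, 1], c=juries["cluster"], cmap="tab10")
ax_scatter.set_title("t-SNE of jury conversations")
plt.show()
```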

Summer planning updates:

Week of March 30

This week I had the chance to sit down for an hour with Dean Knox and Gus Cooney, who are publishing their paper about the CANDOR corpus (analyzing features in dyadic video conversation) very soon. I gave them the rundown of the Team Process Mapping project, and they seem really enthusiastic about the idea of applying the TPM approach of extracting lots of conversational features and mapping team features + task features --> conversational features --> performance. Dean described the team process as 'the ultimate mediator' --- after all, everything these groups do is mediated by their process (which we capture in our data!). Dean and Gus have a lot of experience with video analysis, as their previous work relates to video. So, it was great to discuss this as a potential future direction and to open up a possible avenue for collaboration. Dean also had some thoughts on some issues that I've been pondering but have not resolved: he suggested ways that I can model the data (which is multi-layered), and we also discussed whether/how to incorporate human labeling (which, as I've noted before, is really tricky because human labels are noisy and unreliable). Overall, I'm just glad I've met one more person who finds this work exciting and meaningful!

The rest of the machinery is rolling: building more features, fixing a few bugs (Duncan, I fixed the bug that crashed the code when I demo'ed it to you! Haha). It's a boring update that belies the amount of work happening under the hood.

The primary challenge remains that, as we progress towards more ML-heavy features, we lack labeled data to train models on our own dataset. The best we can do (for now) is to find extant labeled datasets (e.g., Twitter, Yelp, etc.), train on those, and then run inference on our own data. We don't see a better recourse for now; the alternatives are (1) setting up a labeling pipeline on our own data, which we have discussed repeatedly in the past, but we had decided that humans are bad at these judgment tasks, so the benefit seemed dubious; and (2) trying something like GPT-labeling (which recent evidence shows can be a cheaper rating pipeline than MTurk: https://arxiv.org/pdf/2303.15056.pdf). More ideas on this front are welcome!
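To illustrate what "find extant labeled data, then run inference on our own chats" looks like in practice, here's a sketch using an off-the-shelf sentiment model from Hugging Face (the model choice is just an example of the approach, not a commitment):

```python
# Sketch: run an off-the-shelf classifier (trained on external labeled data,
# e.g., Twitter/Yelp-style corpora) over our own chat messages.
from transformers import pipeline

messages = ["great work, team!", "I don't think that's right..."]  # stand-in chats

sentiment = pipeline("sentiment-analysis")  # downloads a default pretrained model
for msg, pred in zip(messages, sentiment(messages)):
    print(f"{pred['label']:>8} ({pred['score']:.2f})  {msg}")
```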

Looking ahead to the summer, it seems like 2 RA's will definitely be staying on. Yashveer will be staying with the team (doing an Independent Study), and Priya has committed to staying as well (as a paid RA). We're working on finalizing those plans, and Yashveer already has an excellent draft of his independent study proposal.

Week of March 23

We have had an incredibly productive week. Here's a sense of what we've been up to!

First, data. We now have 5 datasets (Juries; CSOP; PGG; and two versions of an Estimation task). We have another 1-2 datasets coming from a collaborator, Nak Won Rim, who should be able to send over his data within the next week. We can make basic plots of those features along a PCA projection of our feature space, and (naively) observe that there are some interesting discussion patterns.

[Image: PCA projection of the feature space across datasets]

The next step here is to continue to explore this data, as well as to start thinking about ways to start modeling it. I'll want to figure out a way to create a hierarchical model, as I now have a bunch of features for each team (e.g., team size, any composition variables such as social perceptiveness) x a bunch of features for different tasks (from the Task Map) x a bunch of features for the communication process --> Performance variable. I honestly don't yet have a great sense of what these models will look like, so I'm looking forward to starting the exploration process. Thoughts/suggestions on this front are welcome!

Second, features. I merged in about a half-dozen new features this week, and I honestly have a long backlog of pull requests to review from Priya and Yuluan, who are doing a fantastic job. We're continuing to grow the set of features we're able to pull from the chats, which is great.

Third, pipeline updates. Yashveer has been doing phenomenal work helping to speed up our pipeline at all stages. He was able to take a feature-generation process that took an untenably long amount of time to run (20 hours on the full dataset), and reduce it down to 2 minutes. We're working on merging in his changes and applying them throughout the code.

Fourth, literature review. We're constantly getting in new literature from Nikhil; so far, 29 papers have been fully documented.

Week of March 16

So sorry that I missed the Spring Break meeting! But TPM has been rolling along with lots of exciting new updates.

Week of March 2

Exciting update re/ IC2S2 --- I finished (and submitted!) a writeup! You can read it here: Team_Process_Mapping__IC2S22023.pdf

Perhaps my favorite part of it is creating a little proof of concept, in which we run some of our initial features on two of our initial datasets! We're able to get a model that works pretty well, and some insights that make sense! Aaaahh!!! Here's the figure from the IC2S2 submission, with a long caption explaining some of it:

[Screenshot: figure from the IC2S2 submission, with caption]

Having completed this step, we look towards:

Week of February 23

Week of February 16

Week of February 9

Our team has been working to formalize and clean up our pipeline, by standardizing features, encapsulating feature generation code into classes, and reducing redundancy. In particular, whereas our original design consolidated nearly all of the feature generation code into a single file, our new design defines two classes, Chat-Level Features and Conversation-Level Features, which encapsulate the features at the different levels. All chat-level features are also automatically summarized (via mean, stdev, max, min, etc.) into the conversation level. With our new design, we can create and process all features using ~5 lines of code in the main file --- and we can scale easily to more than 1 dataset in the future.
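To give a flavor of the rollup step (the real code wraps this inside the two feature classes; this is just a toy illustration with placeholder columns):

```python
# Sketch: how chat-level features get rolled up to the conversation level
# (mean / stdev / max / min per conversation). Column names are placeholders.
import pandas as pd

chat_level = pd.DataFrame({
    "conversation_id": ["conv1", "conv1", "conv2", "conv2"],
    "positive_words":  [2, 0, 1, 3],
    "num_words":       [12, 5, 8, 20],
})

conversation_level = (
    chat_level
    .groupby("conversation_id")
    .agg(["mean", "std", "max", "min"])
)
conversation_level.columns = ["_".join(col) for col in conversation_level.columns]
print(conversation_level)
```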

Here is a flowchart of our process, taken from our recent meeting. This schematic looks fairly similar to the version that I presented at the meeting last week; in both cases, the high-level goal is to take raw data (far left side), preprocess it, and transform it into chat-level and conversation-level features for prediction. [Image: pipeline flowchart]

As a view of our team members' updates:

Overall, it's been a productive week! We're spending a lot of time at the beginning of this semester doing what appear to be "cleanup" or software-infrastructure set-up tasks; but as we say in Chinese, the time spent sharpening one's axe is never wasted. I think that some of the initial ways that I had bootstrapped these systems were rapid but inefficient, and before we scale up further, establishing more robust engineering systems is critical.

Week of February 2

Work is underway across various pieces of the Team Process Mapping pipeline! Here is a bird's eye view:

Literature Review Cleanup Much of the literature we had previously gathered had been oriented around "team processes" generally, rather than team communication processes (where we had later narrowed our scope). The task of our OIDD299 RA's is to go through our Index of papers, mark the ones that are actually related to communication, and set up a new set of inclusion criteria given our updated scope. This process is now underway; we've now audited several dozen papers over the past week, and the first pass should be done by next week.

Engineering Scaffold Updates In parallel, we are working on coding up sets of features to extract from the text-based data. As we start tackling different types of features, we are realizing that we need to implement a more rigorous engineering scaffold for how to preprocess the data; create clean, reusable functions; and document outputs with time stamps.

Design of Data Preprocessing Pipeline Some of our features are defined at the chat level, others at the turn level, and still others at the conversation level. We want to make the engineering decision to "split out" the preprocessing and create a smooth pipeline for calculating features at the various levels, ultimately ending up with conversation-level features for the final prediction (our dependent variables live at the conversation level). We envision implementing a pipeline that looks something like the following: [Image: proposed preprocessing pipeline]

With this pipeline, we will...

In addition, Yashveer will be helping with further optimizations to our pipeline:

Designing the Eventual Experiment Looking ahead, the final piece that I'm thinking about is setting up the analytical pipeline for doing the causal inference we need to answer our research question. I've been really inspired by this article (https://www.science.org/doi/10.1126/sciadv.abg2652), and I'm thinking about how we want to apply the learnings to our pipeline as we look towards building the 'real' model. More ahead!

Week of January 26

Updates:

Unresolved issues:

Week of January 19 (with some combined updates along the way)

As we kick off this semester, we're looking to scale up this project! I am grateful that we have received funding from Wharton Analytics, and we will be bringing on new RA's with the grant, as well as through the OIDD 299 mentoring program. I spent a great deal of time in the last 2 weeks working on the hiring process (thank you @shapeseas for all the help!!!), and I'm excited to start the new year with a team that's double the size!

Here's a more point-by-point update of the different pieces of this project:

Project Scoping: 2nd Semester Goals

This semester, my goal is to establish the 'minimal viable product' (MVP) for the project, building towards a summer paper and a completed writeup for that first paper by the fall.

Our goal is to answer the question, 'what kind of team communications matter under what conditions?' One epiphany that I had was that this project could very easily plug into the Task Mapping paper. Specifically, right now I have what might be called a two-sided market; I have, on the one side, the space of team communications; and I have, on the other side, the space of all the tasks and contexts that might be pertinent to teams. Mapping both of these at the same time seems rather overwhelming.

My current plan [open to discussion] is to leverage the fact that (1) the initial datasets we're going off of are jury deliberation, CSOP, and ad writing, and (2) all of these are tasks that we already have mapped in the Task Map. Thus, my plan is to effectively model how the team communication changes as the task (as modeled by our Task Map) changes. The project thus plugs seamlessly into the existing Task Mapping project, and feels like a natural extension/next step. The downside is that this scope is a bit smaller than the "teams in ANY context" project that was initially envisioned. However, I will still eventually scale up our data --- e.g., bringing in new deliberation data, getting that (long-awaited) prediction team data --- and we will therefore have opportunities to expand beyond the tasks in our map in the longer run. That's still the goal!

However, my short-term goal is to have an initial version of the model (for Task Map tasks, and our limited existing set of data) built by the end of this semester, so that we can make adjustments and start writing by the summer.

Feature Building Updates: Starting Simple, then Scaling to More Complex ML Models

Priya, Yuluan, and I are going full steam ahead on starting to build new features. Currently, we're focused on a foundational set of relatively simple computational and lexical features; as our new ML RA is onboarded next week, however, I have in mind a set of more complex features, which will utilize advanced models, such as transformers.

Because we're trying to collect as many lexical features as we can, if you have access to dictionaries for such features used in any previous project, please let us know! See https://github.com/orgs/Watts-Lab/projects/8/views/2?pane=issue&itemId=18470903 :)

Feature Building Challenges / Decisions

As we add in new features, one question we have is whether (and how much) we should be hand-labeling. I wrote a longer blurb on this design decision, and I welcome your inputs on the GitHub Issue here: https://github.com/orgs/Watts-Lab/projects/8/views/2?pane=issue&itemId=17994477

Literature Review

Literature review has been on pause, as my 2 existing RA's have been reallocated to feature building. However, 2 of the new RA's I've recruited are going to be dedicated specifically to literature review for the next couple of weeks, so I'm expecting this to re-ramp up soon.

Week of December 9

This week, we made progress in building our infrastructure:

Issues:

Something to do at all stages of the pipeline:

Week of December 2

A major update with this project is that I have down-scoped it to look primarily at communication features. There is so much to team processes, and --- as Duncan rightly pointed out in one of our conversations --- so much of the team process takes place in members' heads. This is why a huge amount of the literature thus far has relied on self-report or survey measures; it's difficult to argue that we can fully computationalize all these constructs (and the endeavor is so large and daunting that it's unclear how exactly I'll finish).

So, for now, we are down-scoping to focus on communication features --- the corner of team process that people explicitly share with the world, and measuring how people interact from what they say to one another.

We're working on (1) continuing to refine our processes, and (2) cleaning up a first set of key features that we have fully "mapped" from the initial papers to a computational model.

The status of this is as follows:

Some changes to the team:

Week of November 18

I am now thinking of the project in terms of 4 phases. Right now we are working on Phase 1-3, and here are the updates for each phase:

1. Phase 1 -- Lit Review Phase ("Build the Process Map")

2. Phase 2 -- Computational Collection Phase (collect computational papers and methods to measure the proposed process features in real data)

3. Phase 3 -- Computational Implementation Phase (implement those features in code).

4. Phase 4 -- Application Phase (Analysis of Archival Data)

Week of November 11

Week of November 4

Week of October 28

Roadmap to a Better Team - Wharton Analytics Funding Proposal (November 2022).pdf

Week of October 21

Week of October 11

Image

Week of October 7

  1. Finishing touches on the progress tracker;
  2. Finishing an initial pass on the ~50 papers currently in the system;
  3. Identifying gaps in the papers and collecting new literature (currently, looking for more observational / NLP methods papers, as well as papers from 'top OB journals')

Week of September 30

Literature Review Progress This week, we've been making great progress on organizing the literature review — just making sense of the literature so far. In particular, we have been dividing the existing papers into different categories so that we have a better sense of where we are over/under-sampling and can identify gaps where we need to add more research.

Here's a screenshot of a great landing page organized by Priya:

[Image: screenshot of the literature review landing page]

Getting more data I have met with Dean's political deliberation team, and they're very open to collaborating and sharing data. They are running a bunch of experiments collecting audio and video data for teams deliberating controversial political topics. As I consolidate a model for team processes, they are willing to give me data so that I can make predictions about the outcomes of the deliberations. This will provide a promising context for testing the model!

Week of September 23

Our RA's are onboarded, and we had a great working session this week!

Our upcoming goal is to make a pass through all of the literature that we have so far and clean it up. I had collected a few dozen papers, and Eric Zhong had also collected a few dozen papers over the weeks that he's been working with us. In this pass, we want to synthesize what we have and identify the gaps in our current literature collection that we will want to fill in our next stage.

Inspired by Linnea, I've been working on a document (https://docs.google.com/document/d/1Rj2if3G2jngHU_PmFdcI3SiEOGrHNiW7IRlQs_A5GuI/edit) that tries to outline a few key ideas to guide our literature review process:

This document is very much a work in progress, and will continue to evolve as we clean up the existing literature and consolidate what we want.

A final piece is that I've been looking for data to connect to this project once it reaches the appropriate stage. After we "map out" the team process, I want to use the models of team process to make predictions on different datasets of teams interacting. Since I know that getting good data is half the battle too, I have been reaching out to people to ask for data/collaborations. This week, I made a connection through Dean Knox's lab and am meeting with Will Shultz (https://willschulz.com), a PhD student at Princeton who is collecting some data involving team deliberation.

Week of September 12

markwhiting commented 1 year ago

It would be interesting to see a short overview for this project, perhaps in the next update.

shapeseas commented 1 year ago

Excited to have your crew on board Emily! I have issued "Intro to GitHub" trainings for Priya + Candice: https://github.com/Watts-Lab/lab-setup/issues/89 https://github.com/Watts-Lab/lab-setup/issues/90

xehu commented 1 year ago

Thanks Eric for helping hire and onboard!!!

AND @markwhiting I have an 'About' page written here in case you're interested: https://github.com/Watts-Lab/team-process-map Happy to present too, though next Friday is stressful because I also have my summer paper oral defense...

linneagandhi commented 1 year ago

Cool! Reading the latest update re making predictions on Dean's data -- @xehu can you clarify why you are going to make predictions on deliberation? That sounds more like something @JamesPHoughton's project would inform, but maybe more of the tasks are deliberative-ish?

linneagandhi commented 1 year ago

Also I really like that you are balancing out your sampling in the lit review! Very smart!

xehu commented 1 year ago

@linneagandhi to answer your question about why I'm using deliberation data; honestly, I was reaching out to everyone I could to figure out where I can get a dataset that satisfies the following criteria: (1) contains team communication and interaction data; (2) contains some kind of outcome data. Dean's group was enthusiastic about my pitch and offered me data in the format that I asked for. I don't have a particular interest in deliberation, and the outcomes I will use will be those that are interesting to Dean, Will, and Chris. Their interest is specific to political decision-making, so it's likely very different from what James is interested in.

More broadly, my vision for Team Process Mapping is to understand which team processes/pathways are "dominant" in different contexts. So, I see deliberation as a specific context in which I can answer the question, which team processes are most relevant in determining team outcomes? I also hope to compare these results to collaboration data from different contexts --- for example, software engineering, Turkers on Empirica tasks, and so on.

linneagandhi commented 1 year ago

@xehu Cool! Another random idea then -- how about all the data on superforecaster or general forecaster communications from Phil & Barb's work? There the DV is accuracy of prediction. I think they have a ton of chat data of folks comparing notes, arguing, etc. in teams.

xehu commented 1 year ago

@linneagandhi yup! I actually already talked to Phil about it! He said to get back to him in mid-October, so I have a reminder on my calendar to email him again on October 17 or 18, haha! But yes, this is on my radar too. Please share any other teams that potentially meet the criteria!

markwhiting commented 1 year ago

Re your question, I agree with ConvoKit, or a more general version of it like I think we might have discussed before: creating models for theoretical concepts and validating them on a per-theory level. I think that's a really interesting technique for the study of stuff like this.

duncanjwatts commented 1 year ago

Linnea (with help from Eric) is building a pipeline in Google Docs for extracting experiment metadata and data from papers. Maybe that would be helpful?

amaatouq commented 1 year ago

(Potential discussion/feedback point) Looking ahead, I'm looking for good resources to think about how to operationalize social science constructs. I have realized that this project hinges on our ability to reliably measure the team process constructs (often vague notions like "cooperation"...). Otherwise, we can't really make claims that we're "testing different causal pathways," since we are creating (potentially weak, noisy) proxies. Are there good places to start? One thought is to use Cristian Danescu-Niculescu-Mizil's work as a potential jumping-off point, as ConvoKit may be useful.

I've been thinking about somewhat related issues lately and discussed them with Matt Salganik in the context of the Fragile Families study. I am going to share a summary of the back-and-forth with Matt here:

Here’s how we are thinking about things now. Imagine there is some true outcome Y and a contaminated measure of the outcome Y*. Further, imagine that there are some true predictors X and contaminated measures of these predictors X*. Finally, for now, let’s assume that we don’t know much about the relationship between the true values and the contaminated measures.

If you are interested in the Pearson correlation between Y and X, and if measurement error is “classical,” then you can divide the correlation of Y* and X* by the product of the reliabilities of X and Y (dx.doi.org/10.4135/9781412961288.n81).
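(For reference, the classical Spearman disattenuation correction is usually written as

$$\operatorname{corr}(X, Y) \;=\; \frac{\operatorname{corr}(X^{*}, Y^{*})}{\sqrt{\rho_{X^{*}}\,\rho_{Y^{*}}}}$$

where $\rho_{X^{*}}$ and $\rho_{Y^{*}}$ are the reliabilities of the measured predictor and outcome; note that the denominator is the square root of the product of the reliabilities.)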

If you are interested in the relationship between X* and Y, as expressed by a regression coefficient, you can do something called Simulation-Extrapolation. Roughly, if you have a model for measurement error, then you can add more measurement error and then extrapolate back to what the relationship would be with no measurement error (jstor.org/stable/2290994). This is closely related to a large body of work in statistics on measurement error models, which focus on errors in the predictors (jstor.org/stable/2669787).

If you are interested in the relationship between X* and Y, in a predictive sense, we have not found much. This paper seems to show that it depends a lot on the details (doi.org/10.1002/sim.6498), but this paper seems overly general (dx.doi.org/10.1198/jasa.2009.tm07543).

If you are interested in the relationship between X and Y, you can estimate the test-data noise ceiling. In broad strokes, it is the maximum accuracy with which you can predict Y given the measurement error in Y* (doi.org/10.1371/journal.pcbi.1006397). We found the figure in this blog post helpful: diedrichsenlab.org/BrainDataScience/noisy_correlation/index.htm

Genetics and polygenic scores also seem to attempt to correct measurement error, but we have not yet found any good papers on that (but I have not looked much).

Stepping back from all of this, I take away two main conclusions.

linneagandhi commented 1 year ago

Congrats on submitting your proposal!! Fingers crossed!!

linneagandhi commented 1 year ago

@xehu remind me - do you just need any conversation by a group + an outcome of that group conversation? I wonder if CSPAN hearings or Parliament hearings in the UK could work. Like, when something is put to a vote? (Just trying to be helpful if you need more data. If you don't I'll stop brainstorming :-)

xehu commented 1 year ago

@linneagandhi yup --- I'm mainly looking for group + outcome, although I'm currently considering smaller groups / teams .... when it gets to the entire parliament, and with the complexity of politics, things can get pretty complicated! That's a really good idea, though!

markwhiting commented 1 year ago

Happy to chat about system design for that problem! I think the answer is probably that it's going to be tricky no matter how you do it.

JamesPHoughton commented 1 year ago

I'd like to chat sometime about applying your feature extraction measures to the deliberation videos. Clear synergy, if I understand what's going on properly. =)

markwhiting commented 1 year ago

Synergy is rare!

Also, interested in how much you can scale your feature extraction process and whether that's something we can easily apply in the other teams project?

xehu commented 1 year ago

@JamesPHoughton yes --- it's my hope that this system can be used to analyze anything that involves (1) team communication data and (2) some outcome. That is also why I am involved in the project with Dean; a synergy came up such that I was offered a chance to engage with the project so that, eventually, I'd include that data in the analysis as well. So this is in the plan!

If you have other deliberation data that could be contributed, it would be great to even splinter off and look at a deliberation-focused paper. In fact, I'm currently piloting using the old chat data from my own deliberation paper with Mark and Michael Bernstein (published in 2020)!

As you can see from these updates, though, we're currently still extracting features and engineering the pipeline; this project is new and started just this semester. So we don't yet have a system in place for actually doing a very advanced analysis, but I hope that as we advance, we will be able to scale up and create a very generative infrastructure for multiple kinds of analysis.

linneagandhi commented 1 year ago

"I suppose I should have seen this one coming --- of course there is no one-size-fits-all way of analyzing communicational data; so how do we create a unifying paradigm across all these approaches? But honestly, this issue kind of reinforces how hard it is to create a mapping "atlas" as well. It's really, really hard to create a framework that works for even one domain, much less one that extends across domains." @xehu

Love it, commiserating right alongside you!

shapeseas commented 1 year ago

@xehu love the project kickoff slide deck!

xehu commented 1 year ago

@shapeseas thank you so much!!

xehu commented 1 year ago

Notes from meeting:

xehu commented 1 year ago

Lab meeting notes:

xehu commented 1 year ago

Notes from lab meeting:

markwhiting commented 1 year ago

Cool. Hoping to get you more data soon.

shapeseas commented 1 year ago

@xehu glad that DATS proposal got approved! Could you send the final to me or point me to where it sits?

re: working dates - he can start on the project as it relates to coursework as far as I'm concerned. Or is he asking for pay between 5/9 and 5/22? If so, that may be a bit more difficult to set up quickly but we can pursue that if you strongly prefer it.

xehu commented 1 year ago

@shapeseas following up with you on slack :)

xehu commented 1 year ago

Note from DV standardization posted here: https://github.com/Watts-Lab/team-process-map/issues/57

linneagandhi commented 1 year ago

@xehu Congrats on all you've achieved so far! You're not even done with year 2!

xehu commented 1 year ago

Thank you, @linneagandhi <3 thankful to have you as an example to learn from!

markwhiting commented 1 year ago

Great! Neat to see the pipeline coming together and being used in this way. Does the model matter? As in, what about using something more like AutoML? You might get much stronger predictors and could use some explainability techniques to check whether the features or bundles of features you care about stand out as important.