Watts-Lab / team_comm_tools

An open-source Python library that turns multiparty conversational data into social-science backed features.
https://teamcommtools.seas.upenn.edu/
MIT License

🕵🏻‍♀️Team Process Mapping Weekly Updates #5

Closed xehu closed 9 months ago

xehu commented 2 years ago

Week of June 29

Lots of work is currently in progress on all fronts! You all saw a pretty up-to-date summary of our work at the AdBoard meeting, but here is a quick rundown of what we are working on:

  1. New Features: Discursive diversity has been merged in, a new BERT-backed emotion feature is about to be merged, and topic-over-time features are in progress.
  2. Modeling Updates: Our baseline random forest models are now doing much better than before (R^2 = 0.19), and Yashveer has been working on a custom neural network architecture to see if we can boost performance and better integrate our domain knowledge. We plan to use the layered structure of the NN to represent what we know from behavioral theory about the relationships between features, so this approach seems promising (a rough sketch of the idea follows below).
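As a quick illustration of what we mean by encoding feature relationships in the network structure, here is a minimal sketch (not our actual implementation; the feature groupings and layer sizes are placeholders): each theorized construct gets its own small sub-network, and the construct-level representations are then combined for the final prediction.

```python
# Minimal sketch of a "theory-structured" neural network (hypothetical groupings).
import torch
import torch.nn as nn

# Hypothetical mapping from behavioral constructs to feature columns.
FEATURE_GROUPS = {
    "positivity": ["positive_words", "positivity_zscore"],
    "pace":       ["num_messages", "avg_gap_between_messages"],
    "politeness": ["gratitude", "hedges", "please"],
}

class TheoryStructuredNet(nn.Module):
    def __init__(self, feature_groups, hidden_per_group=4):
        super().__init__()
        self.groups = feature_groups
        # One small sub-network per construct, so related features share a layer.
        self.group_nets = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(len(cols), hidden_per_group), nn.ReLU())
            for name, cols in feature_groups.items()
        })
        # Combine construct-level representations into a single performance score.
        self.head = nn.Linear(hidden_per_group * len(feature_groups), 1)

    def forward(self, inputs):
        # `inputs` is a dict: construct name -> tensor of shape (batch, n_features_in_group)
        reps = [self.group_nets[name](inputs[name]) for name in self.groups]
        return self.head(torch.cat(reps, dim=1)).squeeze(-1)

model = TheoryStructuredNet(FEATURE_GROUPS)
```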

Week of June 15

We've had a really productive week, and the three biggest modeling challenges that we are currently tackling are:

  1. Modeling features over time;
  2. Modeling topics of conversation (and how they vary over time);
  3. Modeling collections of features, by yoking conversational measurements to the underlying behavioral constructs they represent.

We're currently hard at work on all these fronts, but I happen to be writing this update a bit last-minute (since I'm also working on --- but not yet finished with --- the slides for our AdBoard meeting!).

Thus, to give you a taste of our progress, I want to show Shruti's AMAZING work on Discursive Diversity. She has been exploring breaking conversations down into meaningful chunks and seeing how discursive diversity changes across each "period" in the conversation. Below are some plots of Discursive Diversity for the Juries dataset, playing around with dividing the conversation into 2, 3, and 4 chunks:

[Plots: Discursive Diversity over time for the Juries dataset, with 2, 3, and 4 chunks]

As well as similar plots for the CSOP (II) dataset:

[Plots: Discursive Diversity over time for the CSOP (II) dataset]

These plots are interesting because they show the evolution of the discursive diversity feature over time (thus allowing us to measure the modulation of the feature --- a key predictor theorized in the original paper). We plan to apply a similar analysis to other features, and to examine how the topic of discussion evolves over the course of the conversation.

One area in which this analysis might be applicable is determining whether we need to "cut short" the conversation inputs --- for example, as @JamesPHoughton noted last week, positivity may be such a predictive feature because people are congratulating each other at the end (e.g., 'great work, team!') leading to label leakage on our prediction of interest (performance). By modeling features over time, we hope to find the features that are truly predictive of performance before the interaction is over.

Week of June 8

This update covers the last two weeks, as sadly I missed the last lab meeting due to a family emergency. However, I've been really grateful to have a wonderful and supportive team in Yashveer, Priya, Nikhil, and Shruti, and we have a truly exciting (and quite long) round of updates!

(Update 1 of 3) Modeling

On the modeling front, we've recently merged our full "baseline model" into the main branch, with a ton of credit to Yashveer for his incredible work on this part of the project.

This model accounts for:

Note that there are some pretty major upgrades since the last update:

The Multi-Task Joint Model

Showing our best-performing model: Random Forest

Metrics:

| | R2 | MAE | MSE | RMSE |
| --- | --- | --- | --- | --- |
| Train | 0.5139 | 0.4561 | 0.4959 | 0.7042 |
| Validation | 0.1056 | 0.7238 | 0.8814 | 0.9388 |
| Test | 0.114 | 0.6406 | 0.7266 | 0.8524 |
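(For anyone curious how these numbers get produced, here is a minimal sketch of the kind of evaluation we run; the synthetic data and column names below are placeholders rather than our real pipeline.)

```python
# Minimal sketch of fitting the baseline Random Forest and reporting metrics.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)

def fake_split(n):  # stand-in for our real conversation-level feature matrices
    X = pd.DataFrame(rng.normal(size=(n, 5)),
                     columns=[f"feature_{i}" for i in range(5)])
    y = X["feature_0"] * 0.5 + rng.normal(scale=0.5, size=n)
    return X, y

X_train, y_train = fake_split(300)
X_val,   y_val   = fake_split(80)
X_test,  y_test  = fake_split(80)

def report(name, y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    print(f"{name}: R2={r2_score(y_true, y_pred):.4f} "
          f"MAE={mean_absolute_error(y_true, y_pred):.4f} "
          f"MSE={mse:.4f} RMSE={np.sqrt(mse):.4f}")

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

report("Train", y_train, rf.predict(X_train))
report("Validation", y_val, rf.predict(X_val))
report("Test", y_test, rf.predict(X_test))
```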

[Screenshots: SHAP feature-importance plots for the Random Forest model]

Interpretation What types of communication predict team success? Turns out, it's important, above all, to be positive --- the average use of positive language is currently our model's top predictor. In a close second is being terse (a feature that's come up in previous versions of our model, too).

Among the highlights:

(Tangent...) How do we interpret those confusing SHAP values in the picture? At the last meeting, I got some questions about how to interpret the red and blue dots. I've done more reading on this, including the original NeurIPS paper where the SHAP approach was introduced. The tl;dr is that the SHAP value is calculated on a datapoint-by-datapoint basis: for each point in the dataset, we calculate each feature's contribution to the final prediction for that point.

I found this blog post and the author's talk helpful, too; in the diagram below, taken from the blog post, the x-axis shows the SHAP value (whether the feature pushes the prediction for that point up or down), and the coloring of the point shows the value of the feature itself (whether the feature value is high or low for that point).

So, for example, in the plot below, being a female in a lower pclass (e.g., 1st or 2nd, as opposed to 3rd), who paid a higher fare and had a younger age, makes a person more likely to survive the Titanic. Each one of these points would represent a single passenger.

[Screenshot: SHAP summary plot for the Titanic dataset, from the blog post]
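For reference, a plot like this can be generated in just a few lines; here's a sketch, assuming a fitted tree-based model `rf` (like the one in the earlier sketch) and a feature DataFrame `X`:

```python
# Sketch: SHAP summary ("beeswarm") plot for a fitted tree model.
# Assumes `rf` is a fitted RandomForestRegressor and `X` is the feature DataFrame.
import shap

explainer = shap.TreeExplainer(rf)       # fast, exact SHAP values for tree models
shap_values = explainer.shap_values(X)   # one row of contributions per datapoint

# x-axis: SHAP value (push on the prediction); color: the feature's own value.
shap.summary_plot(shap_values, X)
```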

Predicting and testing out-of-sample: a CSOP Case Study

Another fun analysis is to "zoom in" specifically on CSOP --- because I have two separate CSOP datasets. One is from Abdullah's paper with Mohammed Alsobay, and another is from Abdullah's paper with Nak Won Rim. This means that we can train a model on one CSOP (the one with Mohammed), and predict out-of-sample on a "natural" test set (the one with Nak Won Rim).

Showing our best-performing model: Random Forest

Metrics:

| | R2 | MAE | MSE | RMSE |
| --- | --- | --- | --- | --- |
| Train | 0.3905 | 0.5246 | 0.5881 | 0.7669 |
| Validation | 0.2249 | 0.7276 | 1.0108 | 1.0054 |
| Test | 0.1628 | 0.6971 | 0.8354 | 0.914 |

[Screenshots: SHAP feature-importance plots for the CSOP out-of-sample model]

Here, we get similarly interpretable features: it's good to be positive, and it's good to talk less. It's also specifically good to say more substantive things (use fewer "stopwords").

Notably, we're getting R^2 = 0.16 out-of-sample on a natural test set from an entirely different run of the experiment! I also fit a different version of the model in which I mix the data from the two experiments; there, the out-of-sample test R^2 goes up to 0.25.
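Here's a sketch of what the "train on one CSOP, test on the other" setup looks like in code; the dataset variable names are placeholders:

```python
# Sketch: out-of-sample evaluation across the two CSOP datasets.
# csop_alsobay and csop_rim are hypothetical conversation-level feature DataFrames,
# each with a standardized "performance" column.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

feature_cols = [c for c in csop_alsobay.columns if c != "performance"]

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(csop_alsobay[feature_cols], csop_alsobay["performance"])

# "Natural" test set: an entirely separate run of the experiment.
preds = rf.predict(csop_rim[feature_cols])
print("Out-of-sample R^2:", r2_score(csop_rim["performance"], preds))
```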

And some general explorations

Also, while on a 15-hour flight to China, I played around with trying to visualize the data in other ways --- for example, seeing how the PCA of the conversations maps to the PCA of the Task Map.

They actually look remarkably similar: [Screenshots: PCA projections of the conversation features and the Task Map features]

I'm still mulling over how exactly to quantify how strongly the task features and conversation features relate to each other. But I thought this was cool!
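For what it's worth, the side-by-side projections themselves are simple to produce; here's a sketch (the two feature matrices are placeholders for our conversation features and Task Map features):

```python
# Sketch: compare 2-D PCA projections of conversation features vs. Task Map features.
# conv_features and task_features are hypothetical DataFrames with one row per task.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def project(df):
    # Standardize, then project down to 2 dimensions.
    return PCA(n_components=2).fit_transform(StandardScaler().fit_transform(df))

conv_2d = project(conv_features)
task_2d = project(task_features)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(conv_2d[:, 0], conv_2d[:, 1])
axes[0].set_title("Conversation features (PCA)")
axes[1].scatter(task_2d[:, 0], task_2d[:, 1])
axes[1].set_title("Task Map features (PCA)")
plt.show()
```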

(Update 2 of 3) New Conversational Features

This week, we merged in two sets of features: from Shruti, we've incorporated a set of politeness features from Cristian Danescu-Niculescu-Mizil's ConvoKit package; and from Priya, we've incorporated features accounting for time (e.g., the pace of responding to each other). After incorporating ConvoKit, the out-of-sample R^2 for the CSOP dataset went up by 0.02 --- even though the features don't appear in the "top" ones, they're moving the needle.

We're now officially working on deep features, including using S-BERT to create sentence embeddings of all the messages (a job that just finished running yesterday). More on these features in the weeks to come!
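For context, here's roughly what the embedding step looks like with the sentence-transformers package (the specific model name is just an example, not necessarily the one we ran):

```python
# Sketch: S-BERT sentence embeddings for every chat message.
from sentence_transformers import SentenceTransformer

messages = ["great work, team!", "what should we bid on this round?"]  # stand-in data

model = SentenceTransformer("all-MiniLM-L6-v2")   # example model choice
embeddings = model.encode(messages, show_progress_bar=True)  # shape: (n_messages, dim)
print(embeddings.shape)
```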

(Update 3 of 3) The "Mapping" Part

[Screenshot: snippet of Nikhil's feature-to-construct mapping]

Nikhil's role on the team has been to take stock of our many features and literatures and group them into sensible behavioral categories, thus creating a structured representation of the processes we've surveyed from the literature and of the many ways we capture these processes using computational features. Above is a small snippet of his work.

One way we plan on incorporating this work is to "bundle" interpretable or related features together in our models. Thus, instead of treating all features as totally independent, we might group together all features that relate to "politeness," or "positivity," and so on --- allowing us to more closely align with the organizational behavior literature (as well as account for all the correlations between features). This will be an area in which we expand for the next few weeks!
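One simple way to operationalize the bundling idea is sketched below; the groupings are hypothetical examples, not Nikhil's actual categories:

```python
# Sketch: collapse correlated features into construct-level "bundles"
# by averaging the z-scored features within each (hypothetical) group.
import pandas as pd

BUNDLES = {
    "positivity": ["positive_words", "positivity_bert"],
    "politeness": ["gratitude", "hedges", "please"],
}

def bundle_features(df: pd.DataFrame, bundles: dict) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    for construct, cols in bundles.items():
        z = (df[cols] - df[cols].mean()) / df[cols].std()  # put features on one scale
        out[construct] = z.mean(axis=1)                    # one score per construct
    return out

# bundled = bundle_features(conversation_features, BUNDLES)
```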

(Epilogue) On how to turn this whole thing into a paper

I think this is where I'd love to hear from all of you (@duncanjwatts, and everyone else!!) --- because now we're seeing some preliminary results, and the pieces of a paper are coming together. I'm also looking ahead to:

... in which there are some chances to put this work out there (or commitment devices, really, for me to assemble the pieces). I'd love to hear your thoughts!

FIN

That's it for updates --- this was a long one, but hopefully you can tell that it's been productive here at TPM.

Week of May 25

This week, we onboarded our new team member, Shruti, who is now getting started with implementing her first feature! Shruti will be helping integrate some of the conversation features that come out of ConvoKit (Cristian DNM's package) into our pipeline. We've been talking about integrating this package for a while now, but thus far, it's taken more time and engineering than expected to build up the entire featurization and analysis pipeline, as well as build in some of the initial lexical features. But now, we're excited to move into more complex/advanced inference/models.

After Yashveer helped build out the model playground, I spent most of this week with my hands in the weeds, making edits and playing around with building models. Notably, the pipeline now ingests a collection of dataframes across multiple tasks, standardizes all dependent variables (within-task), and then fits models using both the conversation features and task features. (For now, we're just using categorical dummy variables per task, but we'll soon import the Task Map and use those features instead!) Additionally, the pipeline lets the user specify different types of models (currently we support two: XGBoost and LASSO).
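Here's a rough sketch of what the pipeline does at this stage; the column names and helpers are simplified stand-ins rather than the actual code:

```python
# Sketch: stack multiple task datasets, z-score the DV within each task,
# add task dummies, and fit either XGBoost or LASSO.
import pandas as pd
from sklearn.linear_model import Lasso
from xgboost import XGBRegressor

def prepare(dfs_by_task: dict) -> pd.DataFrame:
    frames = []
    for task, df in dfs_by_task.items():
        df = df.copy()
        df["task"] = task
        df["dv"] = (df["dv"] - df["dv"].mean()) / df["dv"].std()  # within-task z-score
        frames.append(df)
    data = pd.concat(frames, ignore_index=True)
    return pd.get_dummies(data, columns=["task"])  # categorical task dummies (for now)

def fit_model(data: pd.DataFrame, model_type: str = "xgboost"):
    X = data.drop(columns=["dv"])
    y = data["dv"]
    model = XGBRegressor(n_estimators=300) if model_type == "xgboost" else Lasso(alpha=0.01)
    return model.fit(X, y)
```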

Here is an XGBoost model, fit on 5 datasets: (1) Juries (predicting % agreement); (2) and (3) Two CSOP datasets (predicting efficiency); (4) Estimation (predicting post-discussion error percent, relative to true value); and (5) Divergent Association Task (predicting the score):

[Image: XGBoost model results]

And here is a LASSO model, fit on the same data:

[Images: LASSO model results]

R^2 for these models is better for XGBoost than for LASSO, by about 10-fold: 0.0463 (XGBoost) versus 0.0045 (LASSO).

While the models return different top features, the interpretations are somewhat similar --- across tasks, it pays to talk less (a consistent finding in our data that we have discussed before, perhaps because time spent talking trades off with time spent actually solving the task, as in CSOP, and perhaps due to other issues, such as information overload, social influence, group polarization, etc.). Additionally, it pays to be positive --- having more positivity/positive words is a fairly strong indicator of success across tasks. Finally, both models show that asking too many questions (perhaps an indication of confusion?) is negatively associated with performance.

There's a ton more work to do with regard to better accounting for and integrating task variables --- currently, including them as predictors does nothing more than account for the different performance distributions for each task --- but I think this is a step in the right direction, especially relative to last week.

Week of May 18

Initial Model Explorations This week, Yashveer completed an initial 'model playground' where we can take the generated features and output some out-of-the-box models (in this case, we feed all the features into XGBoost and use the Shapley values to visualize which ones are most important). We are modeling the problem as a regression; that is, we're trying to predict the target performance metric as a continuous variable. Additionally, we z-score all performance metrics.

I've been playing around inside this playground (pun intended), and here are 3 outputs, for (1) the Juries dataset; (2) the original CSOP dataset; and (3) a newer CSOP dataset:

[Images: model outputs for the three datasets]

A couple notes at first glance:

From a review by Marlow et al. (2018) --- and thanks to Nikhil for finding it! ---

A high volume of communication will inevitably impart some useful information, but it may also include irrelevant information that may distract from the more important details. In line with the literature on information overload (e.g., Edmunds & Morris, 2000), we suggest that a high frequency of communication may contain distracting, irrelevant information that may interfere with the ability of individuals to set priorities appropriately. Further, based on cognitive load theory (Van Merrienboer & Sweller, 2005), a large volume of communication may lead to difficulties in accurately remembering and comprehending more relevant, previously received information.

I think that being able to connect the outputs of the models to theories like this is definitely a step in the right direction for this project.

There are also lots of technical challenges/next steps to figure out:

So, there's lots to chew on, and lots more to do.

Feature Updates This week, we added in 4 new features, found a bug with one of our lexicon features, and are continuing the process of building. Same old, same old (except it's definitely exciting that we continue to grow our feature pool!)

Some stats...

I also had a really productive chat with Will about ways that we'll want to engineer the pipeline a little differently to account for the deliberation data. Part of it has to do with pre-processing (audio versus text features), and part of it has to do with different features of interest (e.g., modeling things like polarization and political orientation, and also outputting features at the user level, which we don't currently do --- I documented this in https://github.com/Watts-Lab/team-process-map/issues/113).

Recruiting Updates We were able to finalize our recruiting process and determine a final candidate to join us for the summer! @shapeseas has once again been such a huge help. Onboarding to follow!

GPT Updates No updates on this, this week. Thanks to Timothy Dorr for reaching out to chat about it! I've been so busy with code reviews for feature updates and playing around with the model that this fell off my radar again. :( But hopefully I will get back to this soon!

Week of May 11

As summer kicks off, we are making progress on all fronts:

Much of this week has also been spent setting up a few big tasks for the days/weeks ahead...

Week of April 20

Taking inspiration from @linneagandhi, I also put together a schedule for the summer, in conjunction with my team: https://docs.google.com/spreadsheets/d/1xcuFaq9176czL9wD5MHon_7jRGYiJACjENfg35Ru1wk/edit#gid=0

There are 5 main activities that will take place over the coming summer:

  1. Literature Review
  2. Feature Building
  3. Model Building
  4. Infrastructure
  5. Writing/Presenting.

By the official end of semester/start of summer, we hope to have all outstanding features wrapped. Priya has continued to hit a few NLP-related bugs, so we are hoping to close those out soon. Yashveer is continuing the EDA process, and we will --- per our discussion last week --- define a standardized requirement for choosing a dependent variable, and then choose a DV for each dataset and get to modeling by week of May 8. We will spend most of May iterating on the model (on the Model Building side) and building out deeper features --- e.g., exploring GPT and other deep learning applications beyond the lexical features we have focused on thus far --- on the Feature Building side. Lit Review will be tailored to support the Feature Building Efforts.

Starting in June, we will transition to infrastructure cleanup/changes --- we will need to audit what requirements need to change after we build out our models and iterate a bit. And by late June/early July, we'll start outlining, presenting (we have a poster at IC2S2), and will finish a write-up by the 2nd year paper deadline in September.

Week of April 13

As we head towards the end of the semester, our primary goal is to try to clean up any outstanding features and start standardizing our datasets in a way that allows us to build a single model to answer the question, "what different conversation features make a team successful across different tasks?"

One challenge along the way is that we do not have a standard dependent variable across the different tasks. "Success" in a jury deliberation means something very different than "success" in CSOP. And even within each game, there are often multiple objectives that can potentially be the dependent variable.

Accordingly, Yashveer has been working on visualizing the featurized conversations and comparing how different clusters of conversations differ across the various outcome variables. Below is a visual of just the juries dataset, projected into 2 dimensions using t-SNE (right-hand side). On the left-hand side is a heatmap of how the different clusters performed on the 4 different dependent variables of interest in the juries dataset: majority percentage, number flipped, flipped percentage, and number of votes.

I'd love to get your feedback on how we should go about selecting the right dependent variable for each of the datasets, so that we can get a unified perspective on what "success" looks like across so many different tasks.

[Image: t-SNE projection of the juries dataset (right) and heatmap of cluster performance on the four DVs (left)]
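(For reference, here is a sketch of how a figure like this can be produced; the feature and DV column names are placeholders.)

```python
# Sketch: t-SNE projection of featurized jury conversations, plus a heatmap
# of mean outcomes per cluster. `juries` and `feature_cols` are placeholders.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

dv_cols = ["majority_pct", "num_flipped", "flipped_pct", "num_votes"]
X = StandardScaler().fit_transform(juries[feature_cols])   # conversation features

coords = TSNE(n_components=2, random_state=42).fit_transform(X)
juries["cluster"] = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

fig, (ax_heat, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))
sns.heatmap(juries.groupby("cluster")[dv_cols].mean(), annot=True, ax=ax_heat)
ax_scatter.scatter(coords[:, 0], coords[:, 1], c=juries["cluster"], cmap="tab10")
ax_scatter.set_title("t-SNE of jury conversations")
plt.show()
```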

Summer planning updates:

Week of March 30

This week I had the chance to sit down for an hour with Dean Knox and Gus Cooney, who are publishing their paper about the CANDOR corpus (analyzing features in dyadic video conversation) very soon. I gave them the rundown of the Team Process Mapping project, and they seem really enthusiastic about the idea of applying the TPM approach of extracting lots of conversational features and mapping team features + task features --> conversational features --> performance. Dean described the team process as 'the ultimate mediator' --- after all, everything these groups do is mediated by their process (which we capture in our data!). Dean and Gus have a lot of experience with video analysis, as their previous work relates to video. So, it was great to discuss this as a potential future direction and to open up a possible avenue for collaboration. Dean also had some thoughts on some issues that I've been pondering but have not resolved: he suggested ways that I can model the data (which is multi-layered), and we also discussed whether/how to incorporate human labeling (which, as I've noted before, is really tricky because human labels are noisy and unreliable). Overall, I'm just glad I've met one more person who finds this work exciting and meaningful!

The rest of the machinery is rolling: building more features, fixing a few bugs (Duncan, I fixed the bug that crashed the code when I demo'ed it to you! Haha). It's a boring update that belies the amount of work happening under the hood.

The primary challenge remains that, as we progress towards more ML-heavy features, we lack labeled data to train models on our own dataset. The best we can do (for now) is to find extant labeled datasets (e.g., Twitter, Yelp, etc.), train on those, and then run inference on our own data. We don't see a better recourse for now; the alternatives are (1) setting up a labeling pipeline on our own data, which we have discussed repeatedly in the past, but we had decided that humans are bad at these judgment tasks, so the benefit seemed dubious; and (2) trying something like GPT-labeling (which recent evidence shows can be a cheaper rating pipeline than MTurk: https://arxiv.org/pdf/2303.15056.pdf). More ideas on this front are welcome!
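To illustrate what "find extant labeled data, then run inference on our own chats" looks like in practice, here's a sketch using an off-the-shelf sentiment model from Hugging Face (the model choice is just an example of the approach, not a commitment):

```python
# Sketch: run an off-the-shelf classifier (trained on external labeled data,
# e.g., Twitter/Yelp-style corpora) over our own chat messages.
from transformers import pipeline

messages = ["great work, team!", "I don't think that's right..."]  # stand-in chats

sentiment = pipeline("sentiment-analysis")  # downloads a default pretrained model
for msg, pred in zip(messages, sentiment(messages)):
    print(f"{pred['label']:>8} ({pred['score']:.2f})  {msg}")
```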

Looking ahead to the summer, it seems like 2 RA's will definitely be staying on. Yashveer will be staying with the team (doing an Independent Study), and Priya has committed to staying as well (as a paid RA). We're working on finalizing those plans, and Yashveer already has an excellent draft of his independent study proposal.

Week of March 23

We have had an incredibly productive week. Here's a sense of what we've been up to!

First, data. We now have 5 datasets (Juries; CSOP; PGG; and two versions of an Estimation task). We have another 1-2 datasets coming from a collaborator, Nak Won Rim, who should be able to send over his data within the next week. We can make basic plots of those features along a PCA projection of our feature space, and (naively) observe that there are some interesting discussion patterns.

[Image: PCA projection of the feature space across datasets]

The next step here is to continue to explore this data, as well as to start thinking about ways to start modeling it. I'll want to figure out a way to create a hierarchical model, as I now have a bunch of features for each team (e.g., team size, any composition variables such as social perceptiveness) x a bunch of features for different tasks (from the Task Map) x a bunch of features for the communication process --> Performance variable. I honestly don't yet have a great sense of what these models will look like, so I'm looking forward to starting the exploration process. Thoughts/suggestions on this front are welcome!

Second, features. I merged in about a half-dozen new features this week, and I honestly have a long backlog of pull requests to review from Priya and Yuluan, who are doing a fantastic job. We're continuing to grow the set of features we're able to pull from the chats, which is great.

Third, pipeline updates. Yashveer has been doing phenomenal work helping to speed up our pipeline at all stages. He was able to take a feature-generation process that took an untenably long amount of time to run (20 hours on the full dataset), and reduce it down to 2 minutes. We're working on merging in his changes and applying them throughout the code.

Fourth, literature review. We're constantly getting in new literature from Nikhil; so far, 29 papers have been fully documented.

Week of March 16

So sorry that I missed the Spring Break meeting! But TPM has been rolling along with lots of exciting new updates.

Week of March 2

Exciting update re/ IC2S2 --- I finished (and submitted!) a writeup! You can read it here: Team_Process_Mapping__IC2S22023.pdf

Perhaps my favorite part of it is creating a little proof of concept, in which we run some of our initial features on two of our initial datasets! We're able to get a model that works pretty well, and some insights that make sense! Aaaahh!!! Here's the figure from the IC2S2 submission, with a long caption explaining some of it:

[Screenshot: figure from the IC2S2 submission, with caption]

Having completed this step, we look towards:

Week of February 23

Week of February 16

Week of February 9

Our team has been working to formalize and clean up our pipeline, by standardizing features, encapsulating feature generation code into classes, and reducing redundancy. In particular, whereas our original design consolidated nearly all of the feature generation code into a single file, our new design defines two classes, Chat-Level Features and Conversation-Level Features, which encapsulate the features at the different levels. All chat-level features are also automatically summarized (via mean, stdev, max, min, etc.) into the conversation level. With our new design, we can create and process all features using ~5 lines of code in the main file --- and we can scale easily to more than 1 dataset in the future.
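To give a flavor of the rollup step (the real code wraps this inside the two feature classes; this is just a toy illustration with placeholder columns):

```python
# Sketch: how chat-level features get rolled up to the conversation level
# (mean / stdev / max / min per conversation). Column names are placeholders.
import pandas as pd

chat_level = pd.DataFrame({
    "conversation_id": ["conv1", "conv1", "conv2", "conv2"],
    "positive_words":  [2, 0, 1, 3],
    "num_words":       [12, 5, 8, 20],
})

conversation_level = (
    chat_level
    .groupby("conversation_id")
    .agg(["mean", "std", "max", "min"])
)
conversation_level.columns = ["_".join(col) for col in conversation_level.columns]
print(conversation_level)
```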

Here is a flowchart of our process, taken from our recent meeting. This schematic looks fairly similar to the version that I presented at the meeting last week; in both cases, the high-level goal is to take raw data (far left side), preprocess it, and transform it into chat-level and conversation-level features for prediction. [Image: pipeline flowchart]

As a view of our team members' updates:

Overall, it's been a productive week! We're spending a lot of time at the beginning of this semester doing what appear to be "cleanup" or software-infrastructure set-up tasks; but as we say in Chinese, the time spent sharpening one's axe is never wasted. I think that some of the initial ways that I had bootstrapped these systems were rapid but inefficient, and before we scale up further, establishing more robust engineering systems is critical.

Week of February 2

Work is underway across various pieces of the Team Process Mapping pipeline! Here is a bird's eye view:

Literature Review Cleanup Much of the literature we had previously gathered had been oriented around "team processes" generally, rather than team communication processes (where we had later narrowed our scope). The task of our OIDD299 RA's is to go through our Index of papers, mark the ones that are actually related to communication, and set up a new set of inclusion criteria given our updated scope. This process is now underway; we've now audited several dozen papers over the past week, and the first pass should be done by next week.

Engineering Scaffold Updates In parallel, we are working on coding up sets of features to extract from the text-based data. As we start tackling different types of features, we are realizing that we need to implement a more rigorous engineering scaffold for how to preprocess the data; create clean, reusable functions; and document outputs with time stamps.

Design of Data Preprocessing Pipeline Some of our features are defined at the chat level, others at the turn level, and still others at the conversation level. We want to make the engineering decision to "split out" the preprocessing and create a smooth pipeline for calculating features at the various levels, ultimately ending up with conversation-level features for the final prediction (our dependent variables live at the conversation level). We envision implementing a pipeline that looks something like the following: [Image: proposed preprocessing pipeline]

With this pipeline, we will...

In addition, Yashveer will be helping with further optimizations to our pipeline:

Designing the Eventual Experiment Looking ahead, the final piece that I'm thinking about is setting up the analytical pipeline for doing the causal inference we need to answer our research question. I've been really inspired by this article (https://www.science.org/doi/10.1126/sciadv.abg2652), and I'm thinking about how we want to apply the learnings to our pipeline as we look towards building the 'real' model. More ahead!

Week of January 26

Updates:

Unresolved issues:

Week of January 19 (with some combined updates along the way)

As we kick off this semester, we're looking to scale up this project! I am grateful that we have received funding from Wharton Analytics, and we will be bringing on new RA's with the grant, as well as through the OIDD 299 mentoring program. I spent a great deal of time in the last 2 weeks working on the hiring process (thank you @shapeseas for all the help!!!), and I'm excited to start the new year with a team that's double the size!

Here's a more point-by-point update of the different pieces of this project:

Project Scoping: 2nd Semester Goals

This semester, my goal is to establish the 'minimal viable product' (MVP) for the project, building towards a summer paper and a completed writeup for that first paper by the fall.

Our goal is to answer the question, 'what kind of team communications matter under what conditions?' One epiphany that I had was that this project could very easily plug into the Task Mapping paper. Specifically, right now I have what might be called a two-sided market; I have, on the one side, the space of team communications; and I have, on the other side, the space of all the tasks and contexts that might be pertinent to teams. Mapping both of these at the same time seems rather overwhelming.

My current plan [open to discussion] is to leverage the fact that (1) the initial datasets we're going off of are jury deliberation, CSOP, and ad writing, and (2) all of these are tasks that we already have mapped in the Task Map. Thus, my plan is to effectively model how the team communication changes as the task (as modeled by our Task Map) changes. The project thus plugs seamlessly into the existing Task Mapping project, and feels like a natural extension/next step. The downside is that this scope is a bit smaller than the "teams in ANY context" project that was initially envisioned. However, I will still eventually scale up our data --- e.g., bringing in new deliberation data, getting that (long-awaited) prediction team data --- and we will therefore have opportunities to expand beyond the tasks in our map in the longer run. That's still the goal!

However, my short-term goal is to have an initial version of the model (for Task Map tasks, and our limited existing set of data) built by the end of this semester, so that we can make adjustments and start writing by the summer.

Feature Building Updates: Starting Simple, then Scaling to More Complex ML Models

Priya, Yuluan, and I are going full steam ahead on starting to build new features. Currently, we're focused on a foundational set of relatively simple computational and lexical features; as our new ML RA is onboarded next week, however, I have in mind a set of more complex features, which will utilize advanced models, such as transformers.

Because we're trying to collect as many lexical features as we can, if you have access to dictionaries for such features used in any previous project, please let us know! See https://github.com/orgs/Watts-Lab/projects/8/views/2?pane=issue&itemId=18470903 :)

Feature Building Challenges / Decisions

As we add in new features, one question we have is whether (and how much) we should be hand-labeling. I wrote a longer blurb on this design decision, and I welcome your inputs on the GitHub Issue here: https://github.com/orgs/Watts-Lab/projects/8/views/2?pane=issue&itemId=17994477

Literature Review

Literature review has been on pause, as my 2 existing RA's have been reallocated to feature building. However, 2 of the new RA's I've recruited are going to be dedicated specifically to literature review for the next couple of weeks, so I'm expecting this to re-ramp up soon.

Week of December 9

This week, we made progress in building our infrastructure:

Issues:

Something to do at all stages of the pipeline:

Week of December 2

A major update with this project is that I have down-scoped it to look primarily at communication features. There is so much to team processes, and --- as Duncan rightly pointed out in one of our conversations --- so much of the team process takes place in members' heads. This is why a huge amount of the literature thus far has relied on self-report or survey measures; it's difficult to argue that we can fully computationalize all these constructs (and the endeavor is so large and daunting that it's unclear how exactly I'll finish).

So, for now, we are down-scoping to focus on communication features --- the corner of team process that people explicitly share with the world, and measuring how people interact from what they say to one another.

We're working on (1) continuing to refine our processes, and (2) cleaning up a first set of key features that we have fully "mapped" from the initial papers to a computational model.

The status of this is as follows:

Some changes to the team:

Week of November 18

I am now thinking of the project in terms of 4 phases. Right now we are working on Phase 1-3, and here are the updates for each phase:

1. Phase 1 -- Lit Review Phase ("Build the Process Map")

2. Phase 2 -- Computational Collection Phase (collect computational papers and methods to measure the proposed process features in real data)

3. Phase 3 -- Computational Implementation Phase (implement those features in code).

4. Phase 4 -- Application Phase (Analysis of Archival Data)

Week of November 11

Week of November 4

Week of October 28

Roadmap to a Better Team - Wharton Analytics Funding Proposal (November 2022).pdf

Week of October 21

Week of October 11

Image

Week of October 7

  1. Finishing touches on the progress tracker;
  2. Finishing an initial pass on the ~50 papers currently in the system;
  3. Identifying gaps in the papers and collecting new literature (currently, looking for more observational / NLP methods papers, as well as papers from 'top OB journals')

Week of September 30

Literature Review Progress This week, we've been making great progress on organizing the literature review — just making sense of the literature so far. In particular, we have been dividing the existing papers into different categories so that we have a better sense of where we are over/under-sampling and can identify gaps where we need to add more research.

Here's a screenshot of a great landing page organized by Priya:

[Image: screenshot of the literature review landing page]

Getting more data I have met with Dean's political deliberation team, and they're very open to collaborating and sharing data. They are running a bunch of experiments collecting audio and video data for teams deliberating controversial political topics. As I consolidate a model for team processes, they are willing to give me data so that I can make predictions about the outcomes of the deliberations. This will provide a promising context for testing the model!

Week of September 23

Our RA's are onboarded, and we had a great working session this week!

Our upcoming goal is to make a pass through all of the literature that we have so far and clean it up. I had collected a few dozen papers, and Eric Zhong had also collected a few dozen papers over the weeks that he's been working with us. In this pass, we want to synthesize what we have and identify the gaps in our current literature collection that we will want to fill in our next stage.

Inspired by Linnea, I've been working on a document (https://docs.google.com/document/d/1Rj2if3G2jngHU_PmFdcI3SiEOGrHNiW7IRlQs_A5GuI/edit) that tries to outline a few key ideas to guide our literature review process:

This document is very much a work in progress, and will continue to evolve as we clean up the existing literature and consolidate what we want.

A final piece is that I've been looking for data to connect to this project once it reaches the appropriate stage. After we "map out" the team process, I want to use the models of team process to make predictions on different datasets of teams interacting. Since I know that getting good data is half the battle too, I have been reaching out to people to ask for data/collaborations. This week, I made a connection through Dean Knox's lab and am meeting with Will Shultz (https://willschulz.com), a PhD student at Princeton who is collecting some data involving team deliberation.

Week of September 12

markwhiting commented 1 year ago

It would be interesting to see a short overview for this project, perhaps in the next update.

shapeseas commented 1 year ago

Excited to have your crew on board Emily! I have issued "Intro to GitHub" trainings for Priya + Candice: https://github.com/Watts-Lab/lab-setup/issues/89 https://github.com/Watts-Lab/lab-setup/issues/90

xehu commented 1 year ago

Thanks Eric for helping hire and onboard!!!

AND @markwhiting I have an 'About' page written here in case you're interested: https://github.com/Watts-Lab/team-process-map Happy to present too, though next Friday is stressful because I also have my summer paper oral defense...

linneagandhi commented 1 year ago

Cool! Reading the latest update re making predictions on Dean's data -- @xehu can you clarify why you are going to make predictions on deliberation? That sounds more like something @JamesPHoughton's project would inform, but maybe more of the tasks are deliberative-ish?

linneagandhi commented 1 year ago

Also I really like that you are balancing out your sampling in the lit review! Very smart!

xehu commented 1 year ago

@linneagandhi to answer your question about why I'm using deliberation data; honestly, I was reaching out to everyone I could to figure out where I can get a dataset that satisfies the following criteria: (1) contains team communication and interaction data; (2) contains some kind of outcome data. Dean's group was enthusiastic about my pitch and offered me data in the format that I asked for. I don't have a particular interest in deliberation, and the outcomes I will use will be those that are interesting to Dean, Will, and Chris. Their interest is specific to political decision-making, so it's likely very different from what James is interested in.

More broadly, my vision for Team Process Mapping is to understand which team processes/pathways are "dominant" in different contexts. So, I see deliberation as a specific context in which I can answer the question, which team processes are most relevant in determining team outcomes? I also hope to compare these results to collaboration data from different contexts --- for example, software engineering, Turkers on Empirica tasks, and so on.

linneagandhi commented 1 year ago

@xehu Cool! Another random idea then -- how about all the data on superforecaster or general forecaster communications from Phil & Barb's work? There the DV is accuracy of prediction. I think they have a ton of chat data of folks comparing notes, arguing, etc. in teams.

xehu commented 1 year ago

@linneagandhi yup! I actually already talked to Phil about it! He said to get back to him in mid-October, so I have a reminder on my calendar to email him again on October 17 or 18, haha! But yes, this is on my radar too. Please share any other teams that potentially meet the criteria!

markwhiting commented 1 year ago

Re your question, I agree with ConvoKit, or a more general version of it like I think we might have discussed before: creating models for theoretical concepts and validating them on a per-theory level. I think that's a really interesting technique for the study of stuff like this.

duncanjwatts commented 1 year ago

Linnea (with help from Eric) is building a pipeline in Google Docs for extracting experiment metadata and data from papers. Maybe that would be helpful?

amaatouq commented 1 year ago

(Potential discussion/feedback point) Looking ahead, I'm looking for good resources to think about how to operationalize social science constructs. I have realized that this project hinges on our ability to reliably measure the team process constructs (often vague notions like "cooperation"...). Otherwise, we can't really make claims that we're "testing different causal pathways," since we are creating (potentially weak, noisy) proxies. Are there good places to start? One thought is to use Cristian Danescu-Niculescu-Mizil's work as a potential jumping-off point, as ConvoKit may be useful.

I've been thinking about somewhat related issues lately and discussed them with Matt Salganik in the context of the Fragile Families study. I am going to share a summary of the back-and-forth with Matt here:

Here’s how we are thinking about things now. Imagine there is some true outcome Y and a contaminated measure of the outcome Y*. Further, imagine that there are some true predictors X and contaminated measures of these predictors X*. Finally, for now, let’s assume that we don’t know much about the relationship between the true values and the contaminated measures.

If you are interested in the Pearson correlation between Y and X, and if measurement error is “classical,” then you can divide the correlation of Y* and X* by the product of the reliabilities of X and Y (dx.doi.org/10.4135/9781412961288.n81).
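(For reference, the classical Spearman disattenuation correction is usually written as

$$\operatorname{corr}(X, Y) \;=\; \frac{\operatorname{corr}(X^{*}, Y^{*})}{\sqrt{\rho_{X^{*}}\,\rho_{Y^{*}}}}$$

where $\rho_{X^{*}}$ and $\rho_{Y^{*}}$ are the reliabilities of the measured predictor and outcome; note that the denominator is the square root of the product of the reliabilities.)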

If you are interested in the relationship between X* and Y, as expressed by a regression coefficient, you can do something called Simulation-Extrapolation. Roughly, if you have a model for measurement error, then you can add more measurement error and then extrapolate back to what the relationship would be with no measurement error (jstor.org/stable/2290994). This is closely related to a large body of work in statistics on measurement error models, which focus on errors in the predictors (jstor.org/stable/2669787).

If you are interested in the relationship between X* and Y, in a predictive sense, we have not found much. This paper seems to show that it depends a lot on the details (doi.org/10.1002/sim.6498), but this paper seems overly general (dx.doi.org/10.1198/jasa.2009.tm07543).

If you are interested in the relationship between X and Y, you can estimate the test-data noise ceiling. In broad strokes, it is the maximum accuracy with which you can predict Y given the measurement error in Y* (doi.org/10.1371/journal.pcbi.1006397). We found the figure in this blog post helpful: diedrichsenlab.org/BrainDataScience/noisy_correlation/index.htm

Genetics and polygenic scores also seem to attempt to correct measurement error, but we have not yet found any good papers on that (but I have not looked much).

Stepping back from all of this, I take away two main conclusions.

linneagandhi commented 1 year ago

Congrats on submitting your proposal!! Fingers crossed!!

linneagandhi commented 1 year ago

@xehu remind me - do you just need any conversation by a group + an outcome of that group conversation? I wonder if CSPAN hearings or Parliament hearings in the UK could work. Like, when something is put to a vote? (Just trying to be helpful if you need more data. If you don't I'll stop brainstorming :-)

xehu commented 1 year ago

@linneagandhi yup --- I'm mainly looking for group + outcome, although I'm currently considering smaller groups / teams .... when it gets to the entire parliament, and with the complexity of politics, things can get pretty complicated! That's a really good idea, though!

markwhiting commented 1 year ago

Happy to chat about system design for that problem! I think the answer is probably that it's going to be tricky no matter how you do it.

JamesPHoughton commented 1 year ago

I'd like to chat sometime about applying your feature extraction measures to the deliberation videos. Clear synergy, if I understand what's going on properly. =)

markwhiting commented 1 year ago

Synergy is rare!

Also, interested in how much you can scale your feature extraction process and whether that's something we can easily apply in the other teams project?

xehu commented 1 year ago

@JamesPHoughton yes --- it's my hope that this system can be used to analyze anything that involves (1) team communication data and (2) some outcome. That is also why I am involved in the project with Dean; a synergy came up such that I was offered a chance to engage with the project so that, eventually, I'd include that data in the analysis as well. So this is in the plan!

If you have other deliberation data that could be contributed, it would be great to even splinter off and look at a deliberation-focused paper. In fact, I'm currently piloting using the old chat data from my own deliberation paper with Mark and Michael Bernstein (published in 2020)!

As you can see from these updates, though, we're currently still extracting features and engineering the pipeline; this project is new and started just this semester. So we don't yet have a system in place for actually doing a very advanced analysis, but I hope that as we advance, we will be able to scale up and create a very generative infrastructure for multiple kinds of analysis.

linneagandhi commented 1 year ago

"I suppose I should have seen this one coming --- of course there is no one-size-fits-all way of analyzing communicational data; so how do we create a unifying paradigm across all these approaches? But honestly, this issue kind of reinforces how hard it is to create a mapping "atlas" as well. It's really, really hard to create a framework that works for even one domain, much less one that extends across domains." @xehu

Love it, commiserating right alongside you!

shapeseas commented 1 year ago

@xehu love the project kickoff slide deck!

xehu commented 1 year ago

@shapeseas thank you so much!!

xehu commented 1 year ago

Notes from meeting:

xehu commented 1 year ago

Lab meeting notes:

xehu commented 1 year ago

Notes from lab meeting:

markwhiting commented 1 year ago

Cool. Hoping to get you more data soon.

shapeseas commented 1 year ago

@xehu glad that DATS proposal got approved! Could you send the final to me or point me to where it sits?

re: working dates - he can start on the project as it relates to coursework as far as I'm concerned. Or is he asking for pay between 5/9 and 5/22? If so, that may be a bit more difficult to set up quickly but we can pursue that if you strongly prefer it.

xehu commented 1 year ago

@shapeseas following up with you on slack :)

xehu commented 1 year ago

Note from DV standardization posted here: https://github.com/Watts-Lab/team-process-map/issues/57

linneagandhi commented 1 year ago

@xehu Congrats on all you've achieved so far! You're not even done with year 2!

xehu commented 1 year ago

Thank you, @linneagandhi <3 thankful to have you as an example to learn from!

markwhiting commented 1 year ago

Great! Neat to see the pipeline coming together and being used in this way. Does the model matter? As in, what about using something more like AutoML? You might get much stronger predictors and could use some explainability techniques to check whether the features or bundles of features you care about stand out as important.