Watts-Lab / team_comm_tools

An open-source Python library that turns multiparty conversational data into social-science-backed features.
https://teamcommtools.seas.upenn.edu/
MIT License

Create a Labeling Pipeline to Validate Computationally-Extracted Features #81

Closed · xehu closed this issue 2 months ago

xehu commented 1 year ago

Recap

In Team Process Mapping, we are generating a large set of computational measures of team communication features that are closely tied to theories from behavioral science. We then intend to use these features to predict team performance across different tasks, in order to understand how different theories about team communication processes "play out" across different task contexts.

The Problem

At the end of the day, the features that we are measuring are only proxies of the "true" underlying team feature. Consider, for example, the following causal diagram. We might think, based on behavioral science theories, that Feature A (Positivity) positively impacts our Dependent Variable B of interest (Team Performance). But in reality, we do not have access to the true positivity of a team --- that's inside the hearts and minds of team members. We only have access to what they said during a given interaction, and we can generate proxy measures (e.g., Proxy A' and Proxy A'') using different computational methods to approximate the underlying variable.

This problem generally applies to all features; in the image below, we also extend it to some Feature C (measured by Proxies C' and C'') and Feature D (measured by Proxies D' and D'').

[Image: causal diagram in which latent Features A (Positivity), C, and D affect the Dependent Variable B (Team Performance), with each feature observed only through its proxies (A'/A'', C'/C'', D'/D'').]

Suppose then that we use these proxies to predict team performance, and we find that D' is stronger than A' for predicting team performance.

Can we then say, for this task, "Feature D is a stronger predictor of team performance than Feature A?"

One potential issue here is that we don't know whether D' outperforms A' because the underlying Feature D really is a stronger predictor than Feature A, or simply because D' is a stronger proxy for D than A' is for A (in other words, A' may just be a weak proxy for an otherwise strong theory). We need some way to validate the computationally-extracted features so that we can argue that they all do a reasonable job of quantifying the underlying human construct.
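
To make the concern concrete, here is a minimal simulation (purely illustrative; the coefficients and noise levels are made up) showing how a clean proxy of a weaker feature can out-predict a noisy proxy of a stronger one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Latent "true" features: A is the stronger driver of performance, D the weaker one.
feature_a = rng.normal(size=n)
feature_d = rng.normal(size=n)
performance = 0.6 * feature_a + 0.3 * feature_d + rng.normal(scale=0.5, size=n)

# Proxies: A' is a very noisy measure of A, D' is a clean measure of D.
proxy_a = feature_a + rng.normal(scale=3.0, size=n)  # weak proxy of a strong feature
proxy_d = feature_d + rng.normal(scale=0.2, size=n)  # strong proxy of a weaker feature

print(np.corrcoef(proxy_a, performance)[0, 1])  # roughly 0.23
print(np.corrcoef(proxy_d, performance)[0, 1])  # roughly 0.35: D' "beats" A' even though D < A
```

If something like this is going on, a head-to-head comparison of proxies conflates measurement quality with the strength of the underlying theory, which is why we need some way to validate the proxies themselves.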

To do this, we need labels.

Proposal: Creating a Parallel Labeling Pipeline to Validate Computationally-Extracted Features

I'd like to gut-check the following proposed pipeline:

  1. As we extract features from the literature review, we will also note down the exact original definitions and survey questions traditionally used to measure those constructs in psychology.
  2. Just as we did for Task Mapping, we create survey questions around those original definitions/questions.
  3. We put those survey questions into the same rating pipeline as Task Mapping, and we ask our panel of High-Effort Turkers to rate a small subsample of team conversations (e.g., ~100 chats).
  4. We use those ~100 labeled chats to evaluate how well the proxies perform against human judgment (see the sketch below).
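
Concretely, the evaluation step could look something like the following sketch (the file names and column names are placeholders, not anything we've built):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical inputs: one row per conversation, with computational proxies in one
# file and (averaged) Turker ratings of the same constructs in another.
proxies = pd.read_csv("proxy_features.csv", index_col="conversation_id")
ratings = pd.read_csv("turker_ratings.csv", index_col="conversation_id")

# Map each proxy column to the human rating for the same construct
# (column names below are placeholders).
construct_map = {
    "positivity_bert": "positivity_rating",
    "turn_taking_index": "turn_taking_rating",
}

for proxy_col, rating_col in construct_map.items():
    merged = proxies[[proxy_col]].join(ratings[[rating_col]], how="inner").dropna()
    rho, p = spearmanr(merged[proxy_col], merged[rating_col])
    print(f"{proxy_col} vs. {rating_col}: rho={rho:.2f}, p={p:.3f}, n={len(merged)}")
```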

In effect, I'm looking for a way to create "gold" labels so that we can sanity-check that the features we are extracting make sense and cohere with the behavioral-science measures they are meant to approximate. My proposal is to leverage a pipeline similar to the one we already built for Task Mapping, but applied to the behavioral features extracted during Process Mapping, and to do this on only a very small subset of the conversations, primarily for validating our computational measures.

Of course, one objection is that the human labels from Turk still won't perfectly match the original measures from behavioral science. After all, in the original psychology papers, many of the features we extracted were measured via self-report surveys; e.g., people evaluated their own subjective feelings. Here, the best we can do is to ask other people to read the chats and make a judgment.

However, the hope is that we can use these labels as a sanity check --- as good of a sanity check as we can get, anyway --- to confirm that our features are reasonable proxies for the behavioral constructs they are intended to measure.

[Image]

linneagandhi commented 1 year ago

Very cool!! I particularly like the idea of having 2+ ways of getting at the same underlying construct, demonstrating any gaps created by that measurement degree of freedom. That itself could be a paper maybe?

shapeseas commented 1 year ago

Do the features have a score for every team member or is it one score for the whole conversation? I can imagine one member may feel differently about turn-taking or positivity than another in the same conversation. How is that reconciled in the feature score? One person may feel positivity even when the models say it wasn't there.

^Maybe not helpful, just a thought I had when reading :D

xehu commented 1 year ago

Notes from Lab Meeting:

[Image]

Duncan suggests:

  1. You get two measures (e.g., Naive Bayes and BERT); check that they're highly correlated --- if they are correlated, then there is no problem
  2. If they're not highly correlated, throw both into the model and see if one is more predictive than the other, then use that one --- you can also say, 'they are capturing different elements of the same thing' (sketch below)
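
A rough sketch of what Duncan's two-step check might look like (the file name, column names, and the 0.7 correlation threshold are all placeholders, not something we've agreed on):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical conversation-level data with two proxies for the same construct
# plus the performance outcome.
df = pd.read_csv("team_level_features.csv")

# Step 1: check whether the two proxies are highly correlated.
r = df["positivity_nb"].corr(df["positivity_bert"])
print(f"Correlation between the two proxies: {r:.2f}")

# Step 2: if they aren't, put both in the model and compare their contributions.
if r < 0.7:  # illustrative threshold only
    X = sm.add_constant(df[["positivity_nb", "positivity_bert"]])
    fit = sm.OLS(df["performance"], X).fit()
    print(fit.summary())  # check which proxy carries more of the predictive weight
```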

James: Do people care about an ML measure of positivity, or whether you surveyed them --- do people leave the discussion having a positive experience?

Linnea: This feels a bit unsolvable, kind of like any researcher degree of freedom: maybe the answer is that you will never get a perfect measure, so make a decision, make it transparent, admit the limitations, and then people can critique and help improve it.

James: If you have an outcome that you care about (task performance) and some intermediate outcomes (positivity, turn taking) and some interventions (communication skill training) then the engineer/manager just cares about the relationship between the intervention and the outcome. You might care about the intermediate outcomes if you have some other intervention that works on those intermediating variables that you could choose to draw on

In that case, you’d want to use whatever intermediating measures you used to measure these extra interventions...

Duncan: Thinks that the main contribution is not about the goodness of the 'metrics' -- it's not about psychometrics or ML

Duncan: What keeps us honest is out-of-sample prediction. If you had a better proxy, your out-of-sample predictions would improve, so the model itself would get more predictive.

Linnea: Ben Lira might have an idea or two (Emily I think he was in Uri's class with us last fall)

xehu commented 1 year ago

@shapeseas we calculate features at two levels: (1) per chat (i.e., per message) and (2) per team member; both are then aggregated up to the conversation level.
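
For illustration only (toy data and plain pandas, not the toolkit's actual API), the aggregation logic looks roughly like this:

```python
import pandas as pd

# Toy data: one row per chat message, with a per-message feature score.
chats = pd.DataFrame({
    "conversation_id": [1, 1, 1, 2, 2],
    "speaker_id":      ["a", "b", "a", "c", "d"],
    "positivity":      [0.9, 0.2, 0.7, 0.5, 0.6],
})

# Per team member: average that member's per-message scores...
by_member = chats.groupby(["conversation_id", "speaker_id"])["positivity"].mean()

# ...then aggregate the member-level scores up to the conversation level.
by_conversation = by_member.groupby("conversation_id").agg(["mean", "std"])
print(by_conversation)
```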

markwhiting commented 1 year ago

15:18:51 From Mark Whiting to Everyone:
    https://github.com/orgs/Watts-Lab/projects/8/views/1
15:32:58 From Linnea Gandhi to Everyone:
    Could results just trigger IFF real? Or is the staggered timeline the problem? (Sorry if I missed a tech detail)
15:33:23 From Emily Hu to Everyone:
    Can we not just run an experiment where we recruit reviewers into an experiment instead of doing a field study
15:33:46 From Emily Hu to Everyone:
    does it have to be an audit study?
15:34:54 From Linnea Gandhi to Everyone:
    you give a continuous score, standardize by reviewer, and the top X get in
15:35:35 From Linnea Gandhi to Everyone:
    or a two-stage process -- you do the ratings, then submit, then chatGPT dropped, you are shown your ratings of the real ones again and re-vote on action
15:35:54 From Linnea Gandhi to Everyone:
    I love how we academics are all in on experimenting on others but nothing that impacts our own consequences :-)
15:36:03 From Linnea Gandhi to Everyone:
    This is why corporations don't experiment.
15:37:53 From Linnea Gandhi to Everyone:
    Reacted to "Screenshot2023_02_16_153657.jpg" with ❗
15:37:56 From Mark Whiting to Everyone:
    Accept!
15:42:18 From Linnea Gandhi to Everyone:
    HAHA the problem with like ALL psychology. Proxies for our brains.
15:42:44 From Mark Whiting to Everyone:
    Reacted to "HAHA the problem wit..." with 👍
15:43:06 From Linnea Gandhi to Everyone:
    A good book: Psychometrics: An Introduction (Furr & Bacharach)
15:43:29 From Linnea Gandhi to Everyone:
    Ch 8-9 could help at least understand how this is traditionally tackled.
15:46:52 From Linnea Gandhi to Everyone:
    WWJD
15:46:57 From Linnea Gandhi to Everyone:
    What would James do (in this situation)
15:47:47 From James Houghton to Everyone:
    Don’t give me too much credit
15:47:57 From Linnea Gandhi to Everyone:
    Reacted to "Don’t give me too mu..." with 💳
15:48:09 From Linnea Gandhi to Everyone:
    Reacted to "Don’t give me too mu..." with 🍰
15:49:35 From Linnea Gandhi to Everyone:
    Ben Lira might have an idea or two
15:49:47 From Linnea Gandhi to Everyone:
    (Emily I think he was in Uri's class with us last fall)
15:50:09 From Mark Whiting to Everyone:
    Reacted to "Don’t give me too mu..." with 🍰
15:53:36 From Linnea Gandhi to Everyone:
    This feels a bit unsolveable, kind of like any researcher DF: maybe the answer is you never will get a perfect measure, so make a decision, make it transparent and admit the limitations, and then people can critique and help improve it.
15:53:58 From Mark Whiting to Everyone:
    Reacted to "This feels a bit uns..." with 👍
15:54:23 From James Houghton to Everyone:
    If you have an outcome that you care about (task performance) and some intermediate outcomes (positivity, turn taking) and some interventions (communication skill training) then the engineer/manager just cares about the relationship between the intervention and the outcome. You might care about the intermediate outcomes if you have some other intervention that works on those intermediating variables that you could choose to draw on
15:55:15 From James Houghton to Everyone:
    In that case, you’d want to use whatever intermediating measures you used to measure these extra interventions...
15:56:28 From Linnea Gandhi to Everyone:
    (Sorry dogs dissected a toy - brb)
16:03:08 From Linnea Gandhi to Everyone:
    "nobody got fired for hiring mckinsey"
16:03:15 From James Houghton to Everyone:
    I thought it was IBM?
16:03:35 From Linnea Gandhi to Everyone:
    haha probably both
16:03:45 From Linnea Gandhi to Everyone:
    we always said mckinsey bc we were competitors :-0
16:04:06 From Linnea Gandhi to Everyone:
    "nobody got fired for citing something with 100 cites"???
16:05:21 From James Houghton to Everyone:
    My adage: All academic disagreements boil down to ambiguity in the research question…
16:05:30 From James Houghton to Everyone:
    =)
16:07:01 From Linnea Gandhi to Everyone:
    Reacted to "=)" with 👍
16:08:47 From Linnea Gandhi to Everyone:
    Mark's comment makes me think of "Policy Prediction Problems" from Sendhil
16:09:00 From Linnea Gandhi to Everyone:
    I don't need to know why its raining but just to bring an umbrella
16:09:59 From James Houghton to Everyone:
    Good enough is better than most?
16:10:47 From Emma Arsekin to Everyone:
    I have to run to another call, but thanks everyone!
16:12:05 From shape@upenn.edu to Everyone:
    @Mark can you download the chat? Maybe a helpful resource for Emily / we can upload to the github issue as part of the gut check (if everyone is okay with it)
16:12:56 From Mark Whiting to Everyone:
    Reacted to "@Mark can you downlo..." with 🍸

xehu commented 2 months ago

Closing, as this is about a previous version of the project and is no longer related to the toolkit.