Watts-Lab / team_comm_tools

An open-source Python library that turns multiparty conversational data into social-science-backed features.
https://teamcommtools.seas.upenn.edu/
MIT License

🤖 Map-GPT #89

Closed · xehu closed this issue 3 months ago

xehu commented 1 year ago

This week, @amaatouq reached out with an interesting idea, which is that we can potentially train a pipeline for using GPT to rate tasks (and even test to see if GPT can replicate our raters' mapping of task features). Mohammed Alsobay has done some amazingly rapid iteration, showing some promising initial results on using GPT on the Task Mapping questions.

Here are some of the results Mohammed found, using just 2 questions (Q1 - physical/mental and Q22 - conflicting tradeoffs).

Tried two questions: Q1 - Mental/Physical and Q22 - Tradeoffs… I’m rate-limited, so I developed on Q1 and chose Q22 because I wanted a question with high variance in task scores.
I present:
- The question
- Its elaboration
- The possible answers, forcing it to choose one -> I then map these to 0/1/null

Granted, this was done quickly, so these inputs still have HTML tags in them (i.e., the original format) since I didn’t bother regexing them out, but GPT seems to get the message.
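For concreteness, here is a minimal sketch of how such a prompt might be assembled and the answer mapped to 0/1/null. Everything here (`strip_html`, `build_prompt`, `ANSWER_MAP`, the answer labels) is a hypothetical illustration based on the description above, not Mohammed's actual code:

```python
import re

# Hypothetical mapping of the forced-choice answer to 0/1/null,
# per the description above; the labels are illustrative assumptions.
ANSWER_MAP = {"Mental": 0, "Physical": 1, "Cannot be determined": None}

def strip_html(text: str) -> str:
    """Regex out leftover HTML tags from the original question format."""
    return re.sub(r"<[^>]+>", "", text)

def build_prompt(question: str, elaboration: str, answers: list[str]) -> str:
    """Assemble one Task Mapping prompt: question, elaboration, options."""
    options = "\n".join(f"- {a}" for a in answers)
    return (
        f"Question: {strip_html(question)}\n"
        f"Elaboration: {strip_html(elaboration)}\n\n"
        f"Choose exactly one of the following answers:\n{options}\n"
        "Answer:"
    )
```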

In terms of evaluation:
- Correlation between the binary GPT scores and the raters' scores: 0.45 on Q1, 0.61 on Q22
- Agreement when thresholding rater scores at 0.5: 90% on Q1, 70% on Q22
- ROC AUC as a generalization of (2), taking GPT's binary output as the “labels” and the raters' scores as the “predictions”: 0.8 on Q1, 0.9 on Q22
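For anyone who wants to reproduce these three metrics, here is a minimal sketch; the arrays are made-up placeholders, not the actual results:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

# Placeholder data: GPT's binary answers per task, and the mean rater
# score per task in [0, 1]. These values are invented for illustration.
gpt = np.array([1, 0, 1, 1, 0, 0])
raters = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3])

corr, _ = pearsonr(gpt, raters)              # (1) correlation
agreement = (gpt == (raters >= 0.5)).mean()  # (2) agreement at a 0.5 threshold
auc = roc_auc_score(gpt, raters)             # (3) GPT output as labels, raters as scores
```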

We're now interested in playing around with pre-training and fine-tuning the model, to see if it can do better.

Given all this, I'm wondering whether...

Thoughts?

linneagandhi commented 1 year ago

Neat idea! Keen to see more results as you play with it.

We tried ChatGPT and a few other text-AI tools to automate some of the fields where we summarize elements of a study, and we've had little success. The output has been (a) inaccurate (!), (b) oddly inconsistent in formatting, and (c) padded with irrelevant material we don't want (e.g., details about a treatment when we really just want text focused on the outcome variable). I think it would be neat to incorporate some AI fields -- mostly to SAVE TIME in the coding process, even if it doesn't boost reliability -- but these issues have frustrated our team to date. If you find a solution that actually works in terms of accuracy, reliability, and precision, please LMK! The RA on my team who has worked most on this is Anushka Bhansali, in case you want to chat with her.

TimothyDorr95 commented 1 year ago

Sounds very interesting. There is a whole body of literature, called prompt engineering, on how to get the desired output from these models. I could see those best practices and methods helping mitigate the issues of inaccuracy, inconsistent formatting, and irrelevant additions.

Generally, "few-shot prompting", where you give example inputs and outputs (basically like training data), really helps with formatting and accuracy. Using a very clear format with unambiguous delimiters can also help (e.g., Context: {} Answer: {}). Lastly, there is an interesting idea called chain-of-thought prompting, where you ask the model to give intermediate reasoning steps; this could also be helpful. I'm by no means an expert at this, but I have read a bit of the literature and would be happy to chat about it if you want to play around and see if we can get it to work better.
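To make the few-shot idea concrete, here is a minimal sketch in the Context: {} / Answer: {} style; the example tasks and labels are invented for illustration:

```python
# Few-shot prompt sketch: two worked examples, then the query slot.
# The example tasks and labels are invented for illustration only.
FEW_SHOT_PROMPT = """\
Classify each task as Mental or Physical.

Context: Solve a Sudoku puzzle as a group.
Answer: Mental

Context: Move a heavy table across the room together.
Answer: Physical

Context: {task}
Answer:"""

prompt = FEW_SHOT_PROMPT.format(task="Estimate the number of beans in a jar.")
```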

markwhiting commented 1 year ago

We do (have API access to GPT-3)! I can add you.

We have been using it for Common sense, and I played a little with it for mapping but found it a bit too unreliable: running the same request gets different responses. This happens less with GPT-4, but it doesn't appear entirely robust.
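One common mitigation (an untested sketch here, not something validated on the mapping task) is to pin temperature to 0 and, if variance persists, take a majority vote over repeated calls. Assuming the `openai` Python client:

```python
from collections import Counter
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stable_answer(prompt: str, n: int = 5, model: str = "gpt-4") -> str:
    """Reduce response variance: temperature 0 plus a majority vote
    over n repeated calls. A sketch; not a determinism guarantee."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(resp.choices[0].message.content.strip())
    return Counter(answers).most_common(1)[0][0]
```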

xehu commented 1 year ago

Yes please!! @markwhiting, can you please add me? I'd love to play with it!

xehu commented 3 months ago

Closing, as this is about a previous version of the project and is no longer related to the toolkit.