JBGruber opened 1 month ago
I think we essentially just need a RAG system. I've started a Quarto notebook here: https://github.com/JBGruber/opinion-wg2/blob/gllm-annotation/paper-annotation-gllm/llm-annotation.qmd
The notebook already contains code that downloads the validation data. I think it's unlikely we can validate this automatically; it probably makes more sense to check manually whether each answer is correct and note it down. (Obviously we need to set a seed so the answers are reproducible.)
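The RAG idea mentioned above could start very simply: split a paper into chunks, score each chunk against a codebook question, and keep only the most relevant chunks for the model prompt. A minimal sketch, assuming plain term overlap as the relevance score (a real pipeline would use embeddings; all function names here are illustrative, not from the notebook):

```python
def chunk_text(text: str, size: int = 300) -> list[str]:
    """Split text into word-based chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(chunk: str, question: str) -> int:
    """Count question terms that appear in the chunk (crude relevance proxy)."""
    terms = set(question.lower().split())
    return sum(1 for w in chunk.lower().split() if w in terms)

def retrieve(text: str, question: str, k: int = 3) -> list[str]:
    """Return the k chunks most relevant to one codebook question."""
    chunks = chunk_text(text)
    return sorted(chunks, key=lambda c: score(c, question), reverse=True)[:k]
```

The retrieved chunks would then be pasted into the prompt instead of the full paper, which keeps us inside the context window of smaller open models like llama3.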
You can find the Codebook here: https://docs.google.com/document/d/185Q1IuJ0ebIFEb1BepMxkbzt23XrYX_QOJCUGqrKdbw/edit#heading=h.hkgzwqhr5jie
Or in this Notebook that set up the task (it also contains the variable names in the annotated data): https://github.com/JBGruber/opinion-wg2/blob/gllm-annotation/paper-annotation/3._wg2_full_paper_annotation.qmd
You should work in the "gllm-annotation" GitHub branch for this. You don't have to use my approach or the notebook, though (I much prefer Quarto over Jupyter, and you can use it with Python too). And if you have a better idea, let me know.
I would prefer that this is done via an open model, like llama3. But if this does not work well enough, I think we should consider OpenAI. Here are some thoughts:
Bruno is working on a pipeline for processing papers with GPT-4. Right now, the bottleneck is parsing URLs stored in footnotes, which end up far away from where they are referenced in the text.
Once he solves this, we move forward to process ~ 10 papers and estimate the costs before proceeding to test all papers from the test sample.
Depending on the costs, we need to agree on how to pay and which API key to use.
We proceed to test all papers and evaluate the performance.
During this process, we will determine whether this sample is enough or whether we need to annotate more full papers (#14).
- [ ] We need to agree on a slot for the next Zoom meeting after Salamanca.
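For the cost estimate step above, a back-of-the-envelope calculation is probably enough before we run the ~10 papers. A sketch, where the token counts and per-1k-token prices are placeholders to be replaced with the actual paper lengths and current OpenAI pricing:

```python
def estimate_cost(n_papers: int,
                  tokens_per_paper: int = 15_000,   # assumed full-paper length
                  output_tokens: int = 1_000,        # assumed total answer length
                  price_in_per_1k: float = 0.01,     # placeholder $/1k input tokens
                  price_out_per_1k: float = 0.03) -> float:  # placeholder $/1k output tokens
    """Rough cost in $ of annotating n_papers, one full-paper prompt each."""
    per_paper = (tokens_per_paper / 1000) * price_in_per_1k \
              + (output_tokens / 1000) * price_out_per_1k
    return round(n_papers * per_paper, 2)
```

With these placeholder numbers the 10-paper test run comes out at under two dollars, which would also give us a defensible extrapolation to the full test sample before we decide how to pay.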
The question in Q31_mult is the same as Q32. I believe this is a typo.
This has yielded excellent results in our coding in Limassol, where we did this with ChatGPT. Essentially, we simply ask a model to answer the questions we ask coders (outlined here). ChatGPT has a way to handle PDFs on the web interface (maybe also on the API). Ideally, we would try to do this with an open-source model like llama3 (open-webui also supports PDF uploads, though I'm not sure how it handles them). But given the length of the papers, and the fact that we might not be too concerned with reproducibility in this case, GPT-4o might make the most sense.
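One nice property of the "ask the model the coder questions" approach is that the same chat-message format works for both llama3 (e.g. via Ollama's OpenAI-compatible API) and GPT-4o, so we could swap models without rewriting the pipeline. A minimal sketch of the prompt construction; the instruction wording and the truncation limit are my assumptions, not the codebook's:

```python
def build_messages(question: str, paper_text: str, max_chars: int = 12_000) -> list[dict]:
    """Build a chat-style prompt asking the model one codebook question about a paper."""
    return [
        {"role": "system",
         "content": "You are an annotator. Answer strictly based on the paper text provided."},
        {"role": "user",
         "content": (f"Paper text:\n{paper_text[:max_chars]}\n\n"
                     f"Question: {question}\n"
                     "Answer with the codebook category only.")},
    ]
```

This list can be passed as the `messages` argument to either API; for reproducibility we would additionally pin `temperature=0` and, where supported, a fixed `seed`.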