Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations

Eval paper-qa 2 on lit qa 2 #413

Open amanpreet692 opened 3 days ago

amanpreet692 commented 3 days ago

Hello, thanks for the great work! I was looking for a way to evaluate PaperQA2 on LitQA2 and was wondering if you could help me with a couple of queries:

i) Assuming I am able to get the PDFs from the DOIs in the dataset into a local folder, how do I go about running PaperQA2 end to end, specifically on the LitQA2 multiple-choice questions?

ii) Are the answers from PaperQA2 and the other systems in the paper on the LitQA2 questions available publicly?

iii) Lastly, the human and non-RAG results on LitQA2 in the LAB-Bench paper seem different from those in the PaperQA2 paper. I am not sure if I am missing something, but could you say a bit more about this?

Thanks!

dosubot[bot] commented 3 days ago

Hey @amanpreet692! 👋 I'm here to help you with any bugs, questions, or becoming a contributor. Let's squash those bugs together!

To evaluate PaperQA2 on Lit-QA 2, here are the steps and answers to your queries:

i) Running PaperQA2 End-to-End on Lit-QA 2 Multiple Choice Questions

  1. Install PaperQA2:

    pip install paper-qa
  2. Prepare Your Local Dataset:

    • Ensure you have the PDFs from the DOIs in the Lit-QA 2 dataset in a local folder.
  3. Run PaperQA2:

    • Navigate to your directory containing the PDFs:
      cd my_papers
    • Use the pqa command to ask a question:
      pqa ask 'How can carbon nanotubes be manufactured at a large scale?'

This process will index your local PDF files, gather the necessary metadata, search over that index, break the files into chunked evidence contexts, rank them, and ultimately generate an answer [1].
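If you prefer to drive this from Python instead of the CLI, a minimal sketch (assuming the `paperqa` package's `Settings`/`ask` interface and a local `my_papers` folder of PDFs; adjust to your installed version) would be:

    # Minimal sketch: programmatic equivalent of `pqa ask`
    # Assumes PDFs live in ./my_papers and that `Settings` / `ask` are exposed by `paperqa`
    from paperqa import Settings, ask

    response = ask(
        "How can carbon nanotubes be manufactured at a large scale?",
        settings=Settings(paper_directory="my_papers"),
    )
    # The returned response object contains the generated answer and its citations
    print(response)

This mirrors what the CLI does: build the index over `my_papers`, retrieve and rank evidence chunks, and generate a cited answer.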

ii) Availability of Answers from PaperQA2 and Other Systems on Lit-QA 2 Questions

I wasn't able to find an answer within the repository, but the answer may be available elsewhere or I could have missed it.

iii) Differences in Results Between Lab Bench Paper and PaperQA2

The differences in results between the Lab Bench paper and PaperQA2 on Lit-QA 2 could be due to several factors, including different evaluation metrics, datasets, or configurations used in the experiments. Without more specific details, it's challenging to pinpoint the exact reasons for the discrepancies. It might be helpful to review the methodologies and configurations used in both evaluations to understand the differences better.

To continue talking to Dosu, mention @dosu.

jamesbraza commented 3 days ago

Hello @amanpreet692 thanks for the questions.

i) Assuming I am able to get the PDFs from the DOIs in the dataset into a local folder, how do I go about running PaperQA2 end to end, specifically on the LitQA2 multiple-choice questions?

We have a bunch of tickets here on making some docs for end-to-end usage of the code. For now, you can look at the unit tests for direct access to the code.

ii) Are the answers from PaperQA2 and the other systems in the paper on the LitQA2 questions available publicly?

Yeah they are from the LAB-Bench paper, whose data is here: https://github.com/Future-House/lab-bench and https://huggingface.co/datasets/futurehouse/lab-bench/viewer/LitQA2

Let's have part of resolving this ticket be to document this better in the README.
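For reference, here's a quick sketch of loading the public LitQA2 subset with the Hugging Face `datasets` library (the subset name comes from the dataset viewer URL; the split and column names are assumptions, so adjust if they differ):

    # Sketch: pull the public LitQA2 questions from the LAB-Bench dataset on Hugging Face
    from datasets import load_dataset

    # "LitQA2" is the subset shown in the dataset viewer; the "train" split name is assumed
    litqa2 = load_dataset("futurehouse/lab-bench", "LitQA2", split="train")
    print(len(litqa2))
    # Inspect one row to see the available columns (e.g. question, ideal, distractors)
    print(litqa2[0])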

iii) Lastly, the human and non-RAG results on LitQA2 in the LAB-Bench paper seem different from those in the PaperQA2 paper. I am not sure if I am missing something, but could you say a bit more about this?

Can you clarify what differences you're observing?

amanpreet692 commented 3 days ago

Thanks for the prompt reply! Some follow-ups, please:

We have a bunch of tickets here on making some docs for end-to-end usage of the code. For now, you can look at the unit tests for direct access to the code.

Could you kindly point me to a relevant test case?

Yeah they are from the LAB-Bench paper, whose data is here: https://github.com/Future-House/lab-bench and https://huggingface.co/datasets/futurehouse/lab-bench/viewer/LitQA2

Thanks! However, the dataset only consists of the ground-truth answers, if I am not mistaken? I was more interested in comparing another RAG-based QA system with PaperQA2; since not all 248 questions are public and the paper reports only the cumulative score, I'd really appreciate any pointers on this.

Can you clarify what differences you're observing?

Sure! One instance would be that human accuracy in Fig. 1B of the PaperQA2 paper is reported as 63.1, whereas it's reported as 70 in Table 2 in the appendix of the LAB-Bench paper.

Thanks again for the great work!