Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations

Incorrect LLM Used and API Environment Variable Warnings #642

Open jukti3742 opened 3 weeks ago

jukti3742 commented 3 weeks ago

When running the script, I encountered two main issues:

  1. The library defaults to the GPT-4o model even though I set the LLM to gpt-4o-mini, which leads to higher costs.
  2. I received warnings about missing environment variables for external APIs, and I don't understand why they are needed.

Code Snippet:

import os
from paperqa import Docs, Settings
# ... (definitions of get_absolute_paths, folder_path, question, and my_qa_prompt elided)
doc_paths = get_absolute_paths(folder_path)
docs = Docs()

for doc in doc_paths:
    docs.add(doc)

settings = Settings()
settings.llm = "gpt-4o-mini"
settings.summary_llm = "gpt-4o-mini"
settings.prompts.qa = my_qa_prompt
# settings.answer.answer_max_sources = 3

answer = docs.query(
    question,
    settings=settings,
)

print(answer.formatted_answer)

Warnings:

SEMANTIC_SCHOLAR_API_KEY environment variable not set. Semantic Scholar API rate limits may apply.
CROSSREF_MAILTO environment variable not set. Crossref API rate limits may apply.
CROSSREF_API_KEY environment variable not set. Crossref API rate limits may apply.
dosubot[bot] commented 3 weeks ago

Hey there, @jukti3742! I'm Dosu, a bot here to help you with bugs, questions, and becoming a contributor. Let's tackle this issue together while we wait for a human maintainer.

  1. LLM Defaulting Issue: The paperqa library defaults to using "gpt-4o-2024-08-06" as the LLM. If your script is still using this model despite setting settings.llm and settings.summary_llm to "gpt-4o-mini", ensure that these settings are correctly applied before the query method is called. Double-check that there are no other parts of your code or library defaults that might override these settings.

  2. Environment Variable Warnings: The warnings about missing environment variables indicate that the library is trying to access external APIs without the necessary credentials. For Semantic Scholar and Crossref, you need to set the SEMANTIC_SCHOLAR_API_KEY, CROSSREF_MAILTO, and CROSSREF_API_KEY environment variables to avoid rate limits. You can set these in your environment using:

    export SEMANTIC_SCHOLAR_API_KEY='your_api_key'
    export CROSSREF_MAILTO='your_email'
    export CROSSREF_API_KEY='your_api_key'

    Setting these helps you avoid the stricter rate limits that apply to unauthenticated requests [1][2].
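    If you prefer to configure these from Python instead of the shell, a minimal sketch is shown below; the variable names come from the warnings above, and the values are placeholders for your own credentials:

    import os

    # Placeholder values -- replace with your real credentials.
    # Set these before paper-qa makes its metadata requests (i.e., before docs.add / docs.query).
    os.environ["SEMANTIC_SCHOLAR_API_KEY"] = "your_api_key"
    os.environ["CROSSREF_MAILTO"] = "your_email"
    os.environ["CROSSREF_API_KEY"] = "your_api_key"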

To continue talking to Dosu, mention @dosu.

jukti3742 commented 3 weeks ago

Not helpful, since settings are correctly set to "gpt-4o-mini" before the query method is called. Please see the code snippet.

mmtftr commented 2 weeks ago

For the warnings, please read https://github.com/Future-House/paper-qa#installation

For the model selection issue:

I had the same problem, and realized that the documentation has not been updated to reflect API changes.

The agent's model is actually taken from the agent key of the Settings object, not from llm or summary_llm as the docs suggest. The fix is to also pass your model name inside that key, as follows:

from paperqa import Settings
from paperqa.agents.main import AgentSettings

settings = Settings(
    agent=AgentSettings(
        agent_llm="gpt-4o-mini",  # your desired LLM
    ),
)
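For reference, here is a sketch that sets the same model everywhere, combining the documented llm/summary_llm fields with the agent override above (docs and question are the objects from the original snippet; treat this as an illustration rather than the canonical API):

from paperqa import Settings
from paperqa.agents.main import AgentSettings

settings = Settings(
    llm="gpt-4o-mini",          # answer/evidence model (documented field)
    summary_llm="gpt-4o-mini",  # summarization model (documented field)
    agent=AgentSettings(
        agent_llm="gpt-4o-mini",  # agent model, which otherwise falls back to the gpt-4o default
    ),
)

# docs and question as defined in the original snippet
answer = docs.query(question, settings=settings)
print(answer.formatted_answer)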

If this fixes your problem, please close the issue; I'll open a separate issue for the stale docs.