With the Google Search tool, only the page snippets are sent to the LLM / GPT

danny-avila / LibreChat

With the Google Search tool, only the page snippets are sent to the LLM / GPT #2420

Open cpbotha opened 5 months ago

cpbotha commented 5 months ago

What happened?

The Google Search tool sends the LLM only a tiny fraction of each search result page: the title and a short snippet. This does not give the LLM much to work with.

Steps to Reproduce

Configure the Google Search plugin and add it in either the Plugins or the Assistants (my preference) mode.

Ask a question that requires a web search, then open the result that is sent back to the LLM. It is the raw Google Search results JSON, which includes only page titles and snippets (a tiny extract of each page), not the actual contents of the search result pages.
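To make the limitation concrete, here is a minimal sketch of what the model has to work with. It assumes the response follows the shape of Google's Custom Search JSON API (an `items` array whose entries carry `title`, `link` and `snippet`). The helper name `summarizeSearchItems` and the sample data are made up for illustration.

```javascript
// Hypothetical helper: keep only the fields of each search hit that describe
// the page itself. Assumes a Custom Search style response with an `items` array.
function summarizeSearchItems(response) {
  const items = Array.isArray(response.items) ? response.items : [];
  return items.map(({ title, link, snippet }) => ({ title, link, snippet }));
}

// Made-up sample response; a real one carries more metadata per item,
// but still no page body.
const sampleResponse = {
  items: [
    {
      title: 'Example Domain',
      link: 'https://example.com/',
      snippet: 'This domain is for use in illustrative examples ...',
      displayLink: 'example.com',
    },
  ],
};

const toolOutput = summarizeSearchItems(sampleResponse);
console.log(JSON.stringify(toolOutput, null, 2));
```

Whatever the page behind `link` actually says, the only text from it that reaches the model is the short `snippet` string.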

PR

I've modified the Google Search tool to extract page contents using Readability.js, and to return all of that to the LLM. See #2419
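As a rough sketch of the shape such a change might take (this is not the code from the PR), the function below attaches extracted page text to each search result. The download and extraction steps are injected as the hypothetical callbacks `fetchHtml` and `extractText`; in the setting described above they would correspond to an HTTP GET and a Readability-style extractor. The `maxChars` limit is also an assumption, added to keep the tool output bounded.

```javascript
// Sketch only: `fetchHtml`, `extractText` and `maxChars` are hypothetical
// names chosen for this illustration.
async function attachPageContents(items, { fetchHtml, extractText, maxChars = 4000 }) {
  return Promise.all(
    items.map(async (item) => {
      try {
        const html = await fetchHtml(item.link);
        // An extractor may return null or an empty string for unreadable pages.
        const text = (extractText(html, item.link) || '').replace(/\s+/g, ' ').trim();
        return { ...item, content: text.slice(0, maxChars) };
      } catch (err) {
        // Fall back to the snippet-only result if the page cannot be
        // fetched or parsed, so one bad page does not fail the whole search.
        const message = err instanceof Error ? err.message : String(err);
        return { ...item, content: null, error: message };
      }
    }),
  );
}
```

Injecting the two callbacks keeps the merging logic independent of any particular HTTP client or extraction library, and makes it easy to exercise with stubs.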

danny-avila commented 5 months ago

This is expected. I appreciate the effort in addressing this, but I'm not sure I would expect the tool to scrape the search results.

cpbotha commented 5 months ago

Personally I've not been able to come up with another way of giving the LLM access to page contents.

Of course we could give it a PageContentsFetch tool (which one should consider, very helpful when you want to ask it questions about a specific page), but this would come down to the same, just take a bit longer.

Please let me know how you would like to proceed. :)

danny-avila commented 5 months ago

> Personally I've not been able to come up with another way of giving the LLM access to page contents.
>
> Of course we could give it a PageContentsFetch tool (which one should consider, very helpful when you want to ask it questions about a specific page), but this would come down to the same, just take a bit longer.
>
> Please let me know how you would like to proceed. :)

Scraping is fine; just allow some way to configure proxies (including SOCKS5) for however the scraping is done. Personally, I wouldn't want to host any LLM scraping at work without rotating proxies.

cpbotha commented 5 months ago

Do you want the scraping logic to support rotation internally (i.e. read a list of proxies from configuration and rotate or randomize over them), or are you OK with a single configurable proxy (in which case users would have to use a proxy-of-proxies service that rotates the upstreams)?

Note to self: Look into https://github.com/TooTallNate/proxy-agents/tree/main/packages/proxy-agent

danny-avila commented 5 months ago

No need, it just needs to handle simple proxy configuration, ideally both regular and SOCKS5 proxies. A lot of proxy services do the rotating for you, and doing that here, internally, might be beyond the scope of the project.

cpbotha commented 5 months ago

I've added proxy-agent, which honours the standard proxy environment variables and selects the right {http,https,socks}-proxy-agent, and activated it for the axios.get that fetches the page contents.
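For readers unfamiliar with the convention, the sketch below illustrates how the usual `HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY` and `NO_PROXY` variables can map a target URL to a proxy and an agent family. It is a simplified illustration, not the actual logic of the proxy-agent package; the function name `resolveProxy` and the returned `family` field are hypothetical.

```javascript
// Simplified illustration of the conventional proxy environment variables.
// NOT the real proxy-agent implementation; all names here are hypothetical.
function resolveProxy(targetUrl, env) {
  const { protocol, hostname } = new URL(targetUrl);
  const scheme = protocol.replace(':', ''); // "http" or "https"
  const host = hostname.toLowerCase();

  // NO_PROXY: comma-separated host suffixes that bypass the proxy; "*" bypasses all.
  const noProxy = (env.NO_PROXY || env.no_proxy || '')
    .split(',')
    .map((entry) => entry.trim().toLowerCase())
    .filter(Boolean);
  const bypass = noProxy.some((entry) => {
    if (entry === '*') return true;
    const suffix = entry.replace(/^\*?\./, ''); // ".corp.example" -> "corp.example"
    return host === suffix || host.endsWith('.' + suffix);
  });
  if (bypass) return null;

  // Scheme-specific variable first (upper or lower case), then ALL_PROXY.
  const key = `${scheme}_proxy`;
  const proxyUrl = env[key.toUpperCase()] || env[key] || env.ALL_PROXY || env.all_proxy;
  if (!proxyUrl) return null;

  // The proxy URL's own scheme suggests which agent family would be needed.
  const proxyScheme = new URL(proxyUrl).protocol.replace(':', '');
  const family = proxyScheme.startsWith('socks') ? 'socks' : proxyScheme;
  return { proxyUrl, family };
}

// Example: a SOCKS5 proxy configured for HTTPS targets.
console.log(resolveProxy('https://example.com/page', { HTTPS_PROXY: 'socks5://127.0.0.1:1080' }));
```

Under this convention a deployment switches between a regular and a SOCKS5 proxy by changing an environment variable, and a rotating proxy service can sit behind that single URL.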

How would you like to proceed?