Open cpbotha opened 5 months ago
This is expected. I appreciate the effort in addressing this, but I'm not sure I would expect the tool to scrape the search results.
Personally I've not been able to come up with another way of giving the LLM access to page contents.
Of course we could give it a PageContentsFetch tool (which one should consider anyway, as it is very helpful when you want to ask questions about a specific page), but this would amount to the same thing and just take a bit longer.
Please let me know how you would like to proceed. :)
Scraping is fine; just allow some way to configure proxies (including SOCKS5) for however the scraping is done. Personally, I wouldn't want to host any LLM scraping at work without rotating proxies.
Do you want the scraping logic to support the rotation internally (i.e. get list of proxies from configuration, rotate / randomize over them), or are you OK with a proxy being configurable? (in which case users will have to make use of a proxy-proxy service that rotates the upstreams)
Note to self: Look into https://github.com/TooTallNate/proxy-agents/tree/main/packages/proxy-agent
No need. It just needs to handle simple proxy configuration, ideally for both regular and SOCKS5 proxies. Many proxy services do the rotating for you, and doing it internally here might be beyond the scope of the project.
I've added proxy-agent, which honours the standard proxy environment variables and selects the right {http,https,socks}-proxy-agent, and enabled it for the axios.get call that fetches the page contents.
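For readers unfamiliar with the environment-variable convention, here is a minimal sketch of the idea in plain JavaScript. It is not proxy-agent's actual implementation: the helper names `proxyForUrl` and `proxyKind` are hypothetical, and the NO_PROXY matching is deliberately simplified to exact host and suffix matches.

```javascript
// Hypothetical helper (not part of proxy-agent): choose a proxy URL for a
// target URL from the conventional environment variables.
function proxyForUrl(targetUrl, env = process.env) {
  const { protocol, hostname } = new URL(targetUrl);
  const scheme = protocol.replace(":", ""); // "http" or "https"
  const host = hostname.toLowerCase();

  // NO_PROXY is a comma-separated list of hosts that bypass the proxy.
  const noProxy = (env.NO_PROXY || env.no_proxy || "")
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter(Boolean);
  const bypass = noProxy.some((entry) => {
    if (entry === "*") return true;
    const suffix = entry.replace(/^\*?\./, ""); // ".corp" and "*.corp" -> "corp"
    return host === suffix || host.endsWith("." + suffix);
  });
  if (bypass) return null;

  // HTTP_PROXY / HTTPS_PROXY first, then ALL_PROXY as a fallback.
  const key = `${scheme}_proxy`;
  return (
    env[key.toUpperCase()] || env[key] || env.ALL_PROXY || env.all_proxy || null
  );
}

// Hypothetical helper: classify the proxy by its URL scheme, which is what
// decides between an HTTP(S) proxy and a SOCKS proxy.
function proxyKind(proxyUrl) {
  if (!proxyUrl) return "direct";
  const scheme = new URL(proxyUrl).protocol.replace(":", "");
  if (scheme.startsWith("socks")) return "socks"; // socks, socks4, socks5, socks5h
  return scheme === "https" ? "https" : "http";
}
```

With the real package, one common way to wire this into axios is to pass a `ProxyAgent` instance as both `httpAgent` and `httpsAgent` and set `proxy: false` so axios does not apply its own proxy handling; the thread does not show how the PR itself does it.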
How would you like to proceed?
What happened?
The Google Search tool sends only tiny fractions of the search result pages. This does not give the LLM much to work with.
Steps to Reproduce
Configure and add the Google Search plugin to either the Plugins or Assistants (my preference) modes.
Ask a question that will necessitate a web search. Open the result that is sent back to the LLM: This is the raw Google Search results JSON, which only includes page titles and snippets (a tiny extract of the page), but not the actual contents of the search result pages.
PR
I've modified the Google Search tool to extract page contents using Readability.js, and to return all of that to the LLM. See #2419
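To make the difference concrete, the following is a rough sketch of the shape such an enriched result could take. It is not the code from the PR: `mergePageContents`, the `content` field, and the `maxChars` cap are all hypothetical, and the page text is assumed to have been extracted already (for example by Readability.js) and keyed by URL. The `title`, `link`, and `snippet` fields are the ones found on Google Custom Search result items.

```javascript
// Hypothetical sketch: combine search result items with page text that was
// extracted separately, so the LLM receives more than title and snippet.
function mergePageContents(items, textByUrl, maxChars = 4000) {
  return items.map(({ title, link, snippet }) => {
    const text = textByUrl.get(link);
    return {
      title,
      link,
      snippet,
      // Cap the text to limit prompt size (maxChars is an assumption);
      // fall back to the snippet when no text was extracted for this page.
      content: text ? text.slice(0, maxChars) : snippet,
    };
  });
}
```

Falling back to the snippet keeps a result usable when a page cannot be fetched or parsed, rather than dropping it.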