websearch: inject the URLs too?

SillyTavern / SillyTavern-Extras

Extensions API for SillyTavern.

GNU Affero General Public License v3.0


websearch: inject the URLs too? #202

Closed. Technologicat closed this issue 8 months ago.

Technologicat commented 9 months ago

Sorry for posting a lot in a short time, but there's one more idea that came up in my initial testing:

Currently, websearch only injects the text, and discards the URLs where the matches came from. Sometimes, it would be useful to have the URLs available in the prompt - for example, when querying for the URL of some particular piece of open source documentation. Sure, I could use a search engine the traditional way instead, but it would be nice for this shiny new technology to support that use case, too.

I quickly looked through the source code (SillyTavern-extras/modules/websearch/script.py, SillyTavern-extras/server.py, and SillyTavern/public/scripts/extensions/third-party/Extension-WebSearch/index.js), and I can understand why it's like that. It seems nontrivial to extract the links together with the relevant surrounding text, at least by the CSS filtering approach that is currently used.
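To illustrate what "links together with the surrounding text" could mean in the simplest case, here is a minimal sketch that pairs each anchor's `href` with its own link text, using only Python's standard `html.parser`. This is not how the websearch module works; `LinkCollector`, `extract_links`, and the sample HTML are hypothetical names and data made up for illustration, and real search-result pages would need more careful handling to associate a link with the snippet text that sits outside the `<a>` element.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace in the collected link text.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    """Return a list of (href, anchor text) pairs found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical search-result fragment, for illustration only.
sample = (
    '<div><a href="https://example.com/docs"><h3>Example Docs</h3></a>'
    "<span>Intro text</span></div>"
)
links = extract_links(sample)
```

Note that the snippet text in the `<span>` is dropped here, which is exactly the hard part: tying a link to the relevant text around it depends on the page layout.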

Still, maybe something to consider later.

Cohee1207 commented 9 months ago

I have a web crawler stashed for the next release that can access the links returned by web search. It's not exactly a "read that URL" kind of thing, because that could lead to targeted injection or privacy-breaching attacks.

Technologicat commented 9 months ago

Ok, that's interesting. Feeding in material from the internet would be useful.

However, that's a bit different from what I meant - I'd like to be able to ask, e.g., where a particular piece of documentation is available on the web, and get the AI to give me a clickable URL that I can then open in a web browser.

Cohee1207 commented 9 months ago

That part with "get the AI to give me a clickable URL" is prone to hallucinations, especially with small models. It can give you links that are (1) non-working, (2) outdated, or (3) just plain wrong.

Technologicat commented 9 months ago

Yes, definitely, that's what happens when the LLM is tasked to generate URLs. Being essentially a fancy autocomplete machine, the model will just make up something that plausibly sounds like it came from its training distribution.

My intuition here was to avoid the "generate". Having the actual correct link injected into the prompt (from the web search) should make the model less likely to hallucinate, since this transforms the task into rephrasing information that is already available in the context.
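As a rough sketch of that idea, the snippet below keeps each source URL next to its text when building the injected block, so the model only has to copy a link that is already in the context. The function names, the prompt wording, and the sample result are all assumptions made for illustration; this is not the format the extension actually injects.

```python
def format_results(results):
    """Render (url, snippet) pairs as a numbered block, keeping each URL with its text."""
    lines = []
    for i, (url, snippet) in enumerate(results, start=1):
        lines.append(f"[{i}] {snippet}\n    Source: {url}")
    return "\n".join(lines)


def build_prompt(question, results):
    """Assemble a raw prompt: injected search results first, then the question."""
    return (
        "Web search results:\n"
        f"{format_results(results)}\n\n"
        "Answer using only the results above. "
        "When giving a link, copy it verbatim from a Source line.\n\n"
        f"Question: {question}\n"
    )


# Hypothetical search result, for illustration only.
results = [("https://example.com/docs", "Example project documentation.")]
prompt = build_prompt("Where are the Example docs?", results)
```

Whether a given model actually copies the link faithfully is still an empirical question.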

It's a fair point that LLMs are still rather unreliable, and I haven't tested the success rate of this approach. Come to think of it, I could fairly easily run a bunch of tests by hand-crafting the raw prompt in ooba's notebook mode. Perhaps I should do that.
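One way such a hand test could be scored, sketched under assumptions: pull every URL out of the model's reply and check whether it appears verbatim among the URLs that were injected. The regular expression is deliberately simple and will miss edge cases, and `find_urls`, `split_urls`, and the sample reply are hypothetical names and data, not part of any existing tool.

```python
import re

# Simple URL matcher: stops at whitespace, angle brackets, parentheses, and square brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def find_urls(text):
    """Return URLs found in text, with trailing sentence punctuation stripped."""
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]


def split_urls(reply, injected_urls):
    """Split the URLs in reply into those copied from the context and those not found in it."""
    allowed = set(injected_urls)
    copied, invented = [], []
    for url in find_urls(reply):
        (copied if url in allowed else invented).append(url)
    return copied, invented


# Hypothetical model reply, for illustration only.
reply = "The docs are at https://example.com/docs. See also https://example.com/made-up"
copied, invented = split_urls(reply, ["https://example.com/docs"])
```

Counting non-empty `invented` lists over a batch of replies would give a crude hallucination rate for a given model and prompt format.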

I have to say that since Mistral became a thing, 7Bs have come a surprisingly long way during the last few months, but it may be that my expectations for them are nevertheless a tad optimistic. :)

Technologicat commented 8 months ago

I see you've added this too - the links=on option does what I intended, and it also includes text from the linked page in the search results, which is probably a better solution than just a bare link.

Thanks! I'll experiment around with this.

Implemented, so closing the ticket.