ItzCrazyKns / Perplexica

Perplexica is an AI-powered search engine. It is an open-source alternative to Perplexity AI.
MIT License
16.46k stars · 1.54k forks

Content gathering from searXNG response websearch #378

Closed · om-scogo closed this issue 1 month ago

om-scogo commented 1 month ago

Hey @ItzCrazyKns,

I wanted to commend you on this repository—it's a gem. The project structure is well-organized, and the logic is solid. I recently forked your code, built it locally, and ran some tests—everything worked seamlessly!

While reviewing the academicSearchAgent.ts file in the src/agents folder, I noticed that you're using the 'content' from the SearXNG response directly in the document's page_content. However, it seems like the snippet contains only a small portion of text. I might be missing some detail here, so please feel free to correct me if I’m wrong, but could you explain how this limited content is enough to serve as context in your implementation?
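For reference, the pattern being asked about looks roughly like this. This is a simplified sketch, not the actual code from academicSearchAgent.ts; the interface and field names here are assumptions based on the typical shape of a SearXNG JSON result and a LangChain-style document:

```typescript
// Sketch of the pattern in question: each SearXNG result carries only a
// short `content` snippet, and that snippet is used directly as the
// document's page content (field names are illustrative assumptions).

interface SearxngResult {
  title: string;
  url: string;
  content?: string; // short snippet returned by the search engine
}

interface Doc {
  pageContent: string;
  metadata: { title: string; url: string };
}

function resultsToDocuments(results: SearxngResult[]): Doc[] {
  return results
    .filter((r) => r.content !== undefined && r.content.length > 0)
    .map((r) => ({
      pageContent: r.content as string, // only the snippet, not the full page
      metadata: { title: r.title, url: r.url },
    }));
}
```

The point of the question is that `pageContent` ends up holding only the engine's snippet rather than the full text of the page.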

[Screenshot: 2024-09-29 at 8:45:37 PM]

Thanks!

ItzCrazyKns commented 1 month ago

I thought about this for a while and was sure someone was definitely going to ask. So I'll explain it in detail:

First of all, let's talk about how Perplexity does it. Perplexity maintains its own index, built by its crawlers or from open data, who knows, and performs searches on it just like Google does, but it exposes more content than Google, so the model has more data to write an answer from. It does this at lightning speed, which tells me they're using something more than a simple keyword-based search (though not anything AI-related, for sure) and that their models are getting more data.

Now, in Perplexica, I can't maintain a search index the way Google, Bing, and Perplexity do, so I had to rely on popular search engines to gather data to answer the question. I chose SearXNG, a metasearch engine that aggregates results from popular search engines like Google, Bing, Brave, Yahoo, etc. Initially, I had planned to scrape each website, perform similarity searches, and then gather data for answering, but that would take far too long, so I decided against it. I had to find another way to tackle the problem, so instead I perform a similarity search on the snippet content received from SearXNG and then pass the most relevant snippets to the model for answering.
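The reranking step described above can be sketched as follows. Perplexica actually uses embedding-based similarity; here a simple term-overlap cosine score stands in for the embedding model so the example is self-contained, and all names are illustrative:

```typescript
// Sketch of snippet reranking: score each snippet against the query and
// keep the top-K. A term-frequency cosine similarity stands in for the
// real embedding-based similarity search.

function termVector(text: string): Map<string, number> {
  const vec = new Map<string, number>();
  (text.toLowerCase().match(/[a-z0-9]+/g) ?? []).forEach((tok) => {
    vec.set(tok, (vec.get(tok) ?? 0) + 1);
  });
  return vec;
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  a.forEach((w, tok) => {
    dot += w * (b.get(tok) ?? 0);
  });
  const norm = (v: Map<string, number>) => {
    let s = 0;
    v.forEach((w) => {
      s += w * w;
    });
    return Math.sqrt(s);
  };
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

function rankSnippets(query: string, snippets: string[], topK = 10): string[] {
  const q = termVector(query);
  return snippets
    .map((s) => ({ s, score: cosine(q, termVector(s)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((x) => x.s);
}
```

Swapping `termVector`/`cosine` for an embedding model and vector similarity gives the shape of what the thread describes: no page scraping, just ranking of the snippets the metasearch engine already returned.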

Now the question arises: is this beneficial? It indeed is. Most search engines perform keyword-based searches, and we are able to get the data we need from multiple sources, or at least most of it. Each result returns a short snippet of the website, and we get 10-15 snippets, which together is enough knowledge for the model to write an answer. Most models also have background knowledge and can infer the information in between; when they infer too much, that's what we call hallucinating, and it happens only rarely, mainly when not enough data is retrieved.
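Assembling those 10-15 snippets into model context can be sketched like this. The numbering-plus-source format is an assumption for illustration, not Perplexica's actual prompt template:

```typescript
// Sketch of building prompt context from ranked snippets: number each
// snippet and tag it with its source URL so the model can ground and
// cite its answer (the exact format is a hypothetical choice).

interface Snippet {
  content: string;
  url: string;
}

function buildContext(snippets: Snippet[], maxSnippets = 15): string {
  return snippets
    .slice(0, maxSnippets)
    .map((s, i) => `[${i + 1}] ${s.content} (source: ${s.url})`)
    .join("\n");
}
```

The resulting string would be prepended to the user's question in the model prompt, capped at `maxSnippets` to keep the context size bounded.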

What is the solution for this? The Co-pilot feature I'm working on will bring something close to scraping the whole website. It won't work exactly like Perplexity's Co-pilot feature, but it will be something of its own and will provide really good-quality answers.

Those were all my thoughts. Have a good day!

tqdo commented 1 month ago

Will Co-pilot involve scraping any whole website?