Open-Multi-Modal-Personal-Assistant / OpenMMPA

Open Multi-Modal Personal Assistant
MIT License

Refactor web search to use web scraping instead of Duck Duck Go API #13

Open MrCsabaToth opened 4 months ago

MrCsabaToth commented 4 months ago

During testing I noticed that the Web Search Tool, which uses DuckDuckGo, often gets no result from its current API calls (for example for "What is O'Reilly Auto Part 121G"), even though the web UI does show data for the same query when I search manually. The calls follow the https://stackoverflow.com/a/37012658/292502 format (https://api.duckduckgo.com/?q=<your search string>&format=json&pretty=1&no_html=1&skip_disambig=1).

The StackOverflow entry states:

This API does not include all of our links, however. That is, it is not a full search results API or a way to get DuckDuckGo results into your applications beyond our instant answers. Because of the way we generate our search results, we unfortunately do not have the rights to fully syndicate our results. For the same reason, we cannot allow framing our results without our branding. Please see our partnerships page for more info on guidelines and getting in touch with us. This is an Instant Answer API, and not a full results API. However, there are some Web links within it, e.g. official sites.
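
A minimal Python sketch for reproducing the empty-result behavior: it builds the Instant Answer API URL in the format quoted above and checks whether a response carries any usable content. The helper names are hypothetical, and the set of fields treated as "content" (AbstractText, Answer, Definition, RelatedTopics, Results) is an assumption about what counts as a non-empty answer, not what the project code does.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.duckduckgo.com/"


def build_instant_answer_url(query: str) -> str:
    """Build the Instant Answer API URL with the parameters from the issue."""
    params = {
        "q": query,
        "format": "json",
        "pretty": 1,
        "no_html": 1,
        "skip_disambig": 1,
    }
    return f"{API_URL}?{urlencode(params)}"


def is_empty_answer(payload: dict) -> bool:
    """Return True when none of the checked fields holds any content.

    Assumption: these fields are the ones worth checking for a usable answer.
    """
    fields = ("AbstractText", "Answer", "Definition", "RelatedTopics", "Results")
    return not any(payload.get(field) for field in fields)


def instant_answer(query: str, timeout: float = 10.0) -> dict:
    """Call the API and decode the JSON body (performs a network request)."""
    with urlopen(build_instant_answer_url(query), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling is_empty_answer(instant_answer("What is O'Reilly Auto Part 121G")) would show whether a given query falls into the blank-result case described above.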

I don't care about links yet, but this syndication restriction seems to blank out many results. So we should probably refactor to a web scraping alternative, as a SerpApi engineer describes in https://stackoverflow.com/a/68379691/292502. That entry contains extremely valuable information: it points out how to obtain a query-specific vqd token and carry it over to a follow-up call, so it would essentially be a two-call solution. However, the full rich result might not be as important for us, so we might simply be able to go with a single https://links.duckduckgo.com/d.js call and then scrape?

Since some genius deleted the SerpApi engineer's answer, I'll include it here (in case it'd disappear):

If you are interested in retrieving rich results as well ("Recent News", "Images for query", "Knowledge Graph", etc.), the non-JS web version of DuckDuckGo: https://duckduckgo.com/html/ would NOT provide this for you.

To get the FULL DuckDuckGo page your best option is to query next link: https://links.duckduckgo.com/d.js?. You can find it by inspecting the network tab.

This is where all the results are stored.

Example for search query: "bill gates": https://links.duckduckgo.com/d.js?q=bill%20gates&kl=us-en&l=us-en&s=0&ct=US&ss_mkt=us&vqd=3-41771934349821924699896735607141847775-125937012480658240237583475471092551742

There are two required parameters here. First is the query q, second is the search token vqd.

vqd token is unique to each search, and tied to the query, so you wouldn't be able to reuse the token with the different query. Also, I think that it has an expiration time of 48h.

I did try to reverse engineer the token, but to my best knowledge, it is generated on the server side, and there is no way to do this.

One more parameter that is worth mentioning is s, it's used for pagination. It defines the result offset, and skips the given number of results.


So the flow to retrieve the results would be:

  1. Go to: https://duckduckgo.com/?q=bill+gates and get the vqd search token


  2. Use the vqd token to construct the request URL: https://links.duckduckgo.com/d.js
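
(Not part of the quoted answer.) The two steps above can be sketched in Python as follows. This is a rough illustration only: the regular expression assumes the token appears in the search page as vqd= followed by dash-separated digits, as in the example URL above, and both helper names are hypothetical. The s parameter carries the pagination offset described earlier.

```python
import re
from typing import Optional
from urllib.parse import quote, urlencode

# Assumption: the token shows up in the page source as vqd=<digits-with-dashes>,
# optionally wrapped in quotes. The real markup may differ or change over time.
VQD_PATTERN = re.compile(r"""vqd=["']?(\d+(?:-\d+)+)""")


def extract_vqd(page_html: str) -> Optional[str]:
    """Step 1: pull the vqd token out of the HTML of the regular search page."""
    match = VQD_PATTERN.search(page_html)
    return match.group(1) if match else None


def build_djs_url(query: str, vqd: str, offset: int = 0, region: str = "us-en") -> str:
    """Step 2: build the d.js request URL from the query, token, and offset."""
    params = {"q": query, "kl": region, "l": region, "s": offset, "vqd": vqd}
    # quote_via=quote encodes spaces as %20, matching the example URL above.
    return "https://links.duckduckgo.com/d.js?" + urlencode(params, quote_via=quote)
```

Because the token is tied to the query, extract_vqd has to run against the page fetched for the same query that is then passed to build_djs_url.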

Alternatively, you could use a third party solution like SerpApi, it does all this, and more for you.

Example python code (available in other libraries also):

from serpapi import GoogleSearch

params = {
  "api_key": "secret_api_key",
  "engine": "duckduckgo",
  "q": "bill gates",
  "kl": "us-en"
}

search = GoogleSearch(params)
results = search.get_dict()

Example JSON output:

"organic_results": [
  {
    "position": 1,
    "title": "Bill Gates - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Bill_Gates",
    "snippet": "Early life. Bill Gates was born in Seattle, Washington, on October 28, 1955. He is the son of William H. Gates Sr. (1925-2020) and Mary Maxwell Gates (1929-1994). His ancestry includes English, German, and Irish/Scots-Irish. His father was a prominent lawyer, and his mother served on the board of directors for First Interstate BancSystem and the United Way of America.",
    "favicon": "https://external-content.duckduckgo.com/ip3/en.wikipedia.org.ico",
    "sitelinks": [
      {
        "title": "Bill Gates Sr",
        "link": "https://en.wikipedia.org/wiki/Bill_Gates_Sr."
      },
      {
        "title": "Bill & Melinda Gates Foundation",
        "link": "https://en.wikipedia.org/wiki/Bill_%26_Melinda_Gates_Foundation"
      },
      {
        "title": "The World's Billionaires",
        "link": "https://en.wikipedia.org/wiki/The_World%27s_Billionaires"
      },
      {
        "title": "Bill Gates's House",
        "link": "https://en.wikipedia.org/wiki/Bill_Gates%27s_house"
      },
      {
        "title": "Mary Maxwell Gates",
        "link": "https://en.wikipedia.org/wiki/Mary_Maxwell_Gates"
      },
      {
        "title": "Paul Allen",
        "link": "https://en.wikipedia.org/wiki/Paul_Allen"
      }
    ]
  },
  ...
],
"knowledge_graph": {
  "title": "Bill Gates",
  "description": "William Henry Gates III is an American business magnate, software developer, investor, author, and philanthropist. He is a co-founder of Microsoft Corporation, along with his late childhood friend Paul Allen. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014. He is considered one of the best known entrepreneurs of the microcomputer revolution of the 1970s and 1980s. Gates was born and raised in Seattle, Washington. In 1975, he and Allen founded Microsoft in Albuquerque, New Mexico. It became the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, succeeded by Steve Ballmer, but he remained chairman of the board of directors and became chief software architect.",
  "website": "https://www.gatesnotes.com",
  "thumbnail": "https://duckduckgo.com/i/45eb7625.jpg",
  "facts": {
    "born": "William Henry Gates III, October 28, 1955, Seattle, Washington, U.S.",
    "education": "Harvard University (dropped out)",
    "occupation": "Software developer, investor, entrepreneur",
    ...
  },
  ...
},
"news_results": [
  {
    "position": 1,
    "title": "Bill Gates admitted 'messing up' marriage at 'moving' Sun Valley panel, report claims",
    "link": "https://news.yahoo.com/bill-gates-admitted-messing-marriage-174024577.html",
    "snippet": "Microsoft co-founder Bill Gates has purportedly confessed to \"messing up\" his marriage to his ex-wife Melinda French Gates, according to a report. The New York Post reported that the tech giant made the comments during an \"off the record\" question-and-answer session at an exclusive Sun Valley conference last week.",
    "source": "YAHOO!News",
    "date": "20 hours ago",
    "thumbnail": "https://s.yimg.com/uu/api/res/1.2/56GaZKhuq0Dq.TS8WNZL9g--~B/aD03Njg7dz0xMDI0O2FwcGlkPXl0YWNoeW9u/https://media.zenfs.com/en/the_independent_635/0b4d2e20d6c2c4ecb1565ae82180ee6a"
  },
  ...
],
"inline_images": [
  {
    "position": 1,
    "title": "Bill Gates says Jobs was a wizard in getting staff to keep ...",
    "link": "https://www.cnbc.com/2019/07/07/bill-gates-says-jobs-was-a-wizard-in-getting-staff-to-keep-apple-alive.html",
    "thumbnail": "https://tse1.mm.bing.net/th?id=OIP.5WUHRMXHSmzjh9W7HP80hQHaE8&pid=Api",
    "image": "https://image.cnbcfm.com/api/v1/image/105894488-15571466884731u8a0002r.jpg?v=1557843920"
  },
  ...
],
"related_searches": [
  {
    "query": "bill gates age",
    "link": "https://duckduckgo.com/?q=bill%20gates%20age"
  },
  {
    "query": "bill gates investments in china",
    "link": "https://duckduckgo.com/?q=bill%20gates%20investments%20in%20china"
  },
  {
    "query": "news about bill gates",
    "link": "https://duckduckgo.com/?q=news%20about%20bill%20gates"
  },
  ...
]
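
(Not part of the quoted answer.) If only the textual hits matter, a result dictionary shaped like the JSON above could be reduced to a few lines of text before handing it to the assistant. A minimal sketch, assuming the organic_results, title, link, and snippet keys shown above; the function name and the output format are hypothetical.

```python
def top_snippets(results: dict, limit: int = 3) -> list:
    """Collect "title (link): snippet" lines from the first few organic results."""
    lines = []
    for item in results.get("organic_results", [])[:limit]:
        title = item.get("title", "")
        link = item.get("link", "")
        snippet = item.get("snippet", "")
        lines.append(f"{title} ({link}): {snippet}")
    return lines
```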

Check out the documentation for more details.

Disclaimer: I work at SerpApi.

MrCsabaToth commented 4 months ago
  1. As far as the scraping itself goes, we could set up a GCP Cloud Function that uses "traditional" Selenium ChromeDriver Python logic.
  2. It'd be better if we could save the Cloud Function round trips, keep it local to the phone, and go native Dart.

Many articles simply use the Dart http package's parser capabilities:

MrCsabaToth commented 4 months ago

I was looking at Chaleno and dart_web_scraper and realized a few things:

  1. I'll need JavaScript capability
  2. The DOM search will need to match by element tag (h1) and by content. Neither of those packages has that.

Both of these mean that I'll need a Cloud Function (or Cloud Run) with Chrome WebDriver + Selenium capability.
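
A rough Python sketch of what such a function could do with Selenium and headless Chrome: load the page with JavaScript enabled, then look up elements by tag and by contained text through an XPath expression. The function names are hypothetical, the Chrome flags are assumptions about what a container environment needs, and this is not the code that ended up in the web-search repository.

```python
def xpath_for_tag_with_text(tag: str, text: str) -> str:
    """Build an XPath that matches <tag> elements whose text contains `text`."""
    if '"' in text and "'" in text:
        raise ValueError("text containing both quote types is not supported here")
    quote_char = "'" if '"' in text else '"'
    return f"//{tag}[contains(normalize-space(.), {quote_char}{text}{quote_char})]"


def find_matching_text(url: str, tag: str, text: str) -> list:
    """Render `url` in headless Chrome and return the text of matching elements."""
    # Imported lazily so the XPath helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")  # assumption: no display available
    options.add_argument("--no-sandbox")    # assumption: often needed in containers
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        elements = driver.find_elements(By.XPATH, xpath_for_tag_with_text(tag, text))
        return [element.text for element in elements]
    finally:
        driver.quit()
```

Content that JavaScript renders late may additionally need an explicit wait before the lookup; that is left out to keep the sketch short.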

The other question is what exactly I will scrape, and how.

  1. One option is to follow https://stackoverflow.com/a/68379691/292502 (quoted in detail in the issue opening), which is a two-round method: obtain the vqd with the first request, and then make a https://links.duckduckgo.com/d.js request with it.
  2. A one-round technique: load the regular search URL and extract the DuckAssist beta summary, which is "Auto-generated based on listed sources. Responses may contain inaccuracies.". I suspect DuckDuckGo makes an LLM summary of the top hits. This is golden if we can grab that!


MrCsabaToth commented 4 months ago

Looks like DuckAssist is not always turned on right away: we either need to click the Assist button or try adding the &assist=true URL parameter, as in https://duckduckgo.com/?q=what+is+palm2&t=h_&ia=web&assist=true
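
A small Python sketch of adding that parameter to an existing search URL without duplicating it. The function name is hypothetical, and whether assist=true reliably triggers DuckAssist is exactly the open question above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_assist(url: str) -> str:
    """Return `url` with assist=true set exactly once in its query string."""
    parts = urlsplit(url)
    # Drop any existing assist parameter, then append the one we want.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "assist"]
    query.append(("assist", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```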

MrCsabaToth commented 4 months ago

The plan was promising, but DuckDuckGo detects the scraping. I was even thinking of using https://www.octoparse.com/ or relying on SerpApi https://serpapi.com/blog/how-to-scrape-duckduckgo-results/. The official DuckDuckGo API only works with very clear and separate entities; it falters (empty result) for anything meaningful. Gemini would be able to answer anything that DuckDuckGo provides through the API endpoint. So without scraping it's useless.

MrCsabaToth commented 4 months ago

I developed a function based on https://github.com/CsabaConsulting/web-search

Articles:

Also note that I needed "allowSyntheticDefaultImports": true to get rid of these errors:

node_modules/@types/selenium-webdriver/chromium.d.ts(1,8): error TS1192: Module '"/workspace/node_modules/@types/selenium-webdriver/http"' has no default export.
node_modules/@types/selenium-webdriver/lib/webdriver.d.ts(15,8): error TS1192: Module '"/workspace/node_modules/@types/selenium-webdriver/lib/command"' has no default export.; Error ID: 1a2262f3
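
For reference, that flag belongs under compilerOptions in tsconfig.json. A minimal fragment, assuming the rest of the function's tsconfig.json stays as it is:

```json
{
  "compilerOptions": {
    "allowSyntheticDefaultImports": true
  }
}
```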

MrCsabaToth commented 3 months ago

Efforts are in https://github.com/Open-Multi-Modal-Personal-Assistant/web-search now, but I haven't made progress for some weeks; other issues here take priority now.

MrCsabaToth commented 2 months ago

Now with Firebase functions (#53) we might try to move this next to the other functions? I wonder whether the GCP Selenium trick only works with GCP functions or with Firebase too.

MrCsabaToth commented 2 months ago

I've found an article about scraping with Playwright (which the Rabbit R1 LAM uses) in a Firebase function: https://github.com/Open-Multi-Modal-Personal-Assistant/web-search/issues/3