MrCsabaToth opened 4 months ago
There is the `web_scraper` package (https://fluttergems.dev/packages/web_scraper/), however it seems to be unmaintained; maybe we can find a fresh fork? Many articles simply use the Dart `http` package's parser capabilities:
I was looking at Chaleno and dart_web_scraper and realized a few things:
Both of these mean that I'll need a Cloud Function (or Cloud Run) with Chrome WebDriver + Selenium capability.
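To make that concrete, here is a minimal sketch of what the scraping core of such a function could look like in a Node/TypeScript Cloud Function, assuming `selenium-webdriver` plus Chrome and a matching chromedriver are available in the runtime (the actual setup in the web-search repo may differ):

```typescript
// Hypothetical sketch: fetch a rendered page with headless Chrome via selenium-webdriver.
// Assumes Chrome and chromedriver are installed in the function's container image.
import { Builder, WebDriver } from 'selenium-webdriver';
import { Options } from 'selenium-webdriver/chrome';

export async function fetchRenderedHtml(url: string): Promise<string> {
  const options = new Options().addArguments('--headless=new', '--no-sandbox', '--disable-dev-shm-usage');
  const driver: WebDriver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(options)
    .build();
  try {
    await driver.get(url);
    return await driver.getPageSource(); // HTML after client-side rendering
  } finally {
    await driver.quit();
  }
}
```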
The other question is what / how exactly I will scrape.
The idea is to obtain the `vqd` token with the first call, and then make a https://links.duckduckgo.com/d.js call with it.

There's also the DuckAssist beta summary, which is "Auto-generated based on listed sources. Responses may contain inaccuracies." I suspect DuckDuckGo makes an LLM summary of the top hits. This is golden if we can grab it! It looks like DuckAssist is not turned on right away all the time; we either need to click the Assist button, or just try to add the `&assist=true` URL parameter: https://duckduckgo.com/?q=what+is+palm2&t=h_&ia=web&assist=true
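A small sketch of forcing that parameter on when building the query URL (standard URL API; an assumption that DuckDuckGo honors `assist=true` without clicking the Assist button, and the DuckAssist summary is presumably rendered client-side, so grabbing it would still need a rendered-page scrape):

```typescript
// Hypothetical: build the DuckDuckGo SERP URL with the DuckAssist parameter forced on.
const assistUrl = new URL('https://duckduckgo.com/');
assistUrl.searchParams.set('q', 'what is palm2');
assistUrl.searchParams.set('ia', 'web');
assistUrl.searchParams.set('assist', 'true');
console.log(assistUrl.toString());
// -> https://duckduckgo.com/?q=what+is+palm2&ia=web&assist=true
```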
The plan was promising, but DuckDuckGo detects the scrape. I was even thinking of https://www.octoparse.com/ or relying on SerpAPI: https://serpapi.com/blog/how-to-scrape-duckduckgo-results/. The official DuckDuckGo API only works with very clear and separated entities; it falters (empty result) for anything meaningful. Gemini would be able to answer anything that DuckDuckGo provides via that API endpoint, so without a scrape it's useless.
I developed a function based on https://github.com/Open-Multi-Modal-Personal-Assistant/web-search. Articles:
Also note that I needed `"allowSyntheticDefaultImports": true` to get rid of:

```
node_modules/@types/selenium-webdriver/chromium.d.ts(1,8): error TS1192: Module '"/workspace/node_modules/@types/selenium-webdriver/http"' has no default export.
node_modules/@types/selenium-webdriver/lib/webdriver.d.ts(15,8): error TS1192: Module '"/workspace/node_modules/@types/selenium-webdriver/lib/command"' has no default export.
```

(Error ID: 1a2262f3)
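For context, a minimal sketch of the tsconfig.json fragment that setting lives in (the rest of the functions project's compiler options are omitted here):

```json
{
  "compilerOptions": {
    // Allows `import x from 'y'` syntax for modules without a default export,
    // which the selenium-webdriver type definitions rely on.
    "allowSyntheticDefaultImports": true
  }
}
```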
Efforts are in https://github.com/Open-Multi-Modal-Personal-Assistant/web-search now, but I haven't progressed for some weeks; other issues here are now the priority.
Now with Firebase Functions (#53) we might try to move this next to the other functions. I wonder whether the GCP Selenium trick only works with GCP Cloud Functions, or with Firebase Functions too.
I've found an article about scraping with Playwright (what Rabbit R1 LAM uses) in a Firebase function: https://github.com/Open-Multi-Modal-Personal-Assistant/web-search/issues/3
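For comparison with the Selenium sketch above, a minimal Playwright equivalent could look like the following (assuming the `playwright` package and its bundled Chromium fit within the Firebase Functions runtime, which is what the linked issue is about; the article's actual setup may differ):

```typescript
// Hypothetical alternative to the Selenium sketch above, using Playwright.
import { chromium } from 'playwright';

export async function fetchWithPlaywright(url: string): Promise<string> {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    return await page.content(); // HTML after client-side rendering
  } finally {
    await browser.close();
  }
}
```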
I noticed during testing that the Web Search Tool's current API calls (the tool uses DuckDuckGo), which follow the https://stackoverflow.com/a/37012658/292502 format (`https://api.duckduckgo.com/?q=<your search string>&format=json&pretty=1&no_html=1&skip_disambig=1`), several times don't return any result (for example for "What is O'Reilly Auto Part 121G"), even though the web UI manually does show data.
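To illustrate the failure mode, a minimal sketch of what the current tool call boils down to (plain fetch; the `AbstractText` field name comes from the Instant Answer API's JSON response, and for queries like the one above it tends to come back empty):

```typescript
// Hypothetical repro of the empty-result behavior of the Instant Answer API.
async function instantAnswer(query: string): Promise<string> {
  const url =
    `https://api.duckduckgo.com/?q=${encodeURIComponent(query)}` +
    '&format=json&pretty=1&no_html=1&skip_disambig=1';
  const response = await fetch(url);
  const data = await response.json();
  // Well-known, clearly separated entities fill AbstractText; niche queries often leave it empty.
  return data.AbstractText || '(empty result)';
}

// e.g. instantAnswer("What is O'Reilly Auto Part 121G") tends to resolve to '(empty result)'
```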
The StackOverflow entry states:

I don't care about links yet, but this syndication seems to blank out many results. So we should probably refactor to a web-scraper alternative, which a SerpApi engineer describes: https://stackoverflow.com/a/68379691/292502. This entry contains extremely valuable information: it points out how to obtain a specific `vqd` token and carry it over to a follow-up call, so essentially it'd be a two-call solution (a sketch of this flow is included after the quoted answer below). However, for us the full rich result might not be as important, so we might simply be able to go with a single https://links.duckduckgo.com/d.js call and then scrape? Since some genius deleted the SerpApi engineer's answer, I'll include it here (in case it'd disappear):
If you are interested in retrieving rich results as well ("Recent News", "Images for query", "Knowledge Graph", etc.), the non-JS web version of DuckDuckGo: https://duckduckgo.com/html/ would NOT provide this for you.
To get the FULL DuckDuckGo page your best option is to query the next link: `https://links.duckduckgo.com/d.js?`. You can find it by inspecting the network tab. This is where all the results are stored.

Example for search query "bill gates": https://links.duckduckgo.com/d.js?q=bill%20gates&kl=us-en&l=us-en&s=0&ct=US&ss_mkt=us&vqd=3-41771934349821924699896735607141847775-125937012480658240237583475471092551742
There are two required parameters here. The first is the query `q`, the second is the search token `vqd`. The `vqd` token is unique to each search and tied to the query, so you wouldn't be able to reuse the token with a different query. Also, I think that it has an expiration time of 48 hours. I did try to reverse engineer the token, but to my best knowledge it is generated on the server side, and there is no way to do this.
One more parameter that is worth mentioning is `s`; it's used for pagination. It defines the result offset, and skips the given number of results.

So the flow to retrieve the results would be:

1. Obtain the `vqd` search token.
2. Use the `vqd` token to construct the request URL: https://links.duckduckgo.com/d.js
Alternatively, you could use a third-party solution like SerpApi; it does all this, and more, for you.
Example python code (available in other libraries also):
Example JSON output:
Check out the documentation for more details.
Disclaimer: I work at SerpApi.
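Putting the quoted flow together, here is a hedged TypeScript sketch of the two-call approach (the `vqd` extraction regex and the exact `d.js` parameters are assumptions based on the answer above, and DuckDuckGo may still detect and block this, as noted earlier):

```typescript
// Hypothetical two-call flow: fetch the SERP to obtain the vqd token,
// then call links.duckduckgo.com/d.js with the query and that token.
async function duckDuckGoSearch(query: string, offset = 0): Promise<string> {
  // 1. First call: the HTML page embeds the vqd token for this query.
  const serp = await fetch(`https://duckduckgo.com/?q=${encodeURIComponent(query)}`);
  const html = await serp.text();
  const match = html.match(/vqd=["']?([\d-]+)/); // assumed token format in the page source
  if (!match) {
    throw new Error('Could not extract the vqd token');
  }

  // 2. Second call: construct the d.js request URL with q, vqd and the
  //    pagination offset s, as described in the quoted answer.
  const djs = new URL('https://links.duckduckgo.com/d.js');
  djs.searchParams.set('q', query);
  djs.searchParams.set('vqd', match[1]);
  djs.searchParams.set('kl', 'us-en');
  djs.searchParams.set('s', String(offset));

  const results = await fetch(djs.toString());
  return results.text(); // JavaScript-like payload that still needs to be parsed/scraped
}
```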