agentcoinorg / evo.ninja

A versatile generalist agent.
MIT License

Feat: Local web scraping without relying on third-party providers such as SERP #582

Open rihp opened 6 months ago

rihp commented 6 months ago

Is your feature request related to a problem? Please describe.
Execute web scraping tasks on a local machine without relying on third-party providers such as SERP. SERP costs money.

Describe the solution you'd like
The Selenium driver is the current go-to for web scraping locally. It is well documented, so implementing it on a local machine should be fairly straightforward.
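
A minimal sketch of that approach using the Node selenium-webdriver bindings (assuming Chrome and a matching chromedriver are installed locally; the function name is illustrative):

const { Builder, By } = require('selenium-webdriver');

async function scrapePage(url) {
  // Start a local Chrome session through the installed chromedriver.
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get(url);
    // Return the title and visible body text as a simple starting point for analysis.
    const title = await driver.getTitle();
    const text = await driver.findElement(By.css('body')).getText();
    return { title, text };
  } finally {
    await driver.quit();
  }
}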

Describe alternatives you've considered
Beautiful Soup, or an HTTP request with parsing scripts to analyze the content of the pages, similar to the Polywrap web-scrape wrapper. DuckDuckGo offers an API which is somewhat decent, but it's again a third party.
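
In JavaScript, the HTTP-request-plus-parsing alternative could look roughly like this (cheerio is used here as an assumption, playing the role Beautiful Soup plays in Python):

const cheerio = require('cheerio');

async function fetchAndParse(url) {
  // Plain HTTP request; fetch is built into Node 18+.
  const res = await fetch(url);
  const html = await res.text();
  const $ = cheerio.load(html);
  // Collect headings and link targets as a simple example of analyzing page content.
  const headings = $('h1, h2').map((_, el) => $(el).text().trim()).get();
  const links = $('a[href]').map((_, el) => $(el).attr('href')).get();
  return { headings, links };
}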

Additional context
Requested by Glitch and Devil on Discord: https://discord.com/channels/1146873191969587220/1184124197853732874

orishim commented 6 months ago

Here is one approach that uses Puppeteer to browse and take screenshots that are passed to GPT-4V:

GPT4V + Puppeteer = AI agent browse web like human? 🤖
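
A sketch of that flow: capture a screenshot with Puppeteer and hand it to a vision model (assuming the openai Node SDK; the model name and prompt are placeholders):

const puppeteer = require('puppeteer');
const OpenAI = require('openai');

async function describePage(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // Base64-encode the screenshot so it can be sent as a data URL.
  const screenshot = await page.screenshot({ encoding: 'base64' });
  await browser.close();

  const openai = new OpenAI();
  const response = await openai.chat.completions.create({
    model: 'gpt-4-vision-preview',
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: 'What are the main elements on this page?' },
        { type: 'image_url', image_url: { url: `data:image/png;base64,${screenshot}` } },
      ],
    }],
  });
  return response.choices[0].message.content;
}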

Mayorc1978 commented 4 months ago

I would also suggest the Browserless API. The cloud service is expensive, but if you host it yourself with Docker you get both Puppeteer and Playwright endpoints, which let you point to a remote Chrome instance via the browserWSEndpoint option. Switching to Browserless is a single-line code change, with a connection like this:

Puppeteer

// Instead of launching a local browser:
const browser = await puppeteer.launch();
// Connect to the self-hosted Browserless endpoint:
const browser = await puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' });

Playwright

// Instead of launching a local browser:
const browser = await pw.chromium.launch();
// Connect to the self-hosted Browserless endpoint:
const browser = await pw.chromium.connect('ws://localhost:3000/playwright/chromium');
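
For reference, a sketch of end-to-end usage against a self-hosted instance (the Docker image name follows Browserless's older Docker Hub docs and may have changed, so verify it; the function name is illustrative):

// Start a local Browserless container, e.g.:
//   docker run -p 3000:3000 browserless/chrome
// Then scrape through it with Puppeteer instead of a locally launched Chrome:
const puppeteer = require('puppeteer-core');

async function getPageText(url) {
  const browser = await puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' });
  const page = await browser.newPage();
  await page.goto(url);
  // Pull the rendered text out of the page for downstream analysis.
  const text = await page.evaluate(() => document.body.innerText);
  await browser.close();
  return text;
}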