apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

executablePath in PuppeteerCrawler #738

Closed AdnanCukur closed 4 years ago

AdnanCukur commented 4 years ago

There seems to be no way to make PuppeteerCrawler use my existing Chrome installation with a specific profile.

For example:

"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --profile-directory="Profile 2"

new Apify.PuppeteerCrawler({
  requestQueue,
  launchPuppeteerOptions: {
    useChrome: true,
    stealth: true,
    headless: false,
  },
})

mnmkng commented 4 years ago

You can use all the options of puppeteer.launch() in LaunchPuppeteerOptions. Instead of useChrome: true, set executablePath to the path of your Chrome binary.
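
A minimal sketch of that suggestion, reusing the options from the original post. The helper name `chromeLaunchOptions` is hypothetical and not part of the Apify SDK; `requestQueue` and `handlePageFunction` in the usage comment are assumed to be defined elsewhere.

```javascript
// Hypothetical helper (not part of the Apify SDK): builds the
// launchPuppeteerOptions object with executablePath in place of useChrome.
function chromeLaunchOptions(executablePath) {
  return {
    executablePath, // path to the locally installed Chrome binary
    stealth: true,
    headless: false,
  };
}

const launchPuppeteerOptions = chromeLaunchOptions(
  'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
);

// Usage, assuming `requestQueue` and `handlePageFunction` already exist:
// const crawler = new Apify.PuppeteerCrawler({
//   requestQueue,
//   handlePageFunction,
//   launchPuppeteerOptions,
// });
```

Backslashes in the Windows path have to be escaped inside a JavaScript string literal, as shown.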

AdnanCukur commented 4 years ago

Yes, but it still doesn't open my profile. The Chrome instance that gets opened has no profiles connected. I copied and pasted the command from my personal Chrome shortcut.

executablePath: 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
args: ['--profile-directory=Default']

Basically, I would like Apify to use my personal Chrome browser with all my extensions, sessions and cookies, where I'm still logged into Google.

This doesn't work.

I've come across a site that uses reCAPTCHA v3, and after a while it starts blocking all requests from my scraper. It starts working again after a few hours, but not for nearly long enough for me to finish scraping.

This site doesn't seem to work for me at all in incognito mode. I've probably already ruined my IP's reputation, so Google doesn't trust it when I browse anonymously. Everything works fine when I use my regular browser profile, and I can click around the site as much as I want without it blocking me.

Yes, I've tried stealth mode, and I don't use headless.

Do you know if it's possible to use Apify with Firefox instead of Chrome?

mnmkng commented 4 years ago

reCAPTCHA allows only a limited number of requests from a single IP per hour/day. I'd advise against using your personal profile for scraping, because it can get flagged, and after a while you'll be solving reCAPTCHA challenges yourself while browsing the web. I'm speaking from experience.

Have you tried your approach with plain Puppeteer? Apify supports everything Puppeteer supports and we don't do any magic with executablePath or args, so the fact that it doesn't work may have nothing to do with Apify. I did a quick Google search on adding a profile directory and there doesn't seem to be a definitive guide for all environments.
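
One way to try this with plain Puppeteer is sketched below. It rests on an assumption not confirmed in this thread: Puppeteer starts Chrome with a fresh temporary user data directory unless `userDataDir` is set, so `--profile-directory` on its own would select a profile inside that empty directory. The helper name `profileLaunchOptions`, the fallback path and the default Windows profile location are illustrative, and this may not work in every environment.

```javascript
const path = require('path');

// Hypothetical helper: builds options for puppeteer.launch() that point
// Chrome at an existing user data directory and a named profile inside it.
function profileLaunchOptions(executablePath, userDataDir, profileName) {
  return {
    executablePath,
    headless: false,
    userDataDir, // directory that contains the profile folders
    // Passed as a single argv element, so no shell quoting is needed.
    args: [`--profile-directory=${profileName}`],
  };
}

// Assumed default Chrome user data location on Windows; the fallback
// path is a placeholder for machines where LOCALAPPDATA is not set.
const userDataDir = path.win32.join(
  process.env.LOCALAPPDATA || 'C:\\Users\\me\\AppData\\Local',
  'Google',
  'Chrome',
  'User Data',
);

const options = profileLaunchOptions(
  'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
  userDataDir,
  'Profile 2',
);

// Usage (requires `npm install puppeteer`), inside an async function:
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(options);
```

Chrome typically locks a user data directory while it is in use, so any regular Chrome window using that profile would likely need to be closed first.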

And yes, you can use Apify with Firefox via Puppeteer's product option. Make sure to install Firefox with Puppeteer first by running PUPPETEER_PRODUCT=firefox npm install.
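
A small sketch of the Firefox setup, assuming the Puppeteer version in use still accepts the `product` launch option (newer Puppeteer releases may name it differently). The helper `withFirefox` is hypothetical; `requestQueue` and `handlePageFunction` in the usage comment are assumed to exist.

```javascript
// Hypothetical helper: copies the given launch options and adds
// Puppeteer's `product` option so that Firefox is launched.
function withFirefox(launchOptions = {}) {
  return { ...launchOptions, product: 'firefox' };
}

const firefoxOptions = withFirefox({ headless: false });

// Install the Firefox build first:
//   PUPPETEER_PRODUCT=firefox npm install
//
// Usage, assuming `requestQueue` and `handlePageFunction` already exist:
// const crawler = new Apify.PuppeteerCrawler({
//   requestQueue,
//   handlePageFunction,
//   launchPuppeteerOptions: firefoxOptions,
// });
```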

AdnanCukur commented 4 years ago

Thank you, I'll try that.

This probably isn't an Apify issue, so I'll close this. Thanks for all the help.

mbledkowski commented 1 year ago

Hi, if someone else finds themselves here, this is how to set it in a PlaywrightCrawler object:

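import os from 'node:os';
import { PlaywrightCrawler } from 'crawlee';

// A common Chromium location on Linux.
let browser = '/usr/bin/chromium';
// NixOS exposes system packages under /run/current-system/sw/bin instead.
if (os.type() === 'Linux' && os.version().includes('NixOS')) {
  browser = '/run/current-system/sw/bin/chromium';
}

const crawler = new PlaywrightCrawler({
  launchContext: {
    launchOptions: {
      // Launch this Chromium binary rather than Playwright's bundled one.
      executablePath: browser,
    },
  },
});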