apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

ERROR The function passed to Apify.main() threw an exception: Error: spawn ps ENOENT #754

Closed gvojtko closed 3 years ago

gvojtko commented 3 years ago

Hi, I have a problem running the spider. I also tried the basic crawler example https://sdk.apify.com/docs/examples/basic-crawler (new Apify.BasicCrawler) and the Puppeteer one https://sdk.apify.com/docs/examples/puppeteer-crawler (new Apify.PuppeteerCrawler). Neither works. If I call await Apify.launchPuppeteer directly, it works.

Apify is running inside a Docker container.

INFO  System info {"apifyVersion":"0.21.1","apifyClientVersion":"0.6.0","osType":"Linux","nodeVersion":"v12.18.2"}
WARN  Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/home/node/app/apify_storage"
INFO  PuppeteerCrawler: Final request statistics: {"avgDurationMillis":null,"perMinute":0,"finished":0,"failed":0,"retryHistogram":[]}
ERROR The function passed to Apify.main() threw an exception:
  Error: spawn ps ENOENT
      at Process.ChildProcess._handle.onexit (internal/child_process.js:267:19)
      at onErrorNT (internal/child_process.js:469:16)
      at processTicksAndRejections (internal/process/task_queues.js:84:21)
mnmkng commented 3 years ago

Hi, apparently you're running a very small Docker image that does not include the ps program. You can use one of our images, which come preinstalled with everything that's needed. You can visit the repo to see their sources.

A small image that does not include the Chrome browser. Good for use with CheerioCrawler.

FROM apify/actor-node-basic

An image with Chrome, to be used with PuppeteerCrawler and headless: true.

FROM apify/actor-node-chrome

An image with Xvfb, for use with a headful browser (headless: false).

FROM apify/actor-node-chrome-xvfb

Or you can use any other image. Just make sure it has the most common Linux utilities, such as ps (provided by the procps package on Debian-based images), installed.
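For example, if you want to stay on a slim base image, installing procps (which is the package that ships ps on Debian-based images) should be enough to get past this particular error. A minimal sketch, assuming a Debian-based base image:

```dockerfile
FROM node:12-slim

# `ps` lives in the procps package on Debian-based images; install it so
# child-process monitoring that shells out to `ps` does not fail with ENOENT.
RUN apt-get update \
    && apt-get install -y --no-install-recommends procps \
    && rm -rf /var/lib/apt/lists/*
```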

gvojtko commented 3 years ago

My Dockerfile, with Chrome installed. Puppeteer itself works well, but not through the crawler wrapper.

FROM node:12-slim

ARG project_root=.

RUN apt-get update \
    && apt-get install -y wget gnupg \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
    && apt-get update \
    && apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst fonts-freefont-ttf libxss1 \
      --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

# HTTPS certificates, Chrome runtime libraries, extra tools, and fonts.
# A fresh `apt-get update` is needed here because the package lists were
# removed at the end of the previous RUN step.
RUN apt-get update \
    && apt-get install -yq --no-install-recommends \
       ca-certificates \
       libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 \
       wget xdg-utils \
       fonts-liberation \
    && rm -rf /var/lib/apt/lists/*

USER node

RUN mkdir -p /home/node/app && \
    chown -R node:node /home/node/app && \
    mkdir /home/node/app/node_modules && \
    chown -R node:node /home/node/app/node_modules && \
    mkdir /home/node/.npm-global && \
    chown -R node:node /home/node/.npm-global

ENV PATH=/home/node/.npm-global/bin:$PATH
ENV NPM_CONFIG_PREFIX=/home/node/.npm-global

WORKDIR /home/node/app

COPY ${project_root}/browserless-rest-api /home/node/app

RUN npm install --quiet --no-progress --global npm@latest
RUN npm install --quiet --no-progress --global nodemon
RUN npm install --quiet --no-progress --global

COPY --chown=node:node . .

EXPOSE 8080

CMD ["nodemon", "--legacy-watch", "server.js"]

Is it possible to set browserWSEndpoint for the Puppeteer launcher?

mnmkng commented 3 years ago

I'm not sure why the crawler would not work. Could you provide more details?

Regarding browserWSEndpoint: yes, it will work. You'll need to provide a custom launchPuppeteerFunction, use puppeteer.connect() there, and return its return value (the Browser instance) from the launchPuppeteerFunction.
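A minimal sketch of that wiring, assuming the Apify SDK 0.21.x launchPuppeteerFunction option and an already-running browser (the ws:// endpoint below is a placeholder, not a real service). The factory takes the puppeteer module as a parameter so the shape is easy to test without a browser:

```javascript
// Returns a launchPuppeteerFunction that connects to an existing browser
// over its WebSocket endpoint instead of spawning a local Chrome.
// `puppeteerLike` is whatever module exposes `connect()` (normally
// require('puppeteer')); it is injected here so the sketch is testable.
const makeLaunchFunction = (puppeteerLike, browserWSEndpoint) =>
    async () => puppeteerLike.connect({ browserWSEndpoint });

// Wiring it into the crawler (assumes Apify SDK 0.21.x; the endpoint is a
// placeholder for e.g. a browserless container):
//
//   const Apify = require('apify');
//   const puppeteer = require('puppeteer');
//   const crawler = new Apify.PuppeteerCrawler({
//       requestList,
//       launchPuppeteerFunction: makeLaunchFunction(puppeteer, 'ws://browserless:3000'),
//       handlePageFunction: async ({ page, request }) => { /* ... */ },
//   });
```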

gvojtko commented 3 years ago

Hi, thanks for the reply. I have another problem now. If I run the following script for the first time, it's fine. If I run it again, no pages are scraped.

Script:

const Apify = require('apify');

Apify.main(async () => {
    const sources = [
        'https://apify.com/store?category=TRAVEL',
        'https://apify.com/store?category=ECOMMERCE',
        'https://apify.com/store?category=ENTERTAINMENT',
    ];

    const requestList = await Apify.openRequestList('categories', sources);
    const requestQueue = await Apify.openRequestQueue();

    const crawler = new Apify.CheerioCrawler({
        maxRequestsPerCrawl: 50,
        requestList,
        requestQueue,
        handlePageFunction: async ({ $, request }) => {
            console.log(`Processing ${request.url}`);

            // This is our new scraping logic.
            if (request.userData.detailPage) {
                const urlArr = request.url.split('/').slice(-2);

                const results = {
                    url: request.url,
                    uniqueIdentifier: urlArr.join('/'),
                    owner: urlArr[0],
                    title: $('header h1').text(),
                    description: $('header p[class^=Text__Paragraph]').text(),
                    lastRunDate: new Date(
                        Number(
                            $('time')
                                .eq(1)
                                .attr('datetime'),
                        ),
                    ),
                    runCount: Number(
                        $('ul.stats li:nth-of-type(3)')
                            .text()
                            .match(/\d+/)[0],
                    ),
                };
                console.log('RESULTS', results);
            }

            // Only enqueue new links from the category pages.
            if (!request.userData.detailPage) {
                await Apify.utils.enqueueLinks({
                    $,
                    requestQueue,
                    selector: 'div.item > a',
                    baseUrl: request.loadedUrl,
                    transformRequestFunction: req => {
                        req.userData.detailPage = true;
                        return req;
                    },
                });
            }
        },
    });

    await crawler.run();
});

Result of the first code execution:

INFO  System info {"apifyVersion":"0.21.3","apifyClientVersion":"0.6.0","osType":"Linux","nodeVersion":"v12.18.2"}
WARN  Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/home/node/app/apify_storage"
INFO  CheerioCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to 498 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
INFO  CheerioCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
Processing https://apify.com/store?category=TRAVEL
Processing https://apify.com/store?category=ECOMMERCE
Processing https://apify.com/store?category=ENTERTAINMENT
Processing https://apify.com/drobnikj/crawler-google-places
Processing https://apify.com/dtrungtin/airbnb-scraper
Processing https://apify.com/maxcopell/tripadvisor
Processing https://apify.com/eaglejohn/booking-scraper-copy
Processing https://apify.com/dtrungtin/booking-scraper
Processing https://apify.com/lukaskrivka/foursquare-reviews
Processing https://apify.com/vaclavrut/amazon-crawler
Processing https://apify.com/jakubbalada/content-checker
Processing https://apify.com/tugkan/aliexpress-scraper
Processing https://apify.com/jaroslavhejlek/kickstarter-search
Processing https://apify.com/scaleleap/zine-not-amazon-scraper
Processing https://apify.com/emastra/google-shopping-scraper
Processing https://apify.com/lukaskrivka/images-download-upload
Processing https://apify.com/tugkan/asos-scraper
Processing https://apify.com/emastra/actor-autotrader-scraper
Processing https://apify.com/emastra/hm-scraper
Processing https://apify.com/mihails/amazon-bestsellers-scraper
Processing https://apify.com/vaclavrut/alza-cz
Processing https://apify.com/vaclavrut/mall-cz
Processing https://apify.com/emastra/forever21-scraper
Processing https://apify.com/trudax/actor-nordstrom-scraper
Processing https://apify.com/petr_cermak/mironet-scraper
Processing https://apify.com/bernardo/youtube-scraper
Processing https://apify.com/tugkan/gutenberg-scraper
Processing https://apify.com/dtrungtin/imdb-scraper
Processing https://apify.com/sergeylukin/steam-puppeteer
Processing https://apify.com/vaclavrut/cernyrytir
Processing https://apify.com/c_inconnu/deezer-playlist-history
Processing https://apify.com/tugkan/edx-scraper
INFO  CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO  CheerioCrawler: Final request statistics: {"avgDurationMillis":734,"perMinute":170,"finished":32,"failed":0,"retryHistogram":[32]}

Result of repeated code execution:

INFO  System info {"apifyVersion":"0.21.3","apifyClientVersion":"0.6.0","osType":"Linux","nodeVersion":"v12.18.2"}
WARN  Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/home/node/app/apify_storage"
INFO  CheerioCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to 498 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
INFO  CheerioCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
INFO  CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO  CheerioCrawler: Final request statistics: {"avgDurationMillis":null,"perMinute":0,"finished":0,"failed":0,"retryHistogram":[]}

Same behavior for PuppeteerCrawler.

gvojtko commented 3 years ago

And other errors:

INFO  Launching Puppeteer {"args":["--no-sandbox","--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"],"timeout":15000,"headless":false,"defaultViewport":{"width":1366,"height":768}}
ERROR PuppeteerCrawler:PuppeteerPool: Browser launch failed {"id":2}
  Error: Failed to launch the browser process!
  Fontconfig warning: "/etc/fonts/fonts.conf", line 100: unknown element "blank"
  [1224:1224:0727/204459.484649:ERROR:browser_main_loop.cc(1469)] Unable to open X display.
  [0727/204459.492981:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
  Most likely you need to configure your SUID sandbox correctly
  [0727/204459.493089:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
  Most likely you need to configure your SUID sandbox correctly

  TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

      at onClose (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
      at Interface.<anonymous> (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:183:68)
      at Interface.emit (events.js:327:22)
      at Interface.close (readline.js:416:8)
      at Socket.onend (readline.js:194:10)
      at Socket.emit (events.js:327:22)
      at endReadableNT (_stream_readable.js:1221:12)
      at processTicksAndRejections (internal/process/task_queues.js:84:21)
ERROR PuppeteerCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.google-analytics.com","retryCount":3}
  Error: Failed to launch the browser process!
  Fontconfig warning: "/etc/fonts/fonts.conf", line 100: unknown element "blank"
  [1224:1224:0727/204459.484649:ERROR:browser_main_loop.cc(1469)] Unable to open X display.
  [0727/204459.492981:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
  Most likely you need to configure your SUID sandbox correctly
  [0727/204459.493089:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
  Most likely you need to configure your SUID sandbox correctly

  TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

      at onClose (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
      at Interface.<anonymous> (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:183:68)
      at Interface.emit (events.js:327:22)
      at Interface.close (readline.js:416:8)
      at Socket.onend (readline.js:194:10)
      at Socket.emit (events.js:327:22)
      at endReadableNT (_stream_readable.js:1221:12)
      at processTicksAndRejections (internal/process/task_queues.js:84:21)

and

INFO  Launching Puppeteer {"args":["--no-sandbox","--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"],"timeout":15000,"headless":false,"defaultViewport":{"width":1366,"height":768}}
ERROR PuppeteerCrawler:PuppeteerPool: Browser launch failed {"id":3}
  Error: Failed to launch the browser process!
  Fontconfig warning: "/etc/fonts/fonts.conf", line 100: unknown element "blank"
  [1795:1795:0727/204613.818868:ERROR:browser_main_loop.cc(1469)] Unable to open X display.

  TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

      at onClose (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
      at ChildProcess.<anonymous> (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:184:79)
      at ChildProcess.emit (events.js:327:22)
      at Process.ChildProcess._handle.onexit (internal/child_process.js:275:12)
Request https://news.ycombinator.com/ failed too many times
INFO  PuppeteerCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO  PuppeteerCrawler: Final request statistics: {"avgDurationMillis":null,"perMinute":0,"finished":0,"failed":1,"retryHistogram":[null,null,null,1]}
ERROR PuppeteerCrawler:PuppeteerPool: Cannot close the browsers.
  Error: Failed to launch the browser process!
  Fontconfig warning: "/etc/fonts/fonts.conf", line 100: unknown element "blank"
  [1756:1756:0727/204613.506537:ERROR:browser_main_loop.cc(1469)] Unable to open X display.
  [0727/204613.517190:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
  Most likely you need to configure your SUID sandbox correctly
  [0727/204613.517401:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
  Most likely you need to configure your SUID sandbox correctly

  TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

      at onClose (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
      at Interface.<anonymous> (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:183:68)
      at Interface.emit (events.js:327:22)
      at Interface.close (readline.js:416:8)
      at Socket.onend (readline.js:194:10)
      at Socket.emit (events.js:327:22)
      at endReadableNT (_stream_readable.js:1221:12)
      at processTicksAndRejections (internal/process/task_queues.js:84:21)
Crawler finished.
mnmkng commented 3 years ago

See the Getting Started guide. You need to run the crawler with

apify run -p

or delete the contents of ./apify_storage after each run.
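Manually, that purge amounts to removing the default local storages before each run. The paths below assume the default APIFY_LOCAL_STORAGE_DIR of ./apify_storage:

```shell
# Remove the persisted default request queue, key-value store and dataset
# so the next run starts from scratch instead of seeing every request
# as already handled.
rm -rf ./apify_storage/request_queues/default \
       ./apify_storage/key_value_stores/default \
       ./apify_storage/datasets/default
```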

Regarding the Puppeteer errors, please see the TROUBLESHOOTING link provided in the error messages, or use one of the pre-configured Docker images I linked above. It looks like your Dockerfile does not include all the necessary libraries, and that's not an issue with Apify.
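If you prefer to keep your own image but need headless: false, one common approach (a sketch based on your Dockerfile above, not something the SDK requires) is to install Xvfb and wrap the start command in xvfb-run, so Chrome finds a virtual X display:

```dockerfile
# Headful Chrome needs an X display; Xvfb provides a virtual one.
RUN apt-get update \
    && apt-get install -y --no-install-recommends xvfb \
    && rm -rf /var/lib/apt/lists/*

# --auto-servernum picks a free display number automatically.
CMD ["xvfb-run", "--auto-servernum", "node", "server.js"]
```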

gvojtko commented 3 years ago

Thanks, I will try it.

mnmkng commented 3 years ago

I'll close this since there's no bug. Feel free to continue the conversation if you have any more issues running the SDK.