Closed. gvojtko closed this issue 3 years ago.
Hi, apparently you're running a very small Docker image that does not include the `ps` program. You can use one of our images, which come preinstalled with everything that's needed. You can visit the repo to see their source.
- Small image that does not include the Chrome browser; good for use with `CheerioCrawler`:
  `FROM apify/actor-node-basic`
- Image with Chrome, to be used with `PuppeteerCrawler` and `headless: true`:
  `FROM apify/actor-node-chrome`
- Image with XVFB, for use with a headful browser (`headless: false`):
  `FROM apify/actor-node-chrome-xvfb`

Or you can use any other image. Just make sure it has the most common Linux libraries, such as `ps`, installed.
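For example, a minimal Dockerfile built on one of these images could look like the sketch below (assumptions: your dependencies are in `package.json` and your entry point is `main.js`):

```dockerfile
FROM apify/actor-node-chrome

# Copy package.json first so Docker can cache the dependency layer.
COPY package.json ./
RUN npm install --quiet --only=prod --no-optional

# Copy the rest of the actor's source code.
COPY . ./

CMD ["node", "main.js"]
```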
My Dockerfile. Chrome is installed, and Puppeteer works well on its own, but not through the crawler wrapper.
FROM node:12-slim
ARG project_root=.
RUN apt-get update \
&& apt-get install -y wget gnupg \
&& wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
&& sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
&& apt-get update \
&& apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst fonts-freefont-ttf libxss1 \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
# ca-certificates (for https), additional libraries, tools and fonts.
# Note: apt-get update must run again here, because the package lists
# were removed by the previous RUN instruction.
RUN apt-get update \
&& apt-get install -yq ca-certificates \
libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 \
wget xdg-utils fonts-liberation \
&& rm -rf /var/lib/apt/lists/*
USER node
RUN mkdir -p /home/node/app && \
chown -R node:node /home/node/app && \
mkdir /home/node/app/node_modules && \
chown -R node:node /home/node/app/node_modules && \
mkdir /home/node/.npm-global && \
chown -R node:node /home/node/.npm-global
ENV PATH=/home/node/.npm-global/bin:$PATH
ENV NPM_CONFIG_PREFIX=/home/node/.npm-global
WORKDIR /home/node/app
COPY ${project_root}/browserless-rest-api /home/node/app
RUN npm install --quiet --no-progress --global npm@latest
RUN npm install --quiet --no-progress --global nodemon
RUN npm install --quiet --no-progress --global
COPY --chown=node:node . .
EXPOSE 8080
CMD ["nodemon", "--legacy-watch", "server.js"]
Is it possible to set browserWSEndpoint for the Puppeteer launcher?
I'm not sure why Crawler would not work. Could you provide more details?
Regarding browserWSEndpoint: yes, it will work. You'll need to provide a custom `launchPuppeteerFunction`, call `puppeteer.connect()` there, and return its return value (the Browser instance) from the `launchPuppeteerFunction`.
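A sketch of what that could look like (the WebSocket endpoint URL is a placeholder for your own browserless/remote Chrome instance, and `launchPuppeteerFunction` is the crawler option available in the 0.x SDK used here):

```javascript
const Apify = require('apify');
const puppeteer = require('puppeteer');

const crawler = new Apify.PuppeteerCrawler({
    // Instead of launching a new browser, connect to an already running one
    // and return the Browser instance to the crawler.
    launchPuppeteerFunction: async () => {
        return puppeteer.connect({
            // Placeholder endpoint - replace with your remote browser's URL.
            browserWSEndpoint: 'ws://browserless:3000',
        });
    },
    handlePageFunction: async ({ page, request }) => {
        // ... your scraping logic
    },
});
```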
Hi, thanks for the reply. I have another problem now. If I run the following script for the first time, it's fine. If I run the script again, no page is scraped.
Script:
const Apify = require('apify');

Apify.main(async () => {
    const sources = [
        'https://apify.com/store?category=TRAVEL',
        'https://apify.com/store?category=ECOMMERCE',
        'https://apify.com/store?category=ENTERTAINMENT',
    ];
    const requestList = await Apify.openRequestList('categories', sources);
    const requestQueue = await Apify.openRequestQueue();
    const crawler = new Apify.CheerioCrawler({
        maxRequestsPerCrawl: 50,
        requestList,
        requestQueue,
        handlePageFunction: async ({ $, request }) => {
            console.log(`Processing ${request.url}`);
            // This is our new scraping logic.
            if (request.userData.detailPage) {
                const urlArr = request.url.split('/').slice(-2);
                const results = {
                    url: request.url,
                    uniqueIdentifier: urlArr.join('/'),
                    owner: urlArr[0],
                    title: $('header h1').text(),
                    description: $('header p[class^=Text__Paragraph]').text(),
                    lastRunDate: new Date(
                        Number(
                            $('time')
                                .eq(1)
                                .attr('datetime'),
                        ),
                    ),
                    runCount: Number(
                        $('ul.stats li:nth-of-type(3)')
                            .text()
                            .match(/\d+/)[0],
                    ),
                };
                console.log('RESULTS', results);
            }
            // Only enqueue new links from the category pages.
            if (!request.userData.detailPage) {
                await Apify.utils.enqueueLinks({
                    $,
                    requestQueue,
                    selector: 'div.item > a',
                    baseUrl: request.loadedUrl,
                    transformRequestFunction: req => {
                        req.userData.detailPage = true;
                        return req;
                    },
                });
            }
        },
    });
    await crawler.run();
});
Result of the first code execution:
INFO System info {"apifyVersion":"0.21.3","apifyClientVersion":"0.6.0","osType":"Linux","nodeVersion":"v12.18.2"}
WARN Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/home/node/app/apify_storage"
INFO CheerioCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to 498 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
INFO CheerioCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
Processing https://apify.com/store?category=TRAVEL
Processing https://apify.com/store?category=ECOMMERCE
Processing https://apify.com/store?category=ENTERTAINMENT
Processing https://apify.com/drobnikj/crawler-google-places
Processing https://apify.com/dtrungtin/airbnb-scraper
Processing https://apify.com/maxcopell/tripadvisor
Processing https://apify.com/eaglejohn/booking-scraper-copy
Processing https://apify.com/dtrungtin/booking-scraper
Processing https://apify.com/lukaskrivka/foursquare-reviews
Processing https://apify.com/vaclavrut/amazon-crawler
Processing https://apify.com/jakubbalada/content-checker
Processing https://apify.com/tugkan/aliexpress-scraper
Processing https://apify.com/jaroslavhejlek/kickstarter-search
Processing https://apify.com/scaleleap/zine-not-amazon-scraper
Processing https://apify.com/emastra/google-shopping-scraper
Processing https://apify.com/lukaskrivka/images-download-upload
Processing https://apify.com/tugkan/asos-scraper
Processing https://apify.com/emastra/actor-autotrader-scraper
Processing https://apify.com/emastra/hm-scraper
Processing https://apify.com/mihails/amazon-bestsellers-scraper
Processing https://apify.com/vaclavrut/alza-cz
Processing https://apify.com/vaclavrut/mall-cz
Processing https://apify.com/emastra/forever21-scraper
Processing https://apify.com/trudax/actor-nordstrom-scraper
Processing https://apify.com/petr_cermak/mironet-scraper
Processing https://apify.com/bernardo/youtube-scraper
Processing https://apify.com/tugkan/gutenberg-scraper
Processing https://apify.com/dtrungtin/imdb-scraper
Processing https://apify.com/sergeylukin/steam-puppeteer
Processing https://apify.com/vaclavrut/cernyrytir
Processing https://apify.com/c_inconnu/deezer-playlist-history
Processing https://apify.com/tugkan/edx-scraper
INFO CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO CheerioCrawler: Final request statistics: {"avgDurationMillis":734,"perMinute":170,"finished":32,"failed":0,"retryHistogram":[32]}
Result of the repeated code execution:
INFO System info {"apifyVersion":"0.21.3","apifyClientVersion":"0.6.0","osType":"Linux","nodeVersion":"v12.18.2"}
WARN Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/home/node/app/apify_storage"
INFO CheerioCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to 498 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
INFO CheerioCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
INFO CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO CheerioCrawler: Final request statistics: {"avgDurationMillis":null,"perMinute":0,"finished":0,"failed":0,"retryHistogram":[]}
The same behavior occurs with PuppeteerCrawler. And there are other errors:
INFO Launching Puppeteer {"args":["--no-sandbox","--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"],"timeout":15000,"headless":false,"defaultViewport":{"width":1366,"height":768}}
ERROR PuppeteerCrawler:PuppeteerPool: Browser launch failed {"id":2}
Error: Failed to launch the browser process!
Fontconfig warning: "/etc/fonts/fonts.conf", line 100: unknown element "blank"
[1224:1224:0727/204459.484649:ERROR:browser_main_loop.cc(1469)] Unable to open X display.
[0727/204459.492981:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly
[0727/204459.493089:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly
TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
at onClose (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
at Interface.<anonymous> (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:183:68)
at Interface.emit (events.js:327:22)
at Interface.close (readline.js:416:8)
at Socket.onend (readline.js:194:10)
at Socket.emit (events.js:327:22)
at endReadableNT (_stream_readable.js:1221:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
ERROR PuppeteerCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.google-analytics.com","retryCount":3}
Error: Failed to launch the browser process!
Fontconfig warning: "/etc/fonts/fonts.conf", line 100: unknown element "blank"
[1224:1224:0727/204459.484649:ERROR:browser_main_loop.cc(1469)] Unable to open X display.
[0727/204459.492981:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly
[0727/204459.493089:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly
TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
at onClose (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
at Interface.<anonymous> (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:183:68)
at Interface.emit (events.js:327:22)
at Interface.close (readline.js:416:8)
at Socket.onend (readline.js:194:10)
at Socket.emit (events.js:327:22)
at endReadableNT (_stream_readable.js:1221:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
and:
INFO Launching Puppeteer {"args":["--no-sandbox","--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"],"timeout":15000,"headless":false,"defaultViewport":{"width":1366,"height":768}}
ERROR PuppeteerCrawler:PuppeteerPool: Browser launch failed {"id":3}
Error: Failed to launch the browser process!
Fontconfig warning: "/etc/fonts/fonts.conf", line 100: unknown element "blank"
[1795:1795:0727/204613.818868:ERROR:browser_main_loop.cc(1469)] Unable to open X display.
TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
at onClose (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
at ChildProcess.<anonymous> (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:184:79)
at ChildProcess.emit (events.js:327:22)
at Process.ChildProcess._handle.onexit (internal/child_process.js:275:12)
Request https://news.ycombinator.com/ failed too many times
INFO PuppeteerCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO PuppeteerCrawler: Final request statistics: {"avgDurationMillis":null,"perMinute":0,"finished":0,"failed":1,"retryHistogram":[null,null,null,1]}
ERROR PuppeteerCrawler:PuppeteerPool: Cannot close the browsers.
Error: Failed to launch the browser process!
Fontconfig warning: "/etc/fonts/fonts.conf", line 100: unknown element "blank"
[1756:1756:0727/204613.506537:ERROR:browser_main_loop.cc(1469)] Unable to open X display.
[0727/204613.517190:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly
[0727/204613.517401:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly
TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
at onClose (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
at Interface.<anonymous> (/home/node/app/node_modules/apify/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:183:68)
at Interface.emit (events.js:327:22)
at Interface.close (readline.js:416:8)
at Socket.onend (readline.js:194:10)
at Socket.emit (events.js:327:22)
at endReadableNT (_stream_readable.js:1221:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
Crawler finished.
See the Getting Started guide. You need to run it with `apify run -p`, or delete the contents of `./apify_storage` after each run.
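Assuming the default local storage location shown in the logs above (no `APIFY_LOCAL_STORAGE_DIR` set), purging the persisted state between runs can be as simple as:

```shell
# Remove persisted request queue/list state so the next run starts fresh.
# -f makes this a no-op when the directory does not exist yet.
rm -rf ./apify_storage
```

Adjust the path if you set `APIFY_LOCAL_STORAGE_DIR` to something else.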
Regarding the Puppeteer errors, please see the TROUBLESHOOTING link in the error messages, or use one of the pre-configured Docker images I linked above. It looks like your Dockerfile does not include all the necessary libraries, which is not an issue with Apify.
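Also note the `Unable to open X display` errors: the launch log above shows `"headless":false`, which cannot work in a container without an X server or XVFB. If you don't switch to the XVFB image, a possible fix (a sketch using the 0.x SDK option) is to force headless mode:

```javascript
const Apify = require('apify');

const crawler = new Apify.PuppeteerCrawler({
    // Without an X server/XVFB in the container, a headful browser
    // cannot start, so run Chrome headless instead.
    launchPuppeteerOptions: { headless: true },
    // ... the rest of your crawler options
});
```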
Thanks, I will try it.
I'll close this since there's no bug. Feel free to continue the conversation if you have any more issues running the SDK.
Hi, I have a problem running the spider. I also tried the basic crawler example https://sdk.apify.com/docs/examples/basic-crawler (new Apify.BasicCrawler) and the Puppeteer example https://sdk.apify.com/docs/examples/puppeteer-crawler (new Apify.PuppeteerCrawler). Neither works. If I try await Apify.launchPuppeteer, it works.
Apify is running inside a Docker container.