handlePageFunction only worked the first time I ran the script #333

Closed nemo closed 5 years ago

nemo commented 5 years ago

Hey folks,

Excited to properly give Apify a try. It worked literally once for me (and it worked well!), but every subsequent run I've done on the script below just prints the following (which means it never calls the handlePageFunction):

```
WARNING: Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/Users/nimagardideh/Documents/workspace/poetry/backend/functions/apify_storage"
INFO: AutoscaledPool: Setting max memory of this run to 4096 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
INFO: AutoscaledPool state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":null}}}
INFO: BasicCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
Crawler finished.
```

Any ideas on how to debug here?

```
const Apify = require('apify');

Apify.main(async () => {
    // Create and initialize an instance of the RequestList class that contains the start URL.
    const requestList = new Apify.RequestList({
        sources: [
            { url: 'https://www.poetryfoundation.org/poets/browse#sort_by=recently_added&poet-birthdate=1951-present&preview=1&page=1' },
        ],
    });
    await requestList.initialize();

    // Apify.openRequestQueue() is a factory to get a preconfigured RequestQueue instance.
    const requestQueue = await Apify.openRequestQueue();

    // Create an instance of the PuppeteerCrawler class - a crawler
    // that automatically loads the URLs in headless Chrome / Puppeteer.
    const crawler = new Apify.PuppeteerCrawler({
        // The crawler will first fetch start URLs from the RequestList
        // and then the newly discovered URLs from the RequestQueue
        requestList,
        requestQueue,

        // Here you can set options that are passed to the Apify.launchPuppeteer() function.
        // For example, you can set "slowMo" to slow down Puppeteer operations to simplify debugging
        launchPuppeteerOptions: { slowMo: 500 },

        // Stop crawling after several pages
        maxRequestsPerCrawl: 10,

        // This function will be called for each URL to crawl.
        // Here you can write the Puppeteer scripts you are familiar with,
        // with the exception that browsers and pages are automatically managed by the Apify SDK.
        // The function accepts a single parameter, which is an object with the following fields:
        // - request: an instance of the Request class with information such as URL and HTTP method
        // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
        handlePageFunction: async ({ request, page }) => {
            console.log(`Processing ${request.url}...`);

            // A function to be evaluated by Puppeteer within the browser context.
            const pageFunction = ($posts) => {
                const data = [];

                // Collect the name and link of each poet listed on the page.
                $posts.forEach(($post) => {
                    data.push({
                        name: $post.querySelector('.c-feature-hd a span').innerText,
                        href: $post.querySelector('.c-feature-hd a').href,
                    });
                });

                return data;
            };

            const data = await page.$$eval('ol.c-vList_bordered_thorough li', pageFunction);

            console.log('data', data);
            // // Find a link to the next page and enqueue it if it exists.
            // const infos = await Apify.utils.puppeteer.enqueueLinks({
            //     page,
            //     requestQueue,
            //     selector: '.morelink',
            // });
            // if (infos.length === 0) console.log(`${request.url} is the last page!`);
        },

        // This function is called if the page processing failed more than maxRequestRetries+1 times.
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed too many times`);
        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run();

    console.log('Crawler finished.');
});
```

nemo commented 5 years ago

Ah – it's because I wasn't using the storage system Apify has.
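
For anyone who lands here with the same symptom, here is a minimal sketch of one way to reset things between local runs. It assumes the cause is the default request queue left in `apify_storage` by the first run, which already lists the start URL as handled, so the crawler finds nothing to do. The helper name `purgeDefaultRequestQueue` is made up for illustration, and the `request_queues/default` layout is an assumption about how the SDK stores the queue locally, so check your own `apify_storage` folder before relying on it.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not part of the Apify SDK): deletes the locally
// persisted default request queue so the next run starts with a clean slate.
// Assumes the layout <storageDir>/request_queues/default.
function purgeDefaultRequestQueue(storageDir) {
    const queueDir = path.join(storageDir, 'request_queues', 'default');
    const existed = fs.existsSync(queueDir);
    // `force: true` makes this a no-op when the directory is already gone.
    fs.rmSync(queueDir, { recursive: true, force: true });
    return existed;
}

// Example usage before Apify.main(...), left commented out so that pasting
// this snippet does not delete anything by accident:
// purgeDefaultRequestQueue(process.env.APIFY_LOCAL_STORAGE_DIR || './apify_storage');
```

Deleting only the request queue keeps any datasets and key-value stores from earlier runs; removing the whole `apify_storage` folder by hand would reset everything.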