apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

handlePageFunction only worked the first time I ran the script #333

Closed · nemo closed this issue 5 years ago

nemo commented 5 years ago

Hey folks,

Excited to properly give Apify a try. It worked literally once for me (and it worked well!), but every subsequent run of the script below just prints the following, which means handlePageFunction is never called:

```
WARNING: Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/Users/nimagardideh/Documents/workspace/poetry/backend/functions/apify_storage"
INFO: AutoscaledPool: Setting max memory of this run to 4096 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
INFO: AutoscaledPool state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":null}}}
INFO: BasicCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
Crawler finished.
```

Any ideas on how to debug here?

```
const Apify = require('apify');

Apify.main(async () => {
    // Create and initialize an instance of the RequestList class that contains the start URL.
    const requestList = new Apify.RequestList({
        sources: [
            { url: 'https://www.poetryfoundation.org/poets/browse#sort_by=recently_added&poet-birthdate=1951-present&preview=1&page=1' },
        ],
    });
    await requestList.initialize();

    // Apify.openRequestQueue() is a factory to get a preconfigured RequestQueue instance.
    const requestQueue = await Apify.openRequestQueue();

    // Create an instance of the PuppeteerCrawler class - a crawler
    // that automatically loads the URLs in headless Chrome / Puppeteer.
    const crawler = new Apify.PuppeteerCrawler({
        // The crawler will first fetch start URLs from the RequestList
        // and then the newly discovered URLs from the RequestQueue
        requestList,
        requestQueue,

        // Here you can set options that are passed to the Apify.launchPuppeteer() function.
        // For example, you can set "slowMo" to slow down Puppeteer operations to simplify debugging
        launchPuppeteerOptions: { slowMo: 500 },

        // Stop crawling after several pages
        maxRequestsPerCrawl: 10,

        // This function will be called for each URL to crawl.
        // Here you can write the Puppeteer scripts you are familiar with,
        // with the exception that browsers and pages are automatically managed by the Apify SDK.
        // The function accepts a single parameter, which is an object with the following fields:
        // - request: an instance of the Request class with information such as URL and HTTP method
        // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
        handlePageFunction: async ({ request, page }) => {
            console.log(`Processing ${request.url}...`);

            // A function to be evaluated by Puppeteer within the browser context.
            const pageFunction = ($posts) => {
                const data = [];

                // We're collecting the name and URL of each poet listed on the page.
                $posts.forEach(($post) => {
                    data.push({
                        name: $post.querySelector('.c-feature-hd a span').innerText,
                        href: $post.querySelector('.c-feature-hd a').href
                    });
                });

                return data;
            };

            const data = await page.$$eval('ol.c-vList_bordered_thorough li', pageFunction);

            console.log('data', data);
            // // Find a link to the next page and enqueue it if it exists.
            // const infos = await Apify.utils.puppeteer.enqueueLinks({
            //     page,
            //     requestQueue,
            //     selector: '.morelink',
            // });
            //
            // if (infos.length === 0) console.log(`${request.url} is the last page!`);
        },

        // This function is called if the page processing failed more than maxRequestRetries+1 times.
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed too many times`);
        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run();

    console.log('Crawler finished.');
});
```

nemo commented 5 years ago

Ah – it's because I wasn't using the storage system Apify has.
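
For context: the default request queue opened by Apify.openRequestQueue() is persisted under APIFY_LOCAL_STORAGE_DIR (the apify_storage path shown in the warning above), so a URL that was handled in one run stays marked as handled in the next, and the crawler shuts down immediately. Below is a minimal cleanup sketch for local runs – the file name, the storage location, and the idea of deleting only the request_queues folder are assumptions, not an official SDK API:

```
// purge-queues.js – a hypothetical helper, not part of the Apify SDK.
// Deleting the persisted default request queue makes the next run treat the
// start URL as brand new instead of "already handled".
const fs = require('fs');
const path = require('path');

// Assumption: storage lives in ./apify_storage next to where the crawler is
// started, as the warning in the log above suggests.
const storageDir = process.env.APIFY_LOCAL_STORAGE_DIR
    || path.join(process.cwd(), 'apify_storage');

// Remove only the request queues; datasets and key-value stores stay intact.
// fs.rmSync needs Node 14.14+; on older Node, delete the folder by hand.
fs.rmSync(path.join(storageDir, 'request_queues'), { recursive: true, force: true });

console.log(`Purged request queues in ${storageDir}`);
```

Deleting the whole apify_storage folder by hand has the same effect, at the cost of also losing any datasets or key-value stores written by previous runs.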