apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Crawler not finished / not resolving #505

Closed andyfo closed 4 years ago

andyfo commented 4 years ago

I've been testing Apify extensively recently and I've noticed a strange behavior: the crawler sometimes doesn't stop/end properly when maxRequestsPerCrawl is reached and there are still requests being processed. It just never resolves.

I put a logger into the _maybeFinish method and it goes like this:

INFO: BasicCrawler: Crawler reached the maxRequestsPerCrawl limit of 200 requests and will shut down soon. Requests that are in progress will be allowed to finish.
_maybeFinish: _currentConcurrency = 2
_maybeFinish: _currentConcurrency = 1

And then silence. After the maxRequestsPerCrawl INFO message, the _maybeFinish polling stops at 1 and then nothing happens. It never gets past await crawler.run();.

Any idea what I could be doing wrong? 🤔

Thanks!

mtrunkat commented 4 years ago

Do you use Basic, Cheerio or Puppeteer crawler?

andyfo commented 4 years ago

Puppeteer

petrpatek commented 4 years ago

Could you please share your configuration? I suspect there is a very long timeout and the request gets stuck, so it does not time out in a reasonable amount of time. The next question would then be why it gets stuck.
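
One way to rule that out (a minimal sketch, assuming the 0.x PuppeteerCrawler options gotoTimeoutSecs and handlePageTimeoutSecs; requestQueue and handlePageFunction stand in for your own setup) is to set the timeouts explicitly so a stuck request fails fast instead of hanging:

const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    handlePageFunction,
    // Fail a stuck navigation after 30 seconds instead of a long default.
    gotoTimeoutSecs: 30,
    // Abort handlePageFunction if it runs longer than 60 seconds.
    handlePageTimeoutSecs: 60,
});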

tiany4 commented 4 years ago

Unlike OP, I didn't set maxRequestsPerCrawl, but I am experiencing the same issue with only certain sites, when the requestQueue contains at least 1,000 requests. It is hard to reproduce because it only happens occasionally. The crawler processes all the requests but never moves past await crawler.run(). There are no logs at all.

My initial guess was that this is related to some race condition or timeout issue, but I couldn't get further than that.

Here's my crawler definition:

const crawler = new Apify.PuppeteerCrawler({
    requestQueue: this.eventQueue,
    handlePageFunction: eventExtract,
    puppeteerPoolOptions: {
        maxOpenPagesPerInstance: 5
    },
    handleFailedRequestFunction: async ({ request }) => {
        console.log(`Request ${request.url} failed 4 times`);
    },
    gotoFunction: async ({ page, request }) => {
        await page.setRequestInterception(true);
        page.on('request', async interceptedRequest => {
            const type = interceptedRequest.resourceType();
            if (type === 'image' || type === 'stylesheet' || type === 'font') interceptedRequest.abort();
            else interceptedRequest.continue();
        });
        return page.goto(request.url, {
            timeout: 60000
        });
    },
});
mnmkng commented 4 years ago

@tiany4 How long does it get stuck for? There are some really long timeouts (~10 mins) that should tear down everything in case something goes wrong. Did it ever get past that amount of time?

tiany4 commented 4 years ago

@mnmkng It gets stuck indefinitely. I had left my computer on overnight and it was stuck for 12+ hours.

mnmkng commented 4 years ago

Thanks, that will help pinpoint the issue.

tiany4 commented 4 years ago

No problem. Please let me know if you need more info regarding my code.

mnmkng commented 4 years ago

We made quite a lot of unrelated changes lately. Could you try with apify@0.16.2-beta.8? Also, any chance you'd have a log or full code example?

tiany4 commented 4 years ago

I can try that, yes. However, like I mentioned, this is a rare occurrence. I am scraping Bands in Town. I divided the site into 100 parts/runs, and this issue occurs roughly once in every 10-20 runs.

Attaching the source code below, with some mongodb info removed.

if (process.env.local) require("dotenv").config();
const MongoClient = require("mongodb").MongoClient;
const requestPromise = require("request-promise");
const moment = require("moment-timezone");
const Promise = require("bluebird");
global.Promise = Promise;
const Apify = require("apify");
const Logger = require("./logger");
const { log } = Apify.utils;
log.setLevel(log.LEVELS.ERROR);
const METADATA = JSON.parse(process.env.METADATA);
let region = METADATA.market;
const waitFor = ms => new Promise(r => setTimeout(r, ms));
const now = moment().format();
const DEBUG = process.env.DEBUG === "true" ? true : false;

let errorCount = 0, eventCount = 0, totalEvents = 0, mongoUrl, logger, imageErrorCount = 0, mongoConnection, DAYS_OUT;

Apify.main(async () => {

    // Create new Logger instance
    logger = new Logger({
        instance_id: process.env.INSTANCE_ID || 'localhost',
        name: process.env.EXTRACT || 'test-run',
        environment: process.env.NODE_ENV,
        debug: DEBUG,
        days_out: process.env.DAYS_OUT ? DAYS_OUT = process.env.DAYS_OUT : DAYS_OUT = 90,
        run_id: process.env.RUN_ID,
        source_token: process.env.SOURCE_TOKEN,
        root_url: process.env.ROOT_URL
    });

    // Call Logger Init functions to add logger transport to loggly and establish mongoDB connection to runs collection
    await logger.init();
    process.env.local ? (logger.log = console.log) : (console.log = function () { });

    logger.log(`Beginning load in region ${region}`);

    // Establish events and venues collection connections
    mongoConnection = await MongoClient.connect(mongoUrl, { useUnifiedTopology: true });
    const db = mongoConnection.db('bandsInTownDB');
    const eventsCollection = db.collection('events');

    // Create event queue and crawler logic
    this.eventQueue = await Apify.openRequestQueue('bandsInTownEvents');
    const eventCrawler = new Apify.PuppeteerCrawler({
        requestQueue: this.eventQueue,
        handlePageFunction: eventExtract,
        puppeteerPoolOptions: {
            maxOpenPagesPerInstance: 5
        },
        handleFailedRequestFunction: async ({ request }) => {
            if (DEBUG) logger.log(`Request ${request.url} failed 4 times`);
        },
        gotoFunction: async ({ page, request }) => {
            await page.setRequestInterception(true);
            page.on('request', async interceptedRequest => {
                const type = interceptedRequest.resourceType();
                if (type === 'image' || type === 'stylesheet' || type === 'font') interceptedRequest.abort();
                else interceptedRequest.continue();
            });
            return page.goto(request.url, {
                timeout: 60000
            });
        },
    });

    // query mongoDB for all events to transform
    const query = [
        {
            '$match': {
                '$and': [
                    { 'event.dates.startDate': { '$gt': now } },
                    { 'run_id': logger.run_id }
                ]
            }
        }
    ];

    let results = await eventsCollection.aggregate(query, { allowDiskUse: true }).toArray();
    await waitFor(10000);
    totalEvents = results.length;
    logger.log(`${totalEvents} events pulled.`);
    await mongoConnection.close();
    logger.log('Mongo connection closed.');

    // Add event URLs to eventCrawler, attaching event data as userData
    await Promise.mapSeries(results, async event => {
        await this.eventQueue.addRequest({ url: event.url, userData: event });
    });
    logger.log(`Event queue finished processing. Running crawler.`);
    await eventCrawler.run();
    logger.log(`Finishing run...`);
    await waitFor(60000);
    logger.log(`Transform finished with ${errorCount} errors and ${eventCount} out of ${totalEvents} expected events. There are ${imageErrorCount} events without images.`);

    // Update the runObject with successfulLoads and totalLoads
    await logger.updateRun({ load_count: eventCount });
    await logger.updateRun({ total_load_count: totalEvents });  

    process.exit();
});

const eventExtract = async ({ page, request }) => {

    let cleanEvent;
    let scrapedJSONs, parsedJSON;

    // scrape JSONs and find the correct one to parse
    try {
        scrapedJSONs = await page.$$eval('script[type="application/ld+json"]', a => a.map(el => el.innerText));
    } catch(e) {if(DEBUG) logger.log(`JSON scraping error: ${e}`);}

    scrapedJSONs.forEach(json => {
        json = JSON.parse(json);
        if (json['@type'] === 'MusicEvent') parsedJSON = json;
    });

    // exit if no JSON found
    if (!parsedJSON) return;

    // location info; coords are not included because they're inaccurate
    cleanEvent.place.location = {
        'street': parsedJSON.location.address.streetAddress,
        'city': parsedJSON.location.address.addressLocality,
        'region': parsedJSON.location.address.addressRegion,
        'postcode': parsedJSON.location.address.postalCode ? parsedJSON.location.address.postalCode : null,
        'country': parsedJSON.location.address.addressCountry
    };

    // post to db
    const options = {
        headers: {
            Authorization: `Bearer ${logger.source_token}`,
        },
        method: 'POST',
        uri: `${logger.root_url}/event`,
        body: cleanEvent,
        json: true // Automatically stringifies the body to JSON
    };

    let result = await requestPromise(options)
        .then(parsedBody => {
            eventCount++;
            if (DEBUG) logger.log(`${cleanEvent.event.name} posted! ${eventCount} out of ${totalEvents} have been loaded.`);
        })
        .catch(err => {
            errorCount++;
            if (DEBUG) logger.log(err.message, 'warn', cleanEvent);
        });
    return result;
};
tiany4 commented 4 years ago

0.16.2-beta.8 did not resolve the issue; it just happened on a recent run with about 670 total requests in the requestQueue. One thing I should mention: most of the time, these runs are on an AWS EC2 instance. However, I have had the same problem locally, so I don't believe that to be the cause.

mnmkng commented 4 years ago

Thanks for the further info @tiany4. Sadly, with the complexity of your code and the random appearance of the issue, I don't think we'll be able to reproduce and fix it anytime soon, unless we happen upon it by chance. There are tens of thousands of runs on the Apify platform every day and we have no other reports of similar behavior. We'll keep an eye out.

mnmkng commented 4 years ago

Might be related to this, though: https://github.com/GoogleChrome/puppeteer/issues/5031
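
If request interception is the culprit, one workaround to try (a sketch, assuming the SDK helper Apify.utils.puppeteer.blockRequests is available in the installed version; the URL patterns are illustrative) is to let the SDK block heavy resources instead of wiring up page.setRequestInterception() manually in gotoFunction:

const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    handlePageFunction,
    gotoFunction: async ({ page, request }) => {
        // Block images, styles and fonts without a page.on('request') handler,
        // which the linked Puppeteer issue suggests can stall navigation.
        await Apify.utils.puppeteer.blockRequests(page, {
            urlPatterns: ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.css', '.woff', '.woff2'],
        });
        return page.goto(request.url, { timeout: 60000 });
    },
});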

tiany4 commented 4 years ago

Understood. That other issue does not look relevant unfortunately. I'll keep an eye on this issue and post if I have any new info.

cspeer commented 4 years ago

Hi,

I have the same problem using CheerioCrawler and with minimal setup. I tested it over the weekend and it got stuck for 48+ hours. Here is some simple code to reproduce:

const Apify = require('apify');

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  await requestQueue.addRequest({ url: 'https://www.4komma5.de/' });
  const crawler = new Apify.CheerioCrawler({
    requestQueue,
    async handlePageFunction ({ request, $ }) {
      console.log(`Processing ${request.url}...`);

      await Apify.utils.enqueueLinks({
        $,
        selector: 'a[href]:not([rel="nofollow"])',
        requestQueue,
        pseudoUrls: [new Apify.PseudoUrl('https://www.4komma5.de/[.*]')],
        baseUrl: 'https://www.4komma5.de/'
      });
    },
    minConcurrency: 3,
    maxConcurrency: 10,
    maxRequestRetries: 0,
    maxRequestsPerCrawl: 10000,
    handleFailedRequestFunction ({ request }) {
      Apify.utils.log.error('CheerioCrawler: Request failed and reached maximum retries', request);
    }
  });

  await crawler.run();

  console.log('Crawler finished.');
});

This should visit about 8,600 URLs, so I set maxRequestsPerCrawl to 10000 just in case. You also might have to run it 2-3 times since it might not occur on the first run. This routine takes about 40-45 minutes on our servers to complete.

Hope this helps.

// edit: Symptoms are:

tiany4 commented 4 years ago

@cspeer Hey, thanks for the input! I'm hoping that these are indeed related issues since cheerio works differently than puppeteer, but have you found any workaround for the time being?

Edit: I wanted to add that I have exactly the same symptoms too.

cspeer commented 4 years ago

Unfortunately not. I originally thought it might be related to retries, so disabling retries (maxRequestRetries = 0) seemed to work at first, but then it turned out that it doesn't.

tiany4 commented 4 years ago

Ahh that's too bad. I also tried maxOpenPagesPerInstance: 5 in puppeteerPoolOptions, which solved another issue, but unfortunately didn't help this one.

tiany4 commented 4 years ago

@mnmkng I got some additional info.

Finally had a local headful run get stuck at the end. Here is what happened:

1) There was one Chromium browser open with one blank tab. Closing the browser manually gave ERROR: PuppeteerPool: Puppeteer sent "disconnect" event. Maybe it crashed??? {"id":2}

2) Nothing happened after that point, so I forced an exit. Upon deleting apify_storage and rerunning the script, I got a bunch of ENOENT errors like this:

ERROR: AutoscaledPool: runTaskFunction failed.
  Error: ENOENT: no such file or directory, scandir 'C:\Users\tpstc\Documents\apify-pointslocal\apify_storage\request_queues\pointslocalEvents\pending'
ERROR: AutoscaledPool: runTaskFunction failed.
  Error: ENOENT: no such file or directory, scandir 'C:\Users\tpstc\Documents\apify-pointslocal\apify_storage\request_queues\pointslocalEvents\pending'
ERROR: AutoscaledPool: runTaskFunction failed.
  Error: ENOENT: no such file or directory, scandir 'C:\Users\tpstc\Documents\apify-pointslocal\apify_storage\request_queues\pointslocalEvents\pending'
......
ERROR: The function passed to Apify.main() threw an exception:
  Error: ENOENT: no such file or directory, scandir 'C:\Users\tpstc\Documents\apify-pointslocal\apify_storage\request_queues\pointslocalEvents\pending'
Run: npm.cmd start

> apify-pointslocal@1.0.0 start C:\Users\tpstc\Documents\apify-pointslocal
> node main.js

This made me think that the issue might be related to the crawler thinking there are still pending requests when there are none left.

cspeer commented 4 years ago

While I don't see what @tiany4 is reporting... shouldn't those errors be expected, since you deleted apify_storage and that's where the state is saved? That's really weird.

tiany4 commented 4 years ago

It is a fresh run, so apify_storage being nonexistent shouldn't be an issue. But this is the only time I've seen this error, so it might not be related to the issue at all. The empty browser instance(s), on the other hand, are most likely related.

tiany4 commented 4 years ago

Something else I noticed: while it's stuck at the end, AutoscaledPool is constantly scaling up.

DEBUG: AutoscaledPool: scaling up {"oldConcurrency":56,"newConcurrency":59,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.071},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
DEBUG: PuppeteerPool: Retired browsers count {"count":0}
DEBUG: AutoscaledPool: scaling up {"oldConcurrency":59,"newConcurrency":62,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.069},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
INFO: AutoscaledPool state {"currentConcurrency":58,"desiredConcurrency":62,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.069},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
mnmkng commented 4 years ago

@tiany4 @cspeer Are you both guys using Windows?

The scaling up is weird. That would mean there are tasks to be run. currentConcurrency reflects the actual number of tasks in progress, so if it is going up, there must be something to process.

cspeer commented 4 years ago

No, it's all Unix/Linux based on our machines.

tiany4 commented 4 years ago

I'm using Git Bash on Windows. However, the AWS instances we run these scripts on are Ubuntu. The same issues happen on both systems.

Edit: and yeah, I didn't leave the script running for too long, but by the time I terminated it, currentConcurrency had scaled up to around 500 and was still climbing.

tiany4 commented 4 years ago

@mnmkng Is there a way to console log some debug info regarding tasks?

mnmkng commented 4 years ago

Depends on what info you're interested in. There's no special logging in AutoscaledPool, but you can always wrap the existing functions after the crawler starts, because autoscaledPool is available in the handlePageFunction.

let isWrapped = false;

async function handlePageFunction({ ..., autoscaledPool }) {
// ...

  if (!isWrapped) {
     const originalRunTaskFunction = autoscaledPool.runTaskFunction.bind(autoscaledPool);
     autoscaledPool.runTaskFunction = async () => {
       console.log('Before');
       await originalRunTaskFunction();
       console.log('After');
     }
     isWrapped = true;
  }

}

It's not a best practice, but it gets the job done. We want to rework most of our logging to make it more customizable and granular, but since it's closely tied with the Apify Platform, it'll take some time before we get there.

surfshore commented 4 years ago

Hi, I have the same problem. In my case there's a 50%+ chance that this scenario will happen. I tried debugging a little.

I think the cause is pendingCount in the RequestQueueLocal class. The finish condition is "this.pendingCount === this.inProgressCount" in requestQueue.isEmpty(); when the crawler did not finish, pendingCount had a negative value.

If I try to crawl 6,000 requests, it looks like this:

start: pendingCount: 6000, inProgressCount: 0
(maybe) finish: pendingCount: -246, inProgressCount: 0, _handledCount: 6246, progressQueueCount: 0

The only code that increases pendingCount is in addRequest(), so I think the problem is that this code is not reached. Perhaps "requestCopy.id" is not included in "this.requestIdToQueueOrderNo[]"?

I hope I can be of any help to you.
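
To illustrate the hypothesis (a simplified model only, not the actual RequestQueueLocal code): if addRequest() ever skips the pendingCount increment while the request is still processed and marked as handled, the counter goes negative and the finish condition never becomes true again:

// Simplified, illustrative model of the suspected bookkeeping bug.
class FakeLocalQueue {
    constructor() {
        this.pendingCount = 0;
        this.inProgressCount = 0;
    }
    addRequest(counted) {
        if (counted) this.pendingCount++; // the increment that may be skipped
    }
    markHandled() {
        this.pendingCount--; // the decrement always happens
    }
    isEmpty() {
        return this.pendingCount === this.inProgressCount; // finish condition
    }
}

const q = new FakeLocalQueue();
q.addRequest(true);   // pendingCount = 1
q.markHandled();      // pendingCount = 0 -> isEmpty() is true, crawler can finish
q.addRequest(false);  // increment skipped for some reason
q.markHandled();      // pendingCount = -1
console.log(q.isEmpty()); // false forever -> the crawler never resolves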


mnmkng commented 4 years ago

@surfshore Thanks for the additional info. I know that this issue's been hanging here for a while, but I think we'll be able to start working on it in the next week or so.

cspeer commented 4 years ago

What @surfshore is saying makes a lot of sense, in my opinion, since the behaviour we're seeing fits this description to a t!

tiany4 commented 4 years ago

I'm guessing that the issue is caused by a race condition somewhere, since I have been having some success by greatly limiting the crawling speed on problematic sites. I did also notice that the total handled count is sometimes higher than the original # of requests, in which case switching to SQLite could very possibly be the solution.

mnmkng commented 4 years ago

Hey guys, it's been some time but we finally have a completely new RequestQueue ready for testing. It is based on the SQLite database and we would like to invite you to test it and see if it solves your problem. To get it, simply use

"apify": "dev" or "apify": "0.20.3-dev.0" in your package.json dependencies.

Thanks.

cspeer commented 4 years ago

Alright, we have it up and running on our servers. Let's give it a few hours and I can report back. First impression is good though!

surfshore commented 4 years ago

Thank you for making the revisions. We are currently testing and everything is going well so far. Before starting the test, though, we got an error: if there is no data in the requestQueue, errors may occur.

Case 1: adding from a CSV to a requestList and starting with the requestQueue empty.

const requestList = new Apify.RequestList({
    sources: sources, // csv data
    persistStateKey: `${listId}-state`,
    persistRequestsKey: `${listId}-sources`,
});
await requestList.initialize();
const requestQueue = await Apify.openRequestQueue(queueId);
await requestQueue.drop();
ERROR The function passed to Apify.main() threw an exception:
  TypeError: Cannot destructure property `handledRequestCount` of 'undefined' or 'null'.
      at RequestQueue.handledCount (/app/node_modules/apify/build/request_queue.js:592:15)
      at async BasicCrawler._loadHandledRequestCount (/app/node_modules/apify/build/crawlers/basic_crawler.js:445:41)
      at async BasicCrawler.run (/app/node_modules/apify/build/crawlers/basic_crawler.js:263:9)
      at async CheerioCrawler.run (/app/node_modules/apify/build/crawlers/cheerio_crawler.js:384:9)
      at async /app/spider/spider1.js:79:5
error Command failed with exit code 91.

Case 2: calling requestQueue.drop() and then adding from the CSV to the requestQueue. If requestQueue.drop() is commented out, it's fine.

const requestQueue = await Apify.openRequestQueue(queueId);
await requestQueue.drop();
for await (const request of this.requestGenerator(data)) {
    await requestQueue.addRequest(request);
}
ERROR The function passed to Apify.main() threw an exception:
  Error: Request queue with id: 1 does not exist.
      at RequestQueues.<anonymous> (/app/node_modules/@apify/storage-local/src/request_queues.js:370:27)
      at RequestQueues.sqliteTransaction [as addRequestTransaction] (/app/node_modules/better-sqlite3/lib/transaction.js:58:24)
      at RequestQueues.addRequest (/app/node_modules/@apify/storage-local/src/request_queues.js:126:21)
      at RequestQueue.addRequest (/app/node_modules/apify/build/request_queue.js:249:54)
      at SpiderRequest.makeRequests (/app/lib/spiderRequest.js:173:38)
mnmkng commented 4 years ago

Thanks a lot for the error reports @surfshore

I’ll look into the first one. The second one seems correct, no? If you first delete the queue then you obviously can’t add more requests to it. I’ll check how it behaved originally in this case. Perhaps a better error message would suffice?

Thinking about it. You’re deleting the queue in the first one too, so no wonder it can’t add more requests. Just the error is bad. Will fix that.
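
A sketch of the intended usage (assuming that drop() deletes the queue and a subsequent openRequestQueue() with the same name recreates it): reopen the queue after dropping it, then add requests to the new instance:

let requestQueue = await Apify.openRequestQueue(queueId);
await requestQueue.drop(); // deletes the queue and all of its data
requestQueue = await Apify.openRequestQueue(queueId); // get a fresh, empty queue
await requestQueue.addRequest({ url: 'https://example.com/' });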

cspeer commented 4 years ago

So, a first summary after running about 8 requestQueues that we either add URLs to manually or by discovering them from the HTML: no errors, no warnings, smooth sailing from start to finish 🎉 Thank you! Great job! I would give it a couple more days, because sometimes it takes a few runs to show its true colors, so to speak. I'll keep you posted!

mnmkng commented 4 years ago

Thank you for the kind words @cspeer and surely let us know if anything shows up on the radar.

surfshore commented 4 years ago

The test has been successfully completed. Thank you so much!

In the test, we first added 48,000 requests to the queue (requestQueue.addRequest()); requests were returned to the queue 3,000 times due to failed requests, and 46,000 new requests were added along the way (requestQueue.addRequest()), for a final total of about 97,000 processed requests.

@mnmkng, about the requestQueue.drop() above: in this test case I didn't need the old requests in the requestQueue and wanted to use requestQueue.drop() to empty it. I misunderstood requestQueue.drop(). It looks like it worked well before.

surfshore commented 4 years ago

The test for CheerioCrawler finished without any problem, but PuppeteerCrawler did not end. The requestQueue internal variables look fine, so it's probably a different matter. Once the loop occurs, the requests time out every time (the timeouts gradually increase). If we fetch the same pages at the same time from another process, we get responses without problems, so I don't think it's a proxy or web server problem (bot detection etc.). Please tell me if there is any information about that.

INFO  Crawler request statistics: {"avgDurationMillis":24422,"perMinute":26,"finished":1486,"failed":0,"retryHistogram":[1486]}
INFO  AutoscaledPool state {"currentConcurrency":20,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":0.928},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
INFO  Crawler request statistics: {"avgDurationMillis":24463,"perMinute":26,"finished":1492,"failed":0,"retryHistogram":[1492]}
INFO  AutoscaledPool state {"currentConcurrency":20,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":0.93},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
INFO  Crawler request statistics: {"avgDurationMillis":24495,"perMinute":25,"finished":1493,"failed":0,"retryHistogram":[1493]}
INFO  AutoscaledPool state {"currentConcurrency":20,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.019},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":0.931},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
(loop ...)
INFO  Crawler request statistics: {"avgDurationMillis":24737,"perMinute":25,"finished":1501,"failed":0,"retryHistogram":[1501]}
INFO  AutoscaledPool state {"currentConcurrency":20,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.039},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":0.932},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
ERROR BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://xxx/xxx1.html","retryCount":1,"id":"zt55hpRZejBRY9O"}
  TimeoutError: Navigation timeout of 30000 ms exceeded
(...)
INFO  Crawler request statistics: {"avgDurationMillis":24737,"perMinute":25,"finished":1501,"failed":0,"retryHistogram":[1501]}
INFO  AutoscaledPool state {"currentConcurrency":20,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":0.933},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
ERROR BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://xxx/xxx2.html","retryCount":1,"id":"sHfGkFAzcSqoW1f"}
  TimeoutError: Navigation timeout of 30000 ms exceeded
ERROR BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://xxx/xxx3.html","retryCount":1,"id":"LiMwxpppjy5YEoN"}
  TimeoutError: Navigation timeout of 30000 ms exceeded
(...)
INFO  Crawler request statistics: {"avgDurationMillis":24737,"perMinute":24,"finished":1501,"failed":0,"retryHistogram":[1501]}
INFO  AutoscaledPool state {"currentConcurrency":20,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.019},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":0.934},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
ERROR BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://xxx/xxx4.html","retryCount":1,"id":"Cr6l5eYWw0EPRee"}
  TimeoutError: Navigation timeout of 30000 ms exceeded
ERROR BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://xxx/xxx5.html","retryCount":1,"id":"izXNxZoQSOyaeDA"}
  TimeoutError: Navigation timeout of 30000 ms exceeded
(...)

Postscript

Changing the number of requests after which browsers are retired reduced the number of records that could be retrieved:

retireInstanceAfterRequestCount: 100 (default) => "finished": 1501
retireInstanceAfterRequestCount: 20 => "finished": 339
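
For context, retireInstanceAfterRequestCount is configured through puppeteerPoolOptions; a sketch (other options omitted, values taken from the test above):

const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    handlePageFunction,
    puppeteerPoolOptions: {
        // Retire each browser after this many requests. Lowering it from the
        // default of 100 to 20 reduced the finished count from 1501 to 339
        // before the crawler got stuck in the test above.
        retireInstanceAfterRequestCount: 20,
    },
});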

cspeer commented 4 years ago

So, here's a head scratcher:

Occasionally now I get

Request queue with id: 3 does not exist.

What I do is open a new request queue, check if it's empty, and if it is, add the URLs in a promise array:

  const requestQueue = await Apify.openRequestQueue(dealer.handle + '-refresh');

  if (await requestQueue.isEmpty()) {
    Helpers.log('Loading URLs');
    const items = await Models.Item
      .query()
      .select('url')
      .whereNotNull('url')
      .where('source', dealer.handle)
    ;

    await Promise.all(_.map(items, async (item) => {
      return requestQueue.addRequest({ url: item.url });
    }));
  } else {
    Helpers.log('Resuming');
  }

So the queue exists and is accessible at the if-statement, but then upon adding the URLs it throws Request queue with id: 3 does not exist. But like I said, only sometimes. Maybe it is a timing problem.

surfshore commented 4 years ago

The following error occurred during today's test run. (cheerioCrawler)

ERROR BasicCrawler: runTaskFunction error handler threw an exception. This places the crawler and its underlying storages into an unknown state and crawling will be terminated. This may have happened due to an internal error of Apify's API or due to a misconfigured crawler. If you are sure that there is no error in your code, selecting "Restart on error" in the actor's settings will make sure that the run continues where it left off, if programmed to handle restarts correctly.
  SqliteError: database is locked
      at RequestQueues.<anonymous> (/app/node_modules/@apify/storage-local/src/request_queues.js:388:39)
      at RequestQueues.sqliteTransaction [as updateRequestTransaction] (/app/node_modules/better-sqlite3/lib/transaction.js:58:24)
      at RequestQueues.updateRequest (/app/node_modules/@apify/storage-local/src/request_queues.js:167:21)
      at RequestQueue.reclaimRequest (/app/node_modules/apify/build/request_queue.js:401:54)
      at BasicCrawler._requestFunctionErrorHandler (/app/node_modules/apify/build/crawlers/basic_crawler.js:422:27)
      at BasicCrawler._runTaskFunction (/app/node_modules/apify/build/crawlers/basic_crawler.js:369:28)
      at runMicrotasks (<anonymous>)
      at processTicksAndRejections (internal/process/task_queues.js:94:5)
ERROR AutoscaledPool: runTaskFunction failed.
  SqliteError: database is locked
mnmkng commented 4 years ago

@surfshore

The test for cheerioCrawler finished without any problem, but puppeteerCrawler did not end.

It seems that your machine is CPU overloaded at that point. See the AutoscaledPool logs; they say that your CPU is overloaded almost 100% of the time (0.9xx and growing). This prevents the crawler from functioning properly. Could you please look into what might be causing this overload? Are you forcing high concurrency or are you letting the crawler scale by itself?

The following error occurred during today's test run. (cheerioCrawler)

I've never seen that one. I will take a look at what could be causing it. Are you possibly running multiple crawlers / processes that are using the same file system (i.e. the same database file)?

@cspeer

So, here's a head scratcher:

Does it fail at the very first request that you're adding or at some point in the middle of adding the requests?

cspeer commented 4 years ago

@surfshore Regarding the "database is locked" problem: It seems that this is caused by multiple running instances of the crawler accessing the same sqlite database file. I successfully worked around that by having my code create separate directories for each instance of the crawler and then setting process.env.APIFY_LOCAL_STORAGE_DIR so that every crawler has its own database.
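
A sketch of that workaround (the directory naming by process id is illustrative; the env var has to be set before the SDK touches its local storage):

// Give each crawler process its own local storage directory, so the SQLite
// request queue file is never shared between processes.
const path = require('path');
process.env.APIFY_LOCAL_STORAGE_DIR = path.join(__dirname, 'apify_storage', `crawler-${process.pid}`);

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    // ... build and run the crawler as usual
});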

cspeer commented 4 years ago

@mnmkng Sorry, it took some time for the error to reappear. So the answer to your question is: It happens in the middle of adding the requests.

mnmkng commented 4 years ago

@cspeer The only thing that comes to mind is that you're using the same DB from multiple processes and that the queue gets deleted from the DB by some other process. The ID 3 is also weird, because it implies there are two other queues in the DB which were created before this one.

cspeer commented 4 years ago

Yes, you might be right. The only crawler this happens with is special in the sense that it crawls a certain website over and over again; after finishing, it drops the queue and creates a new one with the same name as before. Every other crawler in our system uses the same code, except that after finishing it drops the queue and creates a new queue with a different name. So, do you think the queue name might be the problem here?

mnmkng commented 4 years ago

Ok, got it now. Thanks. It's most likely a bug in caching of request queue instances. Will fix today and release a new version.

mnmkng commented 4 years ago

@cspeer 0.20.3-dev.1 is out. Please let me know if it fixes your problem.

surfshore commented 4 years ago

@mnmkng, @cspeer Thank you for the good advice.

About "puppeteerCrawler did not end": parsing the page may be heavy. I will try to check up on that.

The settings are as follows:
minConcurrency: 10,
maxConcurrency: 30,
autoscaledPoolOptions: {
    desiredConcurrency: 20,
    maybeRunIntervalSecs: 0.01, // Is this bad?
},

"database is locked" problem. Due to the need for high concurrency, we split requests and run multiple crawlers at the same time. (cpu thread split) The QUEUE_ID of the requestQueue differs for each crawler. Do they conflict? If the DATASET_ID of the datasets are different, the directories are different and i thought they were the same. Is it better to split APIFY_LOCAL_STORAGE_DIR?

mnmkng commented 4 years ago

@surfshore

Due to the need for high concurrency, we split requests and run multiple crawlers at the same time. (cpu thread split)

The underlying database is not very good at high concurrency writes. So if you're adding requests into the queue from multiple processes, you might run into concurrency problems. Also, this is a first version, so it's not very well optimized for such multi-process scenarios.

Let me try a quick fix though and let's see if it'll help. If not, using a separate APIFY_LOCAL_STORAGE_DIR for each process might be the best solution.

The settings are as follows.

Unless you're running on a really powerful machine, a concurrency of 20 is not easily achievable with Puppeteer. Also, from our experience, increasing concurrency beyond a certain point does not speed up crawls, because it only prolongs the time individual requests take to process.

I suggest simply starting without any extra config and watching how the crawler scales automatically from the AutoscaledPool logs.

You can add the APIFY_LOG_LEVEL=DEBUG env var or call Apify.utils.log.setLevel(Apify.utils.log.LEVELS.DEBUG) before starting the crawler to get additional logging.

I would almost never configure maybeRunIntervalSecs. It only controls how often the pool will ping for new requests when it has nothing to do, so lowering it probably has no positive impact on performance, but setting it too low could have a negative one.
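
Putting that advice together, a minimal starting point might look like this (a sketch: no concurrency overrides, debug logging enabled so the AutoscaledPool's decisions are visible):

const Apify = require('apify');

Apify.utils.log.setLevel(Apify.utils.log.LEVELS.DEBUG);

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com/' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            // ... extract data here
        },
        // No minConcurrency / maxConcurrency / maybeRunIntervalSecs overrides:
        // let the AutoscaledPool scale based on memory, CPU and event loop load.
    });

    await crawler.run();
});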