apify / crawlee

The database connection is not open #904

Closed · cspeer closed this 3 years ago

cspeer commented 3 years ago

Describe the bug

We just upgraded from 0.21.4 to 0.22.4. Now we sometimes get a "The database connection is not open" error when executing the following after the queue finishes:

const requestQueue = await Apify.openRequestQueue('some-queue-name');

Here is an example log output:

INFO  CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.

INFO  CheerioCrawler: Final request statistics: {"requestsFinished":6464,"requestsFailed":0,"retryHistogram":[6464],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":520,"requestsFinishedPerMinute":1035,"requestsFailedPerMinute":0,"requestTotalDurationMillis":3359734,"requestsTotal":6464,"crawlerRuntimeMillis":374896}

TypeError: The database connection is not open

Here's the full stack trace:

TypeError: The database connection is not open
  File "/home/****-crawler/releases/20210114184826/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 373, col 17, in RequestQueueEmulator._createTables
    this.db.prepare(`
  File "/home/****-crawler/releases/20210114184826/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 37, col 14, in new RequestQueueEmulator
    this._createTables();
  File "/home/****-crawler/releases/20210114184826/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 74, col 29, in RequestQueueClient._getEmulator
    this.emulator = new RequestQueueEmulator({
  File "/home/****-crawler/releases/20210114184826/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 84, col 18, in RequestQueueClient.get
    this._getEmulator().updateAccessedAtById(this.id);
  File "/home/****-crawler/releases/20210114184826/node_modules/apify/build/storages/storage_manager.js", line 97, col 53, in StorageManager._getOrCreateStorage
    const existingStorage = await storageClient.get();
  File "/home/****-crawler/releases/20210114184826/node_modules/apify/build/storages/storage_manager.js", line 52, col 46, in StorageManager.openStorage
    const storageObject = await this._getOrCreateStorage(idOrName, this.name, client);
  File "/home/****-crawler/releases/20210114184826/node_modules/apify/build/storages/request_queue.js", line 582, col 20, in Object.exports.openRequestQueue
    return manager.openStorage(queueIdOrName, options);
  File "/home/****-crawler/releases/20210114184826/spider/refresh.js", line 120, col 36, in crawl
    const requestQueue = await Apify.openRequestQueue(dealer.handle + '-refresh');
  File "/home/****-crawler/releases/20210114184826/spider/refresh.js", line 234, col 11, in null.<anonymous>
    await crawl();

To Reproduce

It's practically impossible to reproduce right now: we didn't change our code, and the error occurs only occasionally, and only since upgrading the SDK.

System information:

Thanks, Chris

mnmkng commented 3 years ago

Yeah, there's a completely new SQLite-based Request Queue in 0.22. This looks like an issue with the teardown of the database connection.

Could you confirm that I understand the process correctly?

  1. You have a crawler running and that crawler finishes.
  2. After the crawler finishes, you call openRequestQueue on the queue that just finished.
  3. Sometimes an error is thrown in that call.

Does everything else work correctly when the error is not thrown?

cspeer commented 3 years ago

Almost, except I don't trigger the call to openRequestQueue deliberately after the crawler finishes. Sorry for not providing sample code.

Here is a simplified version of what we use (I believe this is very similar to your example code from the docs):

const Apify = require('apify');
process.env.APIFY_LOCAL_STORAGE_DIR = 'spider/store';

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue('apify-test');
  await requestQueue.addRequest({ url: 'https://www.versandhaus-schneider.de/' });
  const crawler = new Apify.CheerioCrawler({
    requestQueue,
    async handlePageFunction ({ request, $, page }) {
      console.log(`Processing ${request.url}`);

      await Apify.utils.enqueueLinks({
        page,
        $,
        selector: 'a[href]:not([rel="nofollow"])',
        requestQueue,
        pseudoUrls: [new Apify.PseudoUrl('https://www.versandhaus-schneider.de/[.*]')],
        baseUrl: 'https://www.versandhaus-schneider.de/',
        transformRequestFunction: (request) => {
          if (!/(\/cPath\/|product_info\.php)/.test(request.url)) {
            return null;
          }

          request.url = request.url.replace(/\?.+$/, '');

          return request;
        }
      });
    },
    minConcurrency: 3,
    maxConcurrency: 10,
    maxRequestRetries: 0,
    maxRequestsPerCrawl: 10000
  });

  await crawler.run();

  console.log('Crawler finished.');
});

So, here is how it goes:

  1. I let the crawler run with await crawler.run(); and (from what I can tell) it works as usual
  2. The crawler finishes (see the log output from OP)
  3. Now the error "The database connection is not open" is thrown by invoking const requestQueue = await Apify.openRequestQueue('apify-test');

So, maybe, because of concurrency, one of the last few calls actually closes the database connection before(!) the last call to the database is made. That would also explain why it only happens sometimes and not every time.
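
Something like this toy sketch of what I suspect (purely hypothetical; none of these names come from apify or @apify/storage-local):

const connection = { open: true };

async function finalQueueCall() {
  await Promise.resolve(); // resumes on a later microtask tick
  if (!connection.open) {
    throw new TypeError('The database connection is not open');
  }
}

const inFlight = finalQueueCall(); // scheduled while the crawler winds down...
connection.open = false;           // ...and teardown closes the connection first
inFlight.catch((err) => console.error(err.message));
// This sketch loses the race every time; in the real crawler the ordering
// would depend on concurrency, which would explain the intermittence.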

Does everything else work correctly when the error is not thrown?

Yes!

Thanks @mnmkng

cspeer commented 3 years ago

Just a small and polite bump. @AndreyBykov, any ideas so far?

AndreyBykov commented 3 years ago

Hey @cspeer, sorry, I'm finishing up something else at the moment, but I will probably have an update tomorrow. Will post here ASAP.

mnmkng commented 3 years ago

@cspeer I'm sorry but I can't reproduce it with the example code. How often does it happen to you? There must be something specific to your setup. Any post-processing? Parallelization? I noticed in the stack trace that it's on line 120 and the store is called *-refresh, which tells me it's not as simple as the example you shared.

File "/home/gunfinder-crawler/releases/20210114184826/spider/refresh.js", line 120, col 36, in crawl
    const requestQueue = await Apify.openRequestQueue(dealer.handle + '-refresh');

I still don't understand how an error can be thrown from a function that must have finished minutes ago. Can you change the contents of your node_modules in your build? If so, could you try adding this?

// utils.js - newStorageLocal()
process.on('exit', () => {
    // TODO this is not public API, need to update
    // storage local with some teardown
    try {
        storage.dbConnections.closeAllConnections();
    } catch (err) {
        console.error('Could not close database connections in the on-exit hook.');
        console.error(err);
    }
});
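
(As a general Node.js note: 'exit' handlers run after the event loop has stopped, so only synchronous work completes inside them; that's why the close call above isn't awaited.)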

I have no clue what else could be causing that error.

cspeer commented 3 years ago

Yeah, our code is a little more complex; that's why I tried breaking it down to something simple.

There must be something specific to your setup.

I thought the same thing; that's why I tried to figure it out before I even posted here, but I didn't find anything the first time. What I did miss is that we specifically drop the queue after it's finished. Here is our original crawl():

const crawl = async () => {
  const requestQueue = await Apify.openRequestQueue(dealer.handle + '-refresh');

  if (await requestQueue.isEmpty()) {
    Helpers.log('Loading URLs');
    const items = await Models.Item
      .query()
      .select('url')
      .whereNotNull('url')
      .where('source', dealer.handle)
    ;

    await Promise.all(_.map(items, async (item) => {
      return requestQueue.addRequest({ url: item.url });
    }));
  } else {
    Helpers.log('Resuming');
  }

  const config = {
    requestQueue,
    minConcurrency: dealer.spiderConfig.minConcurrency || 2,
    maxConcurrency: dealer.spiderConfig.maxConcurrency || 10,
    requestTimeoutSecs: dealer.spiderConfig.requestTimeoutSecs || 30,
    maxRequestRetries: dealer.spiderConfig.maxRequestRetries || 2,
    maxRequestsPerCrawl: dealer.spiderConfig.maxRequestsPerCrawl || 10000
  };

  let crawler;

  if (dealer.spiderConfig.render) {
    config.launchPuppeteerOptions = { headless: true, handleSIGINT: false };
    config.handlePageFunction     = handlePuppeteer;
    crawler = new Apify.PuppeteerCrawler(config);
  } else {
    config.handlePageFunction    = handleCheerio;
    config.forceResponseEncoding = dealer.spiderConfig.encoding;
    config.postResponseFunction = handleErrorResponse;
    crawler = new Apify.CheerioCrawler(config);
  }

  await crawler.run();
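
  // The two calls below hit the queue only after the crawler has finished;
  // the next crawl() run is what reopens the dropped queue.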

  stats.queue = await requestQueue.getInfo();

  await requestQueue.drop();
};

It's also worth noting that we downgraded to 0.22.3 about 4 days ago and the error hasn't occurred since. Nonetheless, I'll add your try/catch and report back ASAP.

Thanks!!!

mnmkng commented 3 years ago

Ok, now we're getting somewhere. There are two more calls to requestQueue after the crawler finishes. I don't get how the downgrade to 0.22.3 could have helped, because there were no changes to the Request Queue between 0.22.3 and 0.22.4, but let's see.

await crawler.run();

stats.queue = await requestQueue.getInfo();

await requestQueue.drop();

Btw, I'm happy that you made use of postResponseFunction and forceResponseEncoding. We added these at the request of another user, and it's nice to see them being used by other people as well.

cspeer commented 3 years ago

I don't get how the downgrade to 0.22.3 could have helped

You and me both 😅

Alright, so what can I do about these two calls? Obviously, they only make sense to execute after the crawler has finished, because getting stats about the queue and dropping it while the queue is still being processed kinda defeats the purpose. // Edit: Both of these calls are relics from when the queue used the file system and not SQLite.

Btw, I'm happy that you made use of postResponseFunction and forceResponseEncoding. We added this at request of another user and it's nice to see that it's being used by other people as well.

Well, they are extremely helpful to us!!

mnmkng commented 3 years ago

Alright, so what can I do about these two calls? Obviously, they only make sense to execute after the crawler has finished, because getting stats about the queue and dropping it while the queue is still being processed kinda defeats the purpose.

Sorry, I did not mean that you need to remove them. I was just happy that I finally have something to work with when trying to find the bug.

cspeer commented 3 years ago

Oh right, cheers! Do you still want me to implement your try-catch?

mnmkng commented 3 years ago

Yes please, it would be nice to see if it has any impact.

cspeer commented 3 years ago

Alright, it took some time, but now it happened again. Your try-catch, @mnmkng, wasn't triggered though. But maybe this full stack trace will help you narrow it down:

TypeError: The database connection is not open
  File "/home/*****-crawler/releases/20210126123011/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 373, col 17, in RequestQueueEmulator._createTables
    this.db.prepare(`
  File "/home/*****-crawler/releases/20210126123011/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 37, col 14, in new RequestQueueEmulator
    this._createTables();
  File "/home/*****-crawler/releases/20210126123011/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 74, col 29, in RequestQueueClient._getEmulator
    this.emulator = new RequestQueueEmulator({
  File "/home/*****-crawler/releases/20210126123011/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 84, col 18, in RequestQueueClient.get
    this._getEmulator().updateAccessedAtById(this.id);
  File "/home/*****-crawler/releases/20210126123011/node_modules/apify/build/storages/storage_manager.js", line 97, col 53, in StorageManager._getOrCreateStorage
    const existingStorage = await storageClient.get();
  File "/home/*****-crawler/releases/20210126123011/node_modules/apify/build/storages/storage_manager.js", line 52, col 46, in StorageManager.openStorage
    const storageObject = await this._getOrCreateStorage(idOrName, this.name, client);
  File "/home/*****-crawler/releases/20210126123011/node_modules/apify/build/storages/request_queue.js", line 582, col 20, in Object.exports.openRequestQueue
    return manager.openStorage(queueIdOrName, options);
  File "/home/*****-crawler/releases/20210126123011/spider/discover.js", line 145, col 30, in crawl
    requestQueue = await Apify.openRequestQueue(dealer.handle + '-discover');
  File "/home/*****-crawler/releases/20210126123011/spider/discover.js", line 250, col 11, in null.<anonymous>
    await crawl();

Also, here's a visual aid from Sentry: [Sentry screenshot]

mnmkng commented 3 years ago

The above PR most likely fixes the problem, but unfortunately, apify@0.22.4 pins @apify/storage-local to the exact version 1.0.2 and therefore won't get the update.

Would you be willing to update to apify@1.0.0 or do I need to release 0.22.5 with the updated package version?

cspeer commented 3 years ago

Great news! Did 1.0.0 already get the patch though? npmjs.org says 1.0.0 was last published 8 days ago. Also, I'm not quite sure what changed from 0.x to 1.0, so I'm a little reluctant to trade up for the newer model ;)

mnmkng commented 3 years ago

1.0.0 will get the update automatically with a fresh install, because it uses ^1.0.3 as the version for @apify/storage-local.
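
That is, any 1.0.x release at or above 1.0.3 satisfies the caret range; you can check it directly in the installed package.json (illustrative; the exact formatting may differ):

$ cat node_modules/apify/package.json | grep storage-local
    "@apify/storage-local": "^1.0.3",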

mnmkng commented 3 years ago

You can read the release notes to see what's new.

cspeer commented 3 years ago

Cheers! I just updated the package. I'll check back in with you in a few days and give you an update. Thanks for your hard work @mnmkng

cspeer commented 3 years ago

I'm sorry, but the exact same error just occurred again. The stack trace appears to be the same:

TypeError: The database connection is not open
  File "/home/*****-crawler/releases/20210202194937/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 373, col 17, in RequestQueueEmulator._createTables
    this.db.prepare(`
  File "/home/*****-crawler/releases/20210202194937/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 37, col 14, in new RequestQueueEmulator
    this._createTables();
  File "/home/*****-crawler/releases/20210202194937/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 74, col 29, in RequestQueueClient._getEmulator
    this.emulator = new RequestQueueEmulator({
  File "/home/*****-crawler/releases/20210202194937/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 84, col 18, in RequestQueueClient.get
    this._getEmulator().updateAccessedAtById(this.id);
  File "/home/*****-crawler/releases/20210202194937/node_modules/apify/build/storages/storage_manager.js", line 97, col 53, in StorageManager._getOrCreateStorage
    const existingStorage = await storageClient.get();
  File "/home/*****-crawler/releases/20210202194937/node_modules/apify/build/storages/storage_manager.js", line 52, col 46, in StorageManager.openStorage
    const storageObject = await this._getOrCreateStorage(idOrName, this.name, client);
  File "/home/*****-crawler/releases/20210202194937/node_modules/apify/build/storages/request_queue.js", line 582, col 20, in Object.exports.openRequestQueue
    return manager.openStorage(queueIdOrName, options);
  File "/home/*****-crawler/releases/20210202194937/spider/discover.js", line 145, col 30, in crawl
    requestQueue = await Apify.openRequestQueue(dealer.handle + '-discover');
  File "/home/*****-crawler/releases/20210202194937/spider/discover.js", line 250, col 11, in null.<anonymous>
    await crawl();

I also double-checked; this is 1.0.0:

$ cat releases/20210202194937/node_modules/apify/package.json | grep version
    "version": "1.0.0",

mnmkng commented 3 years ago

What version of @apify/storage-local do you have installed?

cspeer commented 3 years ago

$ cat node_modules/@apify/storage-local/package.json | grep version
    "version": "1.0.3",

mnmkng commented 3 years ago

That's good news in a way. The fix has been released in v1.0.4. Maybe package-lock.json prevented the installation of the newer version?
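
If so, something like this should pull in the patched version (assuming npm here; adjust for your package manager):

$ npm update @apify/storage-local
$ cat node_modules/@apify/storage-local/package.json | grep version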

cspeer commented 3 years ago

Right you are. I'll give it another try and report back to you. Thanks!

cspeer commented 3 years ago

I think it's now safe to say that it's been fixed! Thanks @mnmkng :)