Closed. cspeer closed this issue 3 years ago.
Yeah, there's a completely new SQLite based Request Queue in 0.22. This looks like some issue with teardown of the database connection.
Could you confirm that I understand the process correctly? You call `openRequestQueue` on the queue that just finished. Does everything else work correctly when the error is not thrown?

Almost, except I don't trigger the call to `openRequestQueue` deliberately after the crawler finishes. Sorry for not providing sample code.
Here is a simplified version of what we use (I believe this is very similar to your example code from the docs):
```javascript
const Apify = require('apify');

process.env.APIFY_LOCAL_STORAGE_DIR = 'spider/store';

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue('apify-test');
    await requestQueue.addRequest({ url: 'https://www.versandhaus-schneider.de/' });
    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        async handlePageFunction({ request, $, page }) {
            console.log(`Processing ${request.url}`);
            await Apify.utils.enqueueLinks({
                page,
                $,
                selector: 'a[href]:not([rel="nofollow"])',
                requestQueue,
                pseudoUrls: [new Apify.PseudoUrl('https://www.versandhaus-schneider.de/[.*]')],
                baseUrl: 'https://www.versandhaus-schneider.de/',
                transformRequestFunction: (request) => {
                    if (!/(\/cPath\/|product_info\.php)/.test(request.url)) {
                        return null;
                    }
                    request.url = request.url.replace(/\?.+$/, '');
                    return request;
                }
            });
        },
        minConcurrency: 3,
        maxConcurrency: 10,
        maxRequestRetries: 0,
        maxRequestsPerCrawl: 10000
    });
    await crawler.run();
    console.log('Crawler finished.');
});
```
So, here is how it goes: I call `await crawler.run();` and (from what I can tell) it works as usual. Then `The database connection is not open` is thrown from invoking `const requestQueue = await Apify.openRequestQueue('apify-test');`.
So, maybe, because of concurrency, one of the last few calls actually closes the database connection before(!) the last call to the database is made. That would also explain why it only happens sometimes and not every time.
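That suspicion can be illustrated with a small simulation (hypothetical objects, not Apify or better-sqlite3 internals): if teardown runs as soon as the first concurrent task finishes, a slower task still using the shared connection fails with exactly this kind of error.

```javascript
// Hypothetical simulation of the suspected race; `db`, `task`, and `main`
// are illustrative names, not real Apify APIs.
const db = {
    open: true,
    prepare() {
        if (!this.open) throw new TypeError('The database connection is not open');
        return 'stmt';
    },
    close() { this.open = false; },
};

// Each task uses the shared connection; the first finisher tears it down.
async function task(delayMs, { closeOnFinish }) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    const result = db.prepare();
    if (closeOnFinish) db.close(); // premature teardown
    return result;
}

async function main() {
    const results = await Promise.allSettled([
        task(10, { closeOnFinish: true }),  // finishes first, closes the db
        task(20, { closeOnFinish: false }), // then hits the closed connection
    ]);
    return results.map((r) => r.status);
}
```

Running `main()` yields one fulfilled and one rejected task, which would also match the "only sometimes" behavior: the outcome depends purely on scheduling order.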
Does everything else work correctly when the error is not thrown?
Yes!
Thanks @mnmkng
Just a small and polite bump, @AndreyBykov. Any ideas so far?
Hey @cspeer, sorry, I'm finishing something else at the moment; will probably have an update tomorrow. Will update here ASAP.
@cspeer I'm sorry, but I can't reproduce it with the example code. How often does it happen to you? There must be something specific to your setup. Any post-processing? Parallelization? I noticed in the stack trace that it's on line 120 and the store is called `*-refresh`, which tells me it's not as simple as the example you shared.
```
File "/home/gunfinder-crawler/releases/20210114184826/spider/refresh.js", line 120, col 36, in crawl
    const requestQueue = await Apify.openRequestQueue(dealer.handle + '-refresh');
```
I still don't understand how an error can be thrown from a function that must have finished minutes ago. Can you change the content of your `node_modules` in your build? Could you try adding this?
```javascript
// utils.js - newStorageLocal()
process.on('exit', () => {
    // TODO this is not public API, need to update
    // storage local with some teardown
    try {
        storage.dbConnections.closeAllConnections();
    } catch (err) {
        console.error('Could not close database connections in the on-exit hook.');
        console.error(err);
    }
});
```
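For context on what such a hook would tear down, here is a sketch of the kind of connection registry a `closeAllConnections()` call implies. The names are modeled on the snippet above and are purely illustrative, not the real `@apify/storage-local` internals.

```javascript
// Hypothetical connection registry; `DbConnections` and `register` are
// illustrative names, not the actual @apify/storage-local API.
class DbConnections {
    constructor() {
        this.connections = new Set();
    }

    // Track a connection so it can be torn down later.
    register(conn) {
        this.connections.add(conn);
        return conn;
    }

    // Close every tracked connection, e.g. from an on-exit hook.
    // Note that 'exit' handlers in Node.js must do their work synchronously.
    closeAllConnections() {
        for (const conn of this.connections) conn.close();
        this.connections.clear();
    }
}
```

The important constraint is that `process.on('exit', ...)` handlers run synchronously, so any close calls made there must not be async.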
I have no clue what else could be causing that error.
Yeah, our code is a little more complex, that's why I tried breaking it down to something simple.
There must be something specific to your setup.
I thought the same thing, which is why I tried figuring it out before I even posted here. But I didn't find anything the first time. What I did miss is that we specifically drop the queue after it's finished. Here is our original `crawl()`:
```javascript
const crawl = async () => {
    const requestQueue = await Apify.openRequestQueue(dealer.handle + '-refresh');
    if (await requestQueue.isEmpty()) {
        Helpers.log('Loading URLs');
        const items = await Models.Item
            .query()
            .select('url')
            .whereNotNull('url')
            .where('source', dealer.handle);
        await Promise.all(_.map(items, async (item) => {
            return requestQueue.addRequest({ url: item.url });
        }));
    } else {
        Helpers.log('Resuming');
    }
    const config = {
        requestQueue,
        minConcurrency: dealer.spiderConfig.minConcurrency || 2,
        maxConcurrency: dealer.spiderConfig.maxConcurrency || 10,
        requestTimeoutSecs: dealer.spiderConfig.requestTimeoutSecs || 30,
        maxRequestRetries: dealer.spiderConfig.maxRequestRetries || 2,
        maxRequestsPerCrawl: dealer.spiderConfig.maxRequestsPerCrawl || 10000
    };
    let crawler;
    if (dealer.spiderConfig.render) {
        config.launchPuppeteerOptions = { headless: true, handleSIGINT: false };
        config.handlePageFunction = handlePuppeteer;
        crawler = new Apify.PuppeteerCrawler(config);
    } else {
        config.handlePageFunction = handleCheerio;
        config.forceResponseEncoding = dealer.spiderConfig.encoding;
        config.postResponseFunction = handleErrorResponse;
        crawler = new Apify.CheerioCrawler(config);
    }
    await crawler.run();
    stats.queue = await requestQueue.getInfo();
    await requestQueue.drop();
};
```
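The two calls after `crawler.run()` are exactly where a teardown race would surface. As a defensive sketch (an assumption, not an official fix, and only sensible if the queue stats are non-critical), they could be guarded so that this specific error does not fail the whole run:

```javascript
// Sketch of a guard around the post-run queue calls; `finishQueue` is a
// hypothetical helper, and it assumes skipping the stats is acceptable
// when the connection was already torn down.
async function finishQueue(requestQueue, stats) {
    try {
        stats.queue = await requestQueue.getInfo();
        await requestQueue.drop();
    } catch (err) {
        if (!/database connection is not open/i.test(err.message)) throw err;
        console.error('Queue teardown raced the crawl; skipping stats:', err.message);
    }
}
```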
It's also worth noting that we downgraded to 0.22.3 about 4 days ago and the error hasn't occurred since.
Nonetheless, I'll add your `try`/`catch` and report back ASAP.
Thanks!!!
Ok, now we're getting somewhere. There are 2 more calls to `requestQueue` after the crawler finishes. I don't get how the downgrade to 0.22.3 could have helped, because there were no changes to Request Queue between 0.22.3/4, but let's see.

```javascript
await crawler.run();
stats.queue = await requestQueue.getInfo();
await requestQueue.drop();
```
Btw, I'm happy that you made use of `postResponseFunction` and `forceResponseEncoding`. We added those at the request of another user and it's nice to see that they're being used by other people as well.
I don't get how the downgrade to 0.22.3 could have helped
You and me both 😅
Alright, so what can I do about these 2 requests? Obviously, I think they only make sense to execute after the crawler has finished, because getting stats about the queue and dropping it while the queue is still being processed kinda defeats the purpose. // Edit: Both of these calls are relics from when the queue used the file system and not SQLite.
Btw, I'm happy that you made use of postResponseFunction and forceResponseEncoding. We added this at request of another user and it's nice to see that it's being used by other people as well.
Well, they are extremely helpful to us!!
Alright, so what can I do about these 2 requests? Obviously I think they only make sense to execute after the crawler has finished, because getting stats about the queue and dropping it while the queue is still being processed kinda defeats the purpose.
Sorry, I did not mean that you need to remove them. I was just happy that I finally have something to work with when trying to find the bug.
Oh right, cheers! Do you still want me to implement your try-catch?
Yes please, would be nice to see if it has any impact.
Alright, it took some time, but now it happened again. Your try-catch wasn't triggered though, @mnmkng. But maybe this full stack trace will help you narrow it down:
```
TypeError: The database connection is not open
  File "/home/*****-crawler/releases/20210126123011/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 373, col 17, in RequestQueueEmulator._createTables
    this.db.prepare(`
  File "/home/*****-crawler/releases/20210126123011/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 37, col 14, in new RequestQueueEmulator
    this._createTables();
  File "/home/*****-crawler/releases/20210126123011/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 74, col 29, in RequestQueueClient._getEmulator
    this.emulator = new RequestQueueEmulator({
  File "/home/*****-crawler/releases/20210126123011/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 84, col 18, in RequestQueueClient.get
    this._getEmulator().updateAccessedAtById(this.id);
  File "/home/*****-crawler/releases/20210126123011/node_modules/apify/build/storages/storage_manager.js", line 97, col 53, in StorageManager._getOrCreateStorage
    const existingStorage = await storageClient.get();
  File "/home/*****-crawler/releases/20210126123011/node_modules/apify/build/storages/storage_manager.js", line 52, col 46, in StorageManager.openStorage
    const storageObject = await this._getOrCreateStorage(idOrName, this.name, client);
  File "/home/*****-crawler/releases/20210126123011/node_modules/apify/build/storages/request_queue.js", line 582, col 20, in Object.exports.openRequestQueue
    return manager.openStorage(queueIdOrName, options);
  File "/home/*****-crawler/releases/20210126123011/spider/discover.js", line 145, col 30, in crawl
    requestQueue = await Apify.openRequestQueue(dealer.handle + '-discover');
  File "/home/*****-crawler/releases/20210126123011/spider/discover.js", line 250, col 11, in null.<anonymous>
    await crawl();
```
Also, here's a visual aid from Sentry:
The above PR most likely fixes the problem, but unfortunately, apify@0.22.4 uses a specific version of @apify/storage-local@1.0.2 and therefore it won't get the update. Would you be willing to update to apify@1.0.0, or do I need to release 0.22.5 with the updated package version?
Great news! Did 1.0.0 already get the patch though? npmjs.org says 1.0.0 was last published 8 days ago. Also, I'm not quite sure what changed from 0.x to 1.0, so I'm a little reluctant to trade up for the newer model ;)
1.0.0 will get the update automatically with a fresh install, because it uses `^1.0.3` as the version for `@apify/storage-local`. You can read the release notes to see what's new.
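The caret range is what makes this work: `^1.0.3` accepts any version at or above 1.0.3 but below 2.0.0, so a patched 1.0.4 qualifies. A hand-rolled illustration of the rule (a simplification for major versions >= 1; the real resolver is the `semver` package):

```javascript
// Simplified illustration of npm's caret range for major versions >= 1
// (not the real `semver` package; 0.x caret ranges behave differently).
// ^1.0.3 accepts versions >= 1.0.3 and < 2.0.0.
function satisfiesCaret(version, base) {
    const [vMaj, vMin, vPat] = version.split('.').map(Number);
    const [bMaj, bMin, bPat] = base.split('.').map(Number);
    if (vMaj !== bMaj) return false;       // ^1.x never crosses to 2.x
    if (vMin !== bMin) return vMin > bMin; // a higher minor is fine
    return vPat >= bPat;                   // same minor: need >= base patch
}
```

Under this rule `satisfiesCaret('1.0.4', '1.0.3')` is true while `satisfiesCaret('2.0.0', '1.0.3')` is false, which is why a fresh install picks up the fix without a new apify release.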
Cheers! I just updated the package. I'll check back in with you in a few days and give you an update. Thanks for your hard work @mnmkng
I'm sorry, but the exact same error just occurred again. The stack trace appears to be the same:
```
TypeError: The database connection is not open
  File "/home/*****-crawler/releases/20210202194937/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 373, col 17, in RequestQueueEmulator._createTables
    this.db.prepare(`
  File "/home/*****-crawler/releases/20210202194937/node_modules/@apify/storage-local/src/emulators/request_queue_emulator.js", line 37, col 14, in new RequestQueueEmulator
    this._createTables();
  File "/home/*****-crawler/releases/20210202194937/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 74, col 29, in RequestQueueClient._getEmulator
    this.emulator = new RequestQueueEmulator({
  File "/home/*****-crawler/releases/20210202194937/node_modules/@apify/storage-local/src/resource_clients/request_queue.js", line 84, col 18, in RequestQueueClient.get
    this._getEmulator().updateAccessedAtById(this.id);
  File "/home/*****-crawler/releases/20210202194937/node_modules/apify/build/storages/storage_manager.js", line 97, col 53, in StorageManager._getOrCreateStorage
    const existingStorage = await storageClient.get();
  File "/home/*****-crawler/releases/20210202194937/node_modules/apify/build/storages/storage_manager.js", line 52, col 46, in StorageManager.openStorage
    const storageObject = await this._getOrCreateStorage(idOrName, this.name, client);
  File "/home/*****-crawler/releases/20210202194937/node_modules/apify/build/storages/request_queue.js", line 582, col 20, in Object.exports.openRequestQueue
    return manager.openStorage(queueIdOrName, options);
  File "/home/*****-crawler/releases/20210202194937/spider/discover.js", line 145, col 30, in crawl
    requestQueue = await Apify.openRequestQueue(dealer.handle + '-discover');
  File "/home/*****-crawler/releases/20210202194937/spider/discover.js", line 250, col 11, in null.<anonymous>
    await crawl();
```
I also double-checked, this is 1.0.0:

```shell
$ cat releases/20210202194937/node_modules/apify/package.json | grep version
  "version": "1.0.0",
```
What version of `@apify/storage-local` do you have installed?

```shell
$ cat node_modules/@apify/storage-local/package.json | grep version
  "version": "1.0.3",
```
That's good news, in a way. The fix has been released in v1.0.4. Maybe `package-lock.json` prevented the install of the newer version?
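That would fit: the lockfile records a resolved version, and reinstalls keep using it until the entry is refreshed. A tiny self-contained illustration with fixture data (a demo file, not a real project's lockfile):

```shell
# Fixture lockfile illustrating a pinned transitive dependency (demo data only).
cat > /tmp/demo-package-lock.json <<'EOF'
{
  "dependencies": {
    "@apify/storage-local": { "version": "1.0.3" }
  }
}
EOF

# The resolved version stays at 1.0.3 until the lock entry is refreshed,
# e.g. with `npm update @apify/storage-local` or by deleting the lockfile
# before reinstalling.
node -p "JSON.parse(require('fs').readFileSync('/tmp/demo-package-lock.json', 'utf8')).dependencies['@apify/storage-local'].version"
```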
Right you are. I'll give it another try and report back to you. Thanks!
I think it now is safe to say that it's been fixed! Thanks @mnmkng :)
Describe the bug
We just upgraded from 0.21.4 to 0.22.4. Now, we sometimes get a `The database connection is not open` error when executing after the queue finishes. Here is an example log output:

Here's the full stack trace:

To Reproduce
It's kind of impossible to reproduce right now, as we didn't change our code and the error only occurs occasionally, and only after upgrading the SDK.

System information:

Thanks, Chris