terchris opened this issue 2 years ago
I encountered the same problem: if the crawler is called twice, the second call's requestHandler never responds. Then I realized that the requestQueue was probably cached, so I added an await requestQueue.drop(); at the end to clear the cache, and it seems to work.
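For reference, a minimal sketch of that workaround (my assumption: the crawler uses the default request queue, which RequestQueue.open() with no arguments returns):

```typescript
import { CheerioCrawler, RequestQueue } from 'crawlee';

const crawl = async (url: string) => {
    const crawler = new CheerioCrawler({
        async requestHandler({ request }) {
            console.log(`Crawled ${request.url}`);
        },
    });
    await crawler.run([url]);
    // Drop the default request queue so a later call does not
    // see these URLs as already handled.
    const requestQueue = await RequestQueue.open();
    await requestQueue.drop();
};

await crawl('https://example.com');
await crawl('https://example.com'); // handler should run again after the drop
```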
I have a related issue in crawlee 3.3.3. The Datasets indeed do not seem to be purged by PURGE_ON_START or purgeDefaultStorages. I wrote this small test to confirm it:
it("shows that purgeDefaultStorages doesn't do anything?", async () => {
    const crawler = new CheerioCrawler({
        async requestHandler() {
            await Dataset.pushData({ item: "asdf" });
        },
    });
    await crawler.run(["http://www.google.de"]);
    await purgeDefaultStorages();
    await crawler.run(["http://www.bing.de"]);
    expect((await Dataset.getData()).count).to.be.eq(1);
});
So I believe it is indeed a bug and the purge commands don't seem to work. However, I believe @terchris is running into the problem that the requestQueue needs to be dropped, as @xialer pointed out.
It still seems to be an issue. Any updates?
Purging works as expected. The problem here is that this is rather internal API that is not supposed to be used the way you are trying to use it - crawlee purges the default storages automatically (in other words, PURGE_ON_START defaults to true), and it is supposed to happen only once, as this purgeDefaultStorages method is called multiple times from various places and we want it to execute only once (on the very first call). I guess we could rework this a bit to support explicit purging too. For now you can try this:
const config = Configuration.getGlobalConfig();
const client = config.getStorageClient();
await client.purge?.();
I have the same problem reported here and have tried the solution you proposed @B4nan but no luck.
const crawlPage = async (seedUrl: string, onDocument: (html: string) => void) => {
    const crawler = new PlaywrightCrawler({
        launchContext: { launchOptions: { headless: true } },
        maxRequestRetries: 1,
        requestHandlerTimeoutSecs: 20,
        maxRequestsPerCrawl: 20,
        async requestHandler({ request, page, enqueueLinks }) {
            try {
                const html = await page.evaluate('document.body.innerHTML');
                // Publish this html
                onDocument(html);
                // If the page is part of a seed, visit the links
                await enqueueLinks({ strategy: EnqueueStrategy.SameHostname });
            } catch (err) {
                log.warn('Error processing url: ' + request.url);
            }
        },
    });
    await crawler.addRequests([seedUrl]);
    await crawler.run();
    try {
        const config = Configuration.getGlobalConfig();
        const client = config.getStorageClient();
        await client.purge?.();
    } catch (err) {
        log.warn('Failed to purge storage client');
    }
};
Running this fails because the second onDocument is never called. The page was already crawled.
test("crawl multiple URLs", async () => {
    const onDocument = jest.fn();
    await crawlPage("https://moveo.ai", onDocument);
    expect(onDocument).toHaveBeenCalled();
    const onDocumentSecond = jest.fn();
    await crawlPage("https://moveo.ai", onDocument);
    expect(onDocumentSecond).toHaveBeenCalled();
});
I actually get an error where the purge() is trying to delete a file:
Could not find file at /storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json
Running this fails because the second onDocument is never called. The page was already crawled.
That test seems to be wrong: you are not passing onDocumentSecond as the second argument, so it won't be called. It should be this instead:
test("crawl multiple URLs", async () => {
    const onDocument = jest.fn();
    await crawlPage("https://moveo.ai/", onDocument);
    expect(onDocument).toHaveBeenCalled();
    const onDocumentSecond = jest.fn();
    await crawlPage("https://moveo.ai/", onDocumentSecond); // <--
    expect(onDocumentSecond).toHaveBeenCalled();
});
I actually get an error where the purge() is trying to delete a file.
And you get that from some actual usage, or in a test case? Are you calling that method in parallel?
UPDATE: @B4nan, you are right; in my attempt to clean up the code to paste here I made a typo. I actually check onDocumentSecond in my second expect().
The first time the method runs, it finds multiple pages, so onDocument is called around 22 times (maxRequestsPerCrawl=20).
The second time, the new mocked function onDocumentSecond isn't called because some state from the first run is stored somewhere, possibly in a variable within a module. If we had a teardown(), purge, or similar method to clean up the entire state, I believe this code would function properly.
I've tried various alternatives that I found in several issues similar to this one. I'm currently documenting them and planning to initiate a discussion with my findings.
Are you sure you are using the latest version? The run method itself is already doing the necessary cleanup:
Yeah. I'm using 3.4.0
I'm having the same problem in 3.4.2
For future reference: I solved the problem using the persistStorage: false configuration. You need to set it each time you instantiate a PlaywrightCrawler instance.
const crawlPage = async (seedUrl: string, onDocument: (html: string) => void) => {
    const crawler = new PlaywrightCrawler({
        launchContext: { launchOptions: { headless: true } },
        maxRequestRetries: 1,
        requestHandlerTimeoutSecs: 20,
        maxRequestsPerCrawl: 20,
        async requestHandler({ request, page, enqueueLinks }) {
            try {
                const html = await page.evaluate('document.body.innerHTML');
                // Publish this html
                onDocument(html);
                // If the page is part of a seed, visit the links
                await enqueueLinks({ strategy: EnqueueStrategy.SameHostname });
            } catch (err) {
                log.warn('Error processing url: ' + request.url);
            }
        },
    },
    new Configuration({ persistStorage: false })); // <---- Configuration
    await crawler.addRequests([seedUrl]);
    await crawler.run();
};
it("shows that purgeDefaultStorages doesn't do anything?", async () => {
    const crawler = new CheerioCrawler({
        async requestHandler() {
            await Dataset.pushData({ item: "asdf" });
        },
    },
    new Configuration({ persistStorage: false }));
    await crawler.run(["http://www.google.de"]);
    await purgeDefaultStorages();
    await crawler.run(["http://www.google.de"]);
    expect((await Dataset.getData()).count).to.be.eq(1);
});
I tried this @germanattanasio, but this unit test still fails. Where am I wrong?
expect((await Dataset.getData()).count).to.be.eq(1);
This call will use the global config, therefore the same storage. You have three options:
1. Configuration.set() (or use env vars, namely https://crawlee.dev/docs/guides/configuration#crawlee_purge_on_start)
2. Dataset.open(null, { config }) and call getData on that
3. crawler.getData(), which respects the config you pass to the crawler
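A sketch of the last two options, assuming the crawler is given its own Configuration instance and the handler writes through the context's pushData (which uses the crawler's config rather than the global one):

```typescript
import { CheerioCrawler, Configuration, Dataset } from 'crawlee';

const config = new Configuration({ persistStorage: false });
const crawler = new CheerioCrawler({
    // pushData from the context writes to the crawler's own dataset.
    async requestHandler({ pushData }) {
        await pushData({ item: 'asdf' });
    },
}, config);

await crawler.run(['http://www.google.de']);

// Option 2: open the dataset with the same config the crawler uses.
const dataset = await Dataset.open(null, { config });
console.log((await dataset.getData()).count);

// Option 3: read through the crawler itself, which holds that config.
console.log((await crawler.getData()).count);
```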
Which package is this bug report for? If unsure which one to select, leave blank
No response
Issue description
I suspect that client.purge is not run when crawlee starts, as described in the docs. Setting CRAWLEE_PURGE_ON_START=true or false has no effect. All the files are still in key_value_stores/default/, e.g. SDK_CRAWLER_STATISTICS_12.json. I have set CRAWLEE_STORAGE_DIR="tmpfilesystem/crawlee", so it might be related.
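For what it's worth, the same switch can also be set in code; a sketch assuming the purgeOnStart option of Configuration, which backs the CRAWLEE_PURGE_ON_START env var:

```typescript
import { CheerioCrawler, Configuration } from 'crawlee';

// Control purging explicitly via the Configuration instead of the env var.
const config = new Configuration({ purgeOnStart: true });
const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        console.log(request.url);
    },
}, config);
```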
If I delete the directory tmpfilesystem/crawlee and run the code below, it works just fine: the website is scraped and the title of the website is displayed. The second time the code is run, it does not work. If I delete all the files and try again, then it works.
This is debugging from the first run:
This is debugging from the second run:
Code sample
Package version
crawlee@3.1.0
Node.js version
v16.17.0
Operating system
mac os
Apify platform
Priority this issue should have
Medium (should be fixed soon)
I have tested this on the next release
No response
Other context
No response