hungnv-sr opened this issue 2 years ago
FYI in Crawlee the default storage again uses JSON files (and keeps things in memory rather than in an SQLite database), so you should be able to alter things manually again.
I still see value in the CLI command you requested, especially because it would be storage-agnostic (it would work regardless of the storage backend).
I support the idea of having this in the CLI.
As a reference for other users that might need this on the Apify platform, you can do it with this actor - https://apify.com/lukaskrivka/rebirth-failed-requests
More people are asking about this, let's put it in the backlog.
Well, for this to work reliably, we have to add a flag that marks a request as failed. The Apify API does not support it, so it would have to be a Crawlee-only feature. So I guess we could hack it the way we hack label, through the private __crawlee prop in userData. Wdyt @B4nan?
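For illustration, a minimal sketch of what that hack could look like, assuming a hypothetical wasFailed flag stored under the private __crawlee prop (nothing below is an existing Crawlee feature, and whether the mutated userData would actually be persisted back to the queue still needs to be verified):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // ... normal scraping logic ...
        console.log(`Scraped ${request.url}: ${$('title').text()}`);
    },
    // Runs once all retries are exhausted.
    failedRequestHandler: async ({ request }) => {
        // Hypothetical: piggyback on the private __crawlee prop in userData
        // to mark the request as failed, the same way label is hacked in.
        const internal = (request.userData.__crawlee ?? {}) as Record<string, unknown>;
        request.userData.__crawlee = { ...internal, wasFailed: true };
    },
});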
Has there been any progress on this, or at least a quick guide on how to do it manually? I've looked at metalwarrior's code, and oh my goodness, that is some seriously hacky stuff. It's very easy for requests to fail due to overloaded proxies rather than because the actual url is bad, so not being able to recover lost urls is a SERIOUS issue. While a CLI would be great, IMO what would really be helpful is an API for examining and editing the state of the queue, i.e. listing all failed urls, putting them back in the queue as though they're fresh, etc. You shouldn't have to write custom requestHandlers for bogus urls in order to do this.
I realize this sounds very negative, and I'd like to say that Crawlee is an excellent crawler with some awesome features. It's just that this particular shortfall makes it very difficult to iterate crawlers on large datasets.
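Purely as an illustration of the kind of API being asked for here, it might look something like this (every name below is invented; none of it exists in Crawlee today):

// Hypothetical wish-list interface for inspecting and editing queue state.
interface FailedRequestInfo {
    url: string;
    uniqueKey: string;
    retryCount: number;
    errorMessages: string[];
}

interface RequestQueueInspector {
    // List every request that exhausted its retries.
    listFailedRequests(): Promise<FailedRequestInfo[]>;
    // Put a failed request back into the queue as though it were fresh.
    reclaimAsNew(uniqueKey: string): Promise<void>;
}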
@sbrow you could use https://www.npmjs.com/package/@apify/storage-local, which uses SQLite for the request queue, so you can then use any SQLite GUI application to inspect or manipulate the requests. You can inject it with the storageClient configuration option of Crawlee.
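A rough sketch of that injection (double-check the option names against the current @apify/storage-local and Crawlee docs before relying on it):

import { CheerioCrawler, Configuration } from 'crawlee';
import { ApifyStorageLocal } from '@apify/storage-local';

// SQLite-backed storage client instead of the default JSON-file storage.
const storageClient = new ApifyStorageLocal();

const crawler = new CheerioCrawler(
    {
        requestHandler: async ({ request }) => {
            console.log(`Handling ${request.url}`);
        },
    },
    // Pass the client through the storageClient configuration option.
    new Configuration({ storageClient }),
);

await crawler.run(['https://crawlee.dev']);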
@mnmkng Thank you, I will definitely look into that. Would that be easier than editing the flat files that crawlee normally generates? I've been trying to figure out how they work, but to no avail. i.e. Where do failed requests go? Does deleting a request from storage/request_queues/default/*.json reset the request "state"? etc.
Crawlee does not understand failed or successful requests on the queue level. It knows only pending or handled requests. A request that fails by exhausting all retries is also marked as handled in the end. We have request queue v2 on the roadmap, but it got derailed by other activities with higher prio.
This code does what you need for the Apify platform. For local, the only difference is that instead of using the Apify client, you will use the filesystem for listing requests and updating them. I see this question is repeating so we should probably release this for local dev.
sqlite ended up being a dead end for me. It makes it much easier to find urls, but you have the same issues in attempting to restore them. That being said, I did make some progress.
Failed requests can be identified by orderNo: null and retryCount > 0. To renew one: reset retryCount to 0, clear errorMessages, and delete json.handledAt.
I wrote a shell script that will do all of this (destructively):
#!/usr/bin/env bash
# Renew failed requests in the local queue: a request that exhausted its retries
# has orderNo == null and retryCount > 0; give it a fresh orderNo, reset the
# retry counter and error messages, and drop handledAt from the inner JSON.
read -r -d '' JQ_SCRIPT <<'JQ'
if .orderNo != null or .retryCount == 0 then
  .
else
  . + {
    retryCount: 0,
    orderNo: (now * 1000 | floor),
    json: ((.json | fromjson) + {retryCount: 0, errorMessages: []}) | del(.handledAt) | tojson
  }
end
JQ
for f in ./storage/request_queues/default/*.json; do
  cp "$f" "$f.tmp"
  jq "$JQ_SCRIPT" "$f.tmp" > "$f"
  rm "$f.tmp"
done
# TODO: Fix ./storage/key_value_stores/SDK_CRAWLER_STATISTICS_0.json
@metalwarrior665 I appreciate your code! However I am only using crawlee, not apify, and I wasn't a fan of how it added a bunch of "dummy" urls into the system, since I already have thousands of urls that are hard to sift through.
Thank you both for your prompt replies!
My code was meant as a separate process and you can essentially replace BasicCrawler with a for loop to remove any extra Crawlee stuff. I wanted to mainly show what needs to be done to identify the failed request (errorMessages.length > 0 and errorMessages.length > retryCount) and renew it (reset errorMessages, retryCount and handledAt). But the shell script will do the same thing, good to have that, thanks!
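For anyone who prefers Node over jq, here is a sketch of the same identify-and-renew pass over the local files. The field names mirror what the shell script above touches and may differ across storage versions, so verify them against your own storage directory first:

import { readdirSync, readFileSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';

const dir = './storage/request_queues/default';

for (const file of readdirSync(dir).filter((f) => f.endsWith('.json'))) {
    const path = join(dir, file);
    const record = JSON.parse(readFileSync(path, 'utf8'));

    // A handled-but-failed request has a null orderNo and a non-zero retryCount.
    if (record.orderNo !== null || record.retryCount === 0) continue;

    const inner = JSON.parse(record.json);
    delete inner.handledAt;       // forget that the request was "handled"
    inner.retryCount = 0;         // reset retries
    inner.errorMessages = [];     // clear previous errors

    record.retryCount = 0;
    record.orderNo = Date.now();  // put it back into the pending ordering
    record.json = JSON.stringify(inner);

    writeFileSync(path, JSON.stringify(record, null, 2));
}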
Really, it's nothing personal. I'm just in the middle of a scrape with over 30k urls locally via crawlee, and I'd like to be able to recover my failed requests rather than start all over again with an apify Actor :)
Describe the feature
There was a time in beta when handled and pending requests in the queue were stored as JSON files. If we wanted to retry some failed requests, we could simply move them back to the pending requests manually and run again. Now the request queue is SQLite and there is no straightforward way to put them back into the queue.
Motivation
Occasionally, due to internet and other issues, some requests fail several times and are marked as failed. Some of them are very important and we don't want to run the entire application again. We also don't want to re-add them to the request queue in handleFailedRequestFunction, since that might flood our crawler with unnecessary requests.
Constraints
We need an Apify CLI command to put failed requests back into the queue.