apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.47k stars, 664 forks

Apify cli command to put failed requests back to queue #1363

Open hungnv-sr opened 2 years ago

hungnv-sr commented 2 years ago

Describe the feature There was a time in beta when handled and pending requests in the queue were stored as JSON files. If we wanted to retry some failed requests, we could simply move them back to the pending requests manually and run again. Now the request queue is SQLite and there is no straightforward way to put them back into the request queue.

Motivation Occasionally, due to internet and other issues, some requests fail several times and are marked as failed. Some of them are very important requests and we don't want to run the entire application again. We also don't want to re-add them to the request queue in handleFailedRequestFunction, since that might flood our crawler with unnecessary requests.

Constraints We need an Apify CLI command to put failed requests back into the queue.

B4nan commented 2 years ago

FYI, in Crawlee the default storage is again using JSON files (and keeps things in memory rather than in an SQLite database), so you should be able to alter things manually again.

I still see value in the CLI command you requested, especially because it would be storage agnostic (it would work regardless of the storage backend).

metalwarrior665 commented 2 years ago

I support the idea of having this in the CLI.

As a reference for other users that might need this on the Apify platform, you can do it with this actor - https://apify.com/lukaskrivka/rebirth-failed-requests

metalwarrior665 commented 2 years ago

More people are asking about this, let's put it in the backlog.

mnmkng commented 2 years ago

Well, for this to work reliably, we have to add a flag that marks a request as failed. The Apify API does not support it, so it would have to be a Crawlee-only feature. So I guess we could hack it the way we hack the label, through the private __crawlee prop in userData. Wdyt @B4nan?
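Roughly like this, as a sketch of the idea (the failed flag here is hypothetical, nothing in Crawlee sets it today):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        // normal handling
    },
    failedRequestHandler: async ({ request }) => {
        // Piggyback a marker on the private __crawlee prop, the same way the
        // label is stored, so a future CLI command could find these requests
        // and re-enqueue them. The "failed" key is purely hypothetical.
        request.userData.__crawlee = { ...request.userData.__crawlee, failed: true };
    },
});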

sbrow commented 1 year ago

Has there been any progress on this, or at least a quick guide on how to do it manually? I've looked at metalwarrior's code, and oh my goodness that is some seriously hacky stuff. It's very easy for requests to fail due to overloaded proxies rather than because the actual URL is bad, and not being able to recover lost URLs is a SERIOUS issue. While a CLI would be great, IMO what would really be helpful is an API for examining and editing the state of the queue, i.e. listing all failed URLs, putting them back in the queue as though they're fresh, etc.

You shouldn't have to write custom requestHandlers for bogus URLs in order to do this.

I realize this sounds very negative, and I'd like to say that Crawlee is an excellent crawler with some awesome features. It's just that this particular shortfall makes it very difficult to iterate crawlers on large datasets.
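Just to illustrate the kind of API I mean, something along these lines (completely hypothetical, none of this exists in Crawlee today):

import { Request } from 'crawlee';

// Hypothetical queue-inspection helpers; the names are made up for illustration only.
interface FailedRequestTools {
    // List all requests that exhausted their retries.
    listFailedRequests(): Promise<Request[]>;
    // Put a failed request back in the queue as though it were fresh.
    renewRequest(request: Request): Promise<void>;
}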

mnmkng commented 1 year ago

@sbrow You could use https://www.npmjs.com/package/@apify/storage-local. It uses SQLite for the request queue, so you can then use any SQLite GUI application to inspect or manipulate the requests. You can inject it with the storageClient configuration option of Crawlee.
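Wiring it up should look roughly like this (a sketch, assuming the ApifyStorageLocal export of that package and the storageClient option mentioned above):

import { CheerioCrawler, Configuration } from 'crawlee';
import { ApifyStorageLocal } from '@apify/storage-local';

// Keep the storages in SQLite instead of the default in-memory/JSON-file storage.
const config = new Configuration({
    storageClient: new ApifyStorageLocal(),
});

const crawler = new CheerioCrawler(
    {
        requestHandler: async ({ request, $ }) => {
            // ...your scraping logic...
        },
    },
    config, // crawlers accept a Configuration instance as the second constructor argument
);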

sbrow commented 1 year ago

@mnmkng Thank you, I will definitely look into that. Would that be easier than editing the flat files that Crawlee normally generates? I've been trying to figure out how they work, but to no avail. E.g. where do failed requests go? Does deleting a request from storage/request_queues/default/*.json reset the request "state"? Etc.

mnmkng commented 1 year ago

Crawlee does not understand failed or successful requests at the queue level. It knows only pending or handled requests. A request that fails by exhausting all retries is also marked as handled in the end. We have request queue v2 on the roadmap, but it got derailed by other activities with higher priority.

metalwarrior665 commented 1 year ago

This code does what you need for the Apify platform. For local runs, the only difference is that instead of using the Apify client, you use the filesystem to list the requests and update them. I see this question keeps coming up, so we should probably release this for local dev too.

sbrow commented 1 year ago

SQLite ended up being a dead end for me. It makes it much easier to find URLs, but you have the same issues when attempting to restore them. That being said, I did make some progress:

  1. Find all requests in your queue that have orderNo: null and retryCount > 0.
  2. Reset their retryCount to 0.
  3. Clear their errorMessages.
  4. Remove json.handledAt.
  5. Update the stats file to reset failedRequests, etc. (TBD).

I wrote a shell script that will do all of this (destructively):

#!/usr/bin/env bash
# Destructively resets failed requests in the local request queue so they run again.
read -r -d '' JQ_SCRIPT <<'JQ'
if .json == null or .orderNo != null or .retryCount == 0 then
  # Not a request record (no .json field), still pending, or never retried: leave untouched.
  .
else
  . + {
    retryCount: 0,
    # A fresh, non-null orderNo puts the request back among the pending ones.
    orderNo: (now * 1000 | floor),
    # Reset the serialized request as well and drop its handledAt timestamp.
    json: ((.json | fromjson) + {retryCount: 0, errorMessages: []}) | del(.handledAt) | tojson
  }
end
JQ

for f in ./storage/request_queues/default/*.json; do
    cp "$f" "$f.tmp"
    jq "$JQ_SCRIPT" "$f.tmp" > "$f"
    rm "$f.tmp"
done

# TODO: Fix ./storage/key_value_stores/SDK_CRAWLER_STATISTICS_0.json

@metalwarrior665 I appreciate your code! However, I am only using Crawlee, not Apify, and I wasn't a fan of how it added a bunch of "dummy" URLs into the system, since I already have thousands of URLs that are hard to sift through.

Thank you both for your prompt replies!

metalwarrior665 commented 1 year ago

My code was meant as a separate process, and you can essentially replace BasicCrawler with a for loop to remove any extra Crawlee stuff. I mainly wanted to show what needs to be done to identify a failed request (errorMessages.length > 0 and errorMessages.length > retryCount) and renew it (reset errorMessages, retryCount and handledAt). But the shell script will do the same thing, good to have that, thanks!
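For a local run, the for loop could look roughly like this (a sketch in Node, assuming the same file layout your shell script works with):

import { readdir, readFile, writeFile } from 'node:fs/promises';
import { join } from 'node:path';

const dir = './storage/request_queues/default';

for (const file of await readdir(dir)) {
    if (!file.endsWith('.json')) continue;
    const path = join(dir, file);
    const entry = JSON.parse(await readFile(path, 'utf8'));
    if (typeof entry.json !== 'string') continue; // skip anything that is not a request record

    const request = JSON.parse(entry.json);
    // A failed request has more error messages than allowed retries.
    if ((request.errorMessages?.length ?? 0) <= (request.retryCount ?? 0)) continue;

    // Renew it: reset errorMessages and retryCount, drop handledAt, and give it
    // a fresh orderNo so it is treated as pending again.
    request.retryCount = 0;
    request.errorMessages = [];
    delete request.handledAt;
    entry.retryCount = 0;
    entry.orderNo = Date.now();
    entry.json = JSON.stringify(request);
    await writeFile(path, JSON.stringify(entry));
}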

sbrow commented 1 year ago

Really, it's nothing personal. I'm just in the middle of a scrape with over 30k URLs locally via Crawlee, and I'd like to be able to recover my failed requests rather than start all over again with an Apify Actor :)