commoncrawl / nutch

Common Crawl fork of Apache Nutch
Apache License 2.0

Using Generator2 leads to fetcher failure #32

Open whjshj opened 1 week ago

whjshj commented 1 week ago

When I use Generator2 to generate the fetch list and then run the fetcher to download the webpages, with the number of threads set to 1, the task timeout is triggered (long timeout = conf.getInt("mapreduce.task.timeout", 10 * 60 * 1000)), causing the download task to terminate.
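For reference, a minimal nutch-site.xml sketch of the two settings involved here; mapreduce.task.timeout and fetcher.threads.fetch are standard Hadoop/Nutch property names, but the values below are placeholders only:

```xml
<!-- Sketch only: placeholder values, adjust to your cluster and crawl. -->
<property>
  <name>mapreduce.task.timeout</name>
  <!-- Hadoop task timeout in milliseconds; the default is 600000 (10 minutes) -->
  <value>1800000</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <!-- number of fetcher threads per map task -->
  <value>1</value>
</property>
```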

sebastian-nagel commented 1 week ago

Hi @whjshj, could you share your configuration (at least, all custom-set properties related to Generator2, Generator, URLPartitioner, and Fetcher)? Also, sharing the log files (job client stdout and hadoop.log or task logs) would help to debug the issue, thanks!

Two comments so far:

whjshj commented 1 week ago

Hello @sebastian-nagel, I'm using the settings from https://github.com/commoncrawl/cc-nutch-example; I only changed the number of threads to 1. In the initial fetching phase it runs normally. However, after some time only one thread is alive, and it is just waiting even though there is still data in the queue. The data isn't being selected because it exceeds the maximum number of threads, and the only active thread isn't processing it. This phenomenon is quite strange. Then a timeout is triggered, which is the map task timeout you mentioned. Have you encountered this situation before? [Uploading hadoop.log…]()

sebastian-nagel commented 1 week ago

only one thread alive, but it's just waiting, even though there is still data in the queue

Then the queue is blocked because the host of this queue responded (repeatedly) with an HTTP status code indicating a server error. This is quite common for wider web crawls, but it shouldn't happen if you crawl your intranet or your own server.
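A minimal sketch of how a queue blocked by repeated errors can be purged, assuming the stock Nutch property fetcher.max.exceptions.per.queue (the value below is a placeholder; the default of -1 deactivates the limit):

```xml
<!-- Sketch only: purge a host queue after N protocol-level exceptions. -->
<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>25</value>
</property>
```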

There are two options to ensure that the fetcher is fetching:

If either the time limit or the throughput threshold is hit, the current fetch cycle is stopped and the output is written to disk/HDFS. The script will then continue.
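A minimal sketch of these two options, assuming the stock Nutch properties fetcher.timelimit.mins and fetcher.throughput.threshold.pages (both deactivated by default; the values below are placeholders):

```xml
<!-- Sketch only: stop the fetch cycle after a wall-clock limit or when throughput collapses. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <!-- placeholder: stop the fetch cycle after 120 minutes -->
  <value>120</value>
</property>
<property>
  <name>fetcher.throughput.threshold.pages</name>
  <!-- placeholder: stop fetching if throughput drops below 1 page per second -->
  <value>1</value>
</property>
```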

In order to figure out the reason for the slow fetching, I need the log file.

whjshj commented 4 days ago

hadoop.log [Uploading hadoop.log…]() I have set the two parameters you mentioned, but I feel that they are not the cause. Even if a server error occurs and the HTTP request times out, the fetcher should move on to download the next webpage instead of stalling as shown in the log. In my previous reply I already uploaded the log, can you see it? I will upload the log again when I get back.

sebastian-nagel commented 4 days ago

Hi @whjshj, according to the hadoop.log, the fetch job fails in the reduce phase when writing the WARC files. The native libraries for the language detector are not installed:

whjshj commented 3 days ago

Hello @sebastian-nagel, I have identified the cause of the error when writing the WARC file, and I've already resolved the issue. Please take a look at the section before the WARC file is written; I have attached a screenshot. Can you see it?

Screenshot 2024-11-19 10 14 45

sebastian-nagel commented 3 days ago

I have identified the cause of the error when writing the WARC file, and I've already resolved the issue.

Great!

Ok, I see:

whjshj commented 2 days ago

"Thank you for your response; my confusion has been resolved. May I ask about the current situation of using Nutch-cc to crawl web pages? For example, in an iterative download, if a total of 1000 web pages need to be downloaded, how many of them are successfully downloaded in the end?"

sebastian-nagel commented 2 days ago

This totally depends on the fetch list:

whjshj commented 2 days ago

Thank you very much for your response. For the map tasks in the fetch stage, i.e. the actual downloading, do you have any recommendations for the map-related configuration? Currently I have allocated 1 core and 2 GB of memory to each map task. Is this configuration reasonable?

sebastian-nagel commented 2 days ago

I have set each map task to 1 core and 2GB of memory.

Yes, possible, under the assumption that

If you want to scale up, it's more efficient to parallelize first using threads (up to several hundred threads). Of course, more threads mean higher memory requirements to buffer the incoming data. Scaling up also requires adjusting many more parameters to your setup and context: connection pools, timeouts, etc.
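As a rough sketch of what scaling up might look like in configuration terms (standard Hadoop/Nutch property names, purely illustrative values that would need tuning per cluster and crawl):

```xml
<!-- Sketch only: illustrative values for a larger fetcher task. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <!-- more threads need more memory to buffer incoming data -->
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3g</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <!-- several hundred threads per fetcher map task -->
  <value>200</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <!-- keep per-host parallelism at 1 unless you crawl your own servers -->
  <value>1</value>
</property>
```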

whjshj commented 1 day ago

Thank you for your response. I am currently looking to deploy Nutch-cc on a large scale. Could you suggest some recommended configurations? For example, how many CPU cores and how much memory should be allocated to each map task and each reduce task? Additionally, during the fetch phase, what would be an appropriate number of concurrent download threads to set?