whjshj opened this issue 1 week ago
Hi @whjshj, could you share your configuration (at least, all custom-set properties related to Generator2, Generator, URLPartitioner and Fetcher)? Also, sharing the log files (job client stdout and hadoop.log or task logs) would help to debug the issue, thanks!
Two comments so far:
a fetch queue can be blocked for a while when its server responds with errors, see fetcher.exceptions.per.queue.delay and http.robots.503.defer.visits. When the fetch list includes only URLs from such a site, then crawling becomes stale and the timeout will be reached. However, if not configured otherwise, the Fetcher is shutting down already at 50% of the MapReduce task timeout to prevent that the task fails.

hello @sebastian-nagel I'm using the settings from https://github.com/commoncrawl/cc-nutch-example; I only changed the number of threads to 1. In the initial fetching phase it runs normally. However, after some time there is only one thread alive, but it is just waiting, even though there is still data in the queue. The data isn't being selected because it would exceed the maximum number of threads, and the only active thread isn't processing it. This behaviour is quite strange. Then a timeout is triggered, which is what you mentioned as the map task timeout. Have you encountered this situation before? [Uploading hadoop.log…]()
> only one thread alive, but it's just waiting, even though there is still data in the queue
Then the queue is blocked because the host of this queue responded (repeatedly) with an HTTP status code indicating a server error. This is quite common for wider web crawls, but it shouldn't happen if you crawl your intranet or own server.
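If you want to verify that this deferral is what blocks your single queue, one option for a small test crawl is to relax the robots-503 deferral. A minimal sketch for conf/nutch-site.xml, assuming a Nutch version that knows this property (check nutch-default.xml for the exact semantics and defaults before changing anything):

```xml
<!-- Sketch only: relax the "defer visits on robots.txt 5xx" behaviour for a
     small test crawl. Verify the property in your nutch-default.xml first. -->
<property>
  <name>http.robots.503.defer.visits</name>
  <!-- when false, a 5xx response for robots.txt no longer suspends the whole queue -->
  <value>false</value>
</property>
<!-- fetcher.exceptions.per.queue.delay (mentioned above) controls how long a queue
     is delayed after fetch exceptions; read its description before changing it. -->
```

For a wider web crawl you would normally leave the deferral enabled, since it protects servers that signal they are temporarily overloaded.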
There are two options to ensure that the fetcher is fetching:
- fetcher.timelimit.mins: a hard limit on how long (in minutes) the fetcher is allowed to run.
- fetcher.throughput.threshold.pages: by default it is checked after 5 minutes; fetching is stopped if the throughput in fetched pages per second drops below the threshold.

If either the time limit or the throughput threshold is hit, the current fetching cycle is stopped and the output is written to disk/HDFS. The script will then continue.
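For illustration, both limits could be set in conf/nutch-site.xml (or passed as -D options); the values below are examples, not recommendations:

```xml
<!-- Example values only; -1 disables either check. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <!-- stop the fetch cycle after at most 120 minutes -->
  <value>120</value>
</property>
<property>
  <name>fetcher.throughput.threshold.pages</name>
  <!-- stop fetching if fewer than 1 page/second is fetched
       (checked only after the initial grace period mentioned above) -->
  <value>1</value>
</property>
```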
In order to figure out the reason for the slow fetching, I need the log file.
hadoop.log [Uploading hadoop.log…]() I have set the two parameters you mentioned, but I feel that they are not the cause. Even if a server error occurs and the HTTP request times out, it should move on to download the next webpage instead of just hanging there. In my previous reply I already uploaded the log, can you see it? I will upload the log again when I get back.
Hi @whjshj, according to the hadoop.log, the fetch job fails in the reduce phase when writing the WARC files. The native libraries for the language detector are not installed:

sudo apt install libcld2-0 libcld2-dev

See also the README of cc-nutch-example. Alternatively, the language detection can be turned off via warc.detect.language; in the example, you'd need to modify the crawl.sh script, line 45.

hello @sebastian-nagel I have identified the cause of the error when writing the WARC file, and I've already resolved the issue. Please take a look at the part just before the WARC file is written; I have taken a screenshot. Can you see it?
> I have identified the cause of the error when writing the WARC file, and I've already resolved the issue.
Great!
Ok, I see:
when fetching the robots.txt of 1-dot-name-meaning.appspot.com the server responded with an error indicating that it temporarily cannot be crawled, so fetches for this host are delayed by 5 minutes:
2024-11-13 19:41:31,060 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 54 fetching http://1-dot-name-meaning.appspot.com/numerology/expression/Goundamani%200 (queue crawl delay=5000ms)
2024-11-13 19:41:31,435 INFO o.a.n.f.FetcherThread [FetcherThread] Defer visits for queue 1-dot-name-meaning.appspot.com : http://1-dot-name-meaning.appspot.com/numerology/expression/Goundamani%200
2024-11-13 19:41:31,436 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 1-dot-name-meaning.appspot.com >> delayed next fetch by 300000 ms
See http.robots.503.defer.visits and related properties.

Note: in order to get more details logged, set the log level for org.apache.nutch.fetcher to DEBUG. I recommend setting org.apache.nutch.protocol to level DEBUG as well. This should give you more information about why the fetching failed.
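For example, assuming a Nutch version that ships conf/log4j2.xml (older releases configure the same package names in conf/log4j.properties), the two loggers could be raised like this:

```xml
<!-- Sketch: add inside the <Loggers> section of conf/log4j2.xml. -->
<Logger name="org.apache.nutch.fetcher" level="DEBUG"/>
<Logger name="org.apache.nutch.protocol" level="DEBUG"/>
```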
the last fetch:
2024-11-13 19:45:34,062 INFO o.a.n.f.FetchItemQueues [FetcherThread] Fetching http://1027kord.com/high-school-teenage-contraception/%200
about one minute later the fetch is aborted. The 50 slots in the queues are all occupied with URLs from the host with the deferred visits:
2024-11-13 19:46:34,793 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Aborting with 50 queued fetch items in 34 queues (queue feeder still alive).
2024-11-13 19:46:34,794 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] * queue: 1-dot-name-meaning.appspot.com >> dropping!
2024-11-13 19:46:34,794 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] Emptied all queues: 1 queues with 50 items
the total size of all queues is rather small, because you have only a single thread. There's a property fetcher.queue.depth.multiplier (default: 50) which is multiplied by the number of threads.
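That matches the log above: with the default multiplier of 50 and a single thread, the fetcher buffers at most 1 × 50 = 50 items, which is exactly the "50 queued fetch items" it aborted with. If you stay with few threads, the multiplier could be raised in conf/nutch-site.xml (example value only):

```xml
<!-- Example only: total in-memory queue capacity is
     fetcher.queue.depth.multiplier x number of fetcher threads. -->
<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>200</value>
</property>
```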
there is an open issue and PR (not yet merged) which improves how the fetcher shuts down its threads in such situations, see NUTCH-3072. But it does not avoid the situation.
one point I do not understand: if mapreduce.task.timeout is configured to be 10 minutes and fetcher.threads.timeout.divisor is 2 (both defaults), then the "aborting" should happen 5 minutes after the last fetch, not about one minute later as in the log above.
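For reference, these are the two settings behind that arithmetic; with the defaults the fetcher should give up roughly 600000 ms / 2 = 5 minutes after the last activity:

```xml
<!-- Default values, shown only to make the expected 5-minute abort explicit. -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value> <!-- 10 minutes, in milliseconds -->
</property>
<property>
  <name>fetcher.threads.timeout.divisor</name>
  <value>2</value> <!-- fetcher aborts at timeout / divisor = 5 minutes -->
</property>
```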
otherwise: please use fetcher.timelimit.mins and fetcher.throughput.threshold.pages to ensure that a slow fetcher is shutting down, see my comment from a few days ago. Please note that fetcher.timelimit.mins is dynamically set in the crawl.sh script of the example.
"Thank you for your response; my confusion has been resolved. May I ask about the current situation of using Nutch-cc to crawl web pages? For example, in an iterative download, if a total of 1000 web pages need to be downloaded, how many of them are successfully downloaded in the end?"
This totally depends on the fetch list:
Thank you very much for your response. In the fetch stage, regarding the map-phase tasks, i.e. the downloading process, do you have any recommendations for the map-related configuration? Currently, I have set each map task to 1 core and 2GB of memory. Is this configuration reasonable?
> I have set each map task to 1 core and 2GB of memory.
Yes, possible, under the assumption that …
If you want to scale up, it's more efficient to parallelize first using threads (up to several hundred threads). Of course, more threads mean higher memory requirements to buffer the incoming data. Also, scaling up requires adjusting many more parameters to your setup and context: connection pools, timeouts, etc.
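As a rough illustration only (the right numbers depend on bandwidth, politeness settings, and the rest of the cluster), scaling up a fetch map task might combine more Hadoop memory with more fetcher threads; all values below are assumptions, not recommendations:

```xml
<!-- Illustrative values only. -->
<!-- Hadoop side (mapred-site.xml or per-job -D): more memory for the map task -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3g</value>
</property>
<!-- Nutch side (nutch-site.xml): more fetcher threads inside that map task -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>200</value>
</property>
```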
Thank you for your response. I am currently looking to deploy Nutch-cc on a large scale. Could you suggest some recommended configurations? For example, how many CPU cores and how much memory should be allocated to each map task and each reduce task? Additionally, during the fetch phase, what would be an appropriate number of concurrent download threads to set?
When I use Generator2 to generate fetch requests to download web pages and set the number of threads to 1, a timeout is triggered (long timeout = conf.getInt("mapreduce.task.timeout", 10 * 60 * 1000)), causing the download task to terminate.