commoncrawl / nutch

Common Crawl fork of Apache Nutch
Apache License 2.0

Using Generator2 leads to fetcher failure #32

Open whjshj opened 1 week ago

whjshj commented 1 week ago

When I use Generator2 to generate the fetch list and then run the fetcher to download the webpages, with the number of threads set to 1, the task timeout is triggered (long timeout = conf.getInt("mapreduce.task.timeout", 10 * 60 * 1000)), causing the download task to terminate.
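For reference, a minimal nutch-site.xml sketch of the two settings involved here; mapreduce.task.timeout and fetcher.threads.fetch are standard Hadoop/Nutch property names, but the values below are placeholders only:

```xml
<!-- Sketch only: placeholder values, adjust to your cluster and crawl. -->
<property>
  <name>mapreduce.task.timeout</name>
  <!-- Hadoop task timeout in milliseconds; the default is 600000 (10 minutes) -->
  <value>1800000</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <!-- number of fetcher threads per map task -->
  <value>1</value>
</property>
```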

sebastian-nagel commented 1 week ago

Hi @whjshj, could you share your configuration (at least, all custom-set properties related to Generator2, Generator, URLPartitioner, and Fetcher)? Also, sharing the log files (job client stdout and hadoop.log or task logs) would help to debug the issue, thanks!

Two comments so far:

whjshj commented 1 week ago

Hello @sebastian-nagel, I'm using the settings from https://github.com/commoncrawl/cc-nutch-example; I only changed the number of threads to 1. In the initial fetching phase it runs normally. However, after some time only one thread is alive, and it is just waiting even though there is still data in the queue. The data isn't being selected because it exceeds the maximum number of threads, and the only active thread isn't processing it. This phenomenon is quite strange. Then a timeout is triggered, which is the map task timeout you mentioned. Have you encountered this situation before? [Uploading hadoop.log…]()

sebastian-nagel commented 1 week ago

only one thread alive, but it's just waiting, even though there is still data in the queue

Then the queue is blocked because the host of this queue responded (repeatedly) with an HTTP status code indicating a server error. This is quite common for wider web crawls, but it shouldn't happen if you crawl your intranet or your own server.
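A minimal sketch of how a queue blocked by repeated errors can be purged, assuming the stock Nutch property fetcher.max.exceptions.per.queue (the value below is a placeholder; the default of -1 deactivates the limit):

```xml
<!-- Sketch only: purge a host queue after N protocol-level exceptions. -->
<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>25</value>
</property>
```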

There are two options to ensure that the fetcher is fetching:

If either the time limit or the throughput threshold is hit, the current fetch cycle is stopped and the output is written to disk/HDFS. The script will then continue.
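A minimal sketch of these two options, assuming the stock Nutch properties fetcher.timelimit.mins and fetcher.throughput.threshold.pages (both deactivated by default; the values below are placeholders):

```xml
<!-- Sketch only: stop the fetch cycle after a wall-clock limit or when throughput collapses. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <!-- placeholder: stop the fetch cycle after 120 minutes -->
  <value>120</value>
</property>
<property>
  <name>fetcher.throughput.threshold.pages</name>
  <!-- placeholder: stop fetching if throughput drops below 1 page per second -->
  <value>1</value>
</property>
```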

In order to figure out the reason for the slow fetching, I need the log file.

whjshj commented 4 days ago

hadoop.log [Uploading hadoop.log…]() I have set the two parameters you mentioned, but I feel that they are not the cause. Even if a server error occurs and the HTTP request times out, the fetcher should move on to download the next webpage instead of stalling as shown in the log. In my previous reply I already uploaded the log, can you see it? I will upload the log again when I get back.

sebastian-nagel commented 4 days ago

Hi @whjshj, according to the hadoop.log, the fetch job fails in the reduce phase when writing the WARC files. The native libraries for the language detector are not installed:

whjshj commented 3 days ago

Hello @sebastian-nagel, I have identified the cause of the error when writing the WARC file, and I've already resolved the issue. Please take a look at the section before the WARC file is written; I have attached a screenshot. Can you see it?

Screenshot 2024-11-19 10 14 45

sebastian-nagel commented 3 days ago

I have identified the cause of the error when writing the WARC file, and I've already resolved the issue.

Great!

Ok, I see:

whjshj commented 2 days ago

"Thank you for your response; my confusion has been resolved. May I ask about the current situation of using Nutch-cc to crawl web pages? For example, in an iterative download, if a total of 1000 web pages need to be downloaded, how many of them are successfully downloaded in the end?"

sebastian-nagel commented 2 days ago

This totally depends on the fetch list:

whjshj commented 2 days ago

Thank you very much for your response. For the map tasks in the fetch stage, i.e. the actual downloading, do you have any recommendations for the map-related configuration? Currently I have allocated 1 core and 2 GB of memory to each map task. Is this configuration reasonable?

sebastian-nagel commented 2 days ago

I have set each map task to 1 core and 2GB of memory.

Yes, possible, under the assumption that

If you want to scale up, it's more efficient to parallelize first using threads (up to several hundred threads). Of course, more threads mean higher memory requirements to buffer the incoming data. Scaling up also requires adjusting many more parameters to your setup and context: connection pools, timeouts, etc.
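As a rough sketch of what scaling up might look like in configuration terms (standard Hadoop/Nutch property names, purely illustrative values that would need tuning per cluster and crawl):

```xml
<!-- Sketch only: illustrative values for a larger fetcher task. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <!-- more threads need more memory to buffer incoming data -->
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3g</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <!-- several hundred threads per fetcher map task -->
  <value>200</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <!-- keep per-host parallelism at 1 unless you crawl your own servers -->
  <value>1</value>
</property>
```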

whjshj commented 1 day ago

Thank you for your response. I am currently looking to deploy Nutch-cc on a large scale. Could you suggest some recommended configurations? For example, how many CPU cores and how much memory should be allocated to each map task and each reduce task? Additionally, during the fetch phase, what would be an appropriate number of concurrent download threads to set?