Closed by sebastian-nagel 7 years ago
Found the reason in record 91421 (see modified last line in snippet):
WARC/1.0
WARC-Type: response
WARC-Date: 2016-07-31T00:21:52Z
WARC-Record-ID: <urn:uuid:1f3b4b65-48e1-4eb9-97ab-4da22b693739>
Content-Length: 1044320
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:8430233b-6103-48ce-ac09-6696193bbfae>
WARC-Concurrent-To: <urn:uuid:0baacfbc-c48d-47cb-bfc3-824c1021a2f7>
WARC-IP-Address: 194.169.239.69
WARC-Target-URI: http://www.pueblosecreto.com/Net/profile/view_profile.aspx?MemberId=19356
WARC-Payload-Digest: sha1:6VPQDT5R3V2QOU4LCJUF6IGNSP43HPQU
WARC-Block-Digest: sha1:ULV74MAKKJVE2GSFLQVT6RJGIZDNHU26
WARC-Truncated: length
HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-Powered-By: UrlRewriter.NET 2.0.0
Set-Cookie: ASP.NET_SessionId=0zwqewxrnzz1smnj54tbhi42; path=/; HttpOnly
Date: Sun, 31 Jul 2016 00:21:51 GMT
Connection: close
Content-Length: 1076247
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html id="ctl00_masterHtmlRoot" xmlns="http://www.w3.org/1999/xhtml">
...
<style type="text/css">
body {
background-image: url('''... (524288 apostrophes in total!) ...'''http://i261.photobucket.com/albums/ii67/iglup/l10/1280/foto_74967ac2.jpg'''... (515996 apostrophes in total!) ...'''
There are over 500,000 apostrophes both before and after the URL! The document is clipped because of the maximum document size configured for the crawler (1 MB), and the clipping also cuts off the closing parenthesis. With no closing parenthesis to match, the regular expression used to extract URLs from CSS is forced into heavy backtracking.
Fixed by allowing only one of ", ', \" or \' in front of or after the URL.
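The idea behind the fix can be sketched as follows. This is only an illustration and not the pattern from the actual commit, which is not shown in this issue: the class name CssUrlSketch, the method extractUrls, the exact regular expression and the example.com URLs are all assumptions. The sketch accepts at most one quote token on each side of the URL and uses a possessive quantifier for the URL itself, so a long run of apostrophes makes the match fail immediately instead of being retried in many different ways.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the code from the actual fix.
final class CssUrlSketch {

    // At most one quote token on each side: ", ', \" or \' (or none at all).
    private static final String QUOTE = "(?:\\\\?[\"'])?";

    // url( <optional quote> URL <optional quote> )
    // The URL may not contain quotes, backslashes, parentheses or whitespace.
    // The possessive quantifier (++) never gives back characters it consumed.
    private static final Pattern CSS_URL = Pattern.compile(
            "url\\(\\s*" + QUOTE + "([^\"'\\\\()\\s]++)" + QUOTE + "\\s*\\)",
            Pattern.CASE_INSENSITIVE);

    // Returns every URL found in a CSS fragment, in order of appearance.
    static List<String> extractUrls(CharSequence css) {
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "body { background-image: url('http://example.com/a.jpg'); }";
        System.out.println(extractUrls(css)); // prints [http://example.com/a.jpg]
    }
}
```

The trade-off in this sketch is that a URL wrapped in more than one quote character, such as the one in record 91421, is no longer extracted at all. For a clipped document with hundreds of thousands of apostrophes that seems preferable to a task that hangs for hours.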
One WARC file (s3://commoncrawl/crawl-data/CC-MAIN-2016-30/segments/1469258944256.88/warc/CC-MAIN-20160723072904-00218-ip-10-185-27-174.ec2.internal.warc.gz) of the July crawl causes the WEATGenerator to hang for hours. This happens when processing record 91422 (91421 records already processed according to the job counters). Attaching to the task JVM several times within 2 hours shows the following stack (only the calls inside java.util.regex vary):