commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

WEATGenerator hanging while matching cssUrlPattern #2

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

One WARC file (s3://commoncrawl/crawl-data/CC-MAIN-2016-30/segments/1469258944256.88/warc/CC-MAIN-20160723072904-00218-ip-10-185-27-174.ec2.internal.warc.gz) of the July crawl causes the WEATGenerator to hang for hours. The hang occurs while processing record 91422 (according to the job counters, 91421 records had already been processed):

...
2016-08-04 16:47:09,085 INFO [main] org.archive.hadoop.jobs.WEATGenerator: Start: s3a://commoncrawl/crawl-data/CC-MAIN-2016-30/segments/1469258944256.88/warc/CC-MAIN-20160723072904-00218-ip-10-185-27-174.ec2.internal.warc.gz
2016-08-04 16:47:09,086 INFO [main] org.archive.hadoop.jobs.WEATGenerator: About to write out to s3a://commoncrawl/crawl-data/CC-MAIN-2016-30/segments/1469258944256.88/wat/CC-MAIN-20160723072904-00218-ip-10-185-27-174.ec2.internal.warc.wat.gz and s3a://commoncrawl/crawl-data/CC-MAIN-2016-30/segments/1469258944256.88/wet/CC-MAIN-20160723072904-00218-ip-10-185-27-174.ec2.internal.warc.wet.gz
...
2016-08-04 16:55:10,741 INFO [main] org.archive.hadoop.jobs.WEATGenerator: Outputting new record 91000

Attaching to the task JVM several times over a span of 2 hours always shows the following stack (only the calls inside java.util.regex vary):

  at java.util.regex.Matcher.find(Matcher.java:592)
  at org.archive.resource.html.ExtractingParseObserver.patternCSSExtract(ExtractingParseObserver.java:417)
  at org.archive.resource.html.ExtractingParseObserver.handleStyleNode(ExtractingParseObserver.java:201)

sebastian-nagel commented 7 years ago

Found the cause in record 91421 (the last line of the snippet has been abbreviated):

WARC/1.0
WARC-Type: response
WARC-Date: 2016-07-31T00:21:52Z
WARC-Record-ID: <urn:uuid:1f3b4b65-48e1-4eb9-97ab-4da22b693739>
Content-Length: 1044320
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:8430233b-6103-48ce-ac09-6696193bbfae>
WARC-Concurrent-To: <urn:uuid:0baacfbc-c48d-47cb-bfc3-824c1021a2f7>
WARC-IP-Address: 194.169.239.69
WARC-Target-URI: http://www.pueblosecreto.com/Net/profile/view_profile.aspx?MemberId=19356
WARC-Payload-Digest: sha1:6VPQDT5R3V2QOU4LCJUF6IGNSP43HPQU
WARC-Block-Digest: sha1:ULV74MAKKJVE2GSFLQVT6RJGIZDNHU26
WARC-Truncated: length

HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-Powered-By: UrlRewriter.NET 2.0.0
Set-Cookie: ASP.NET_SessionId=0zwqewxrnzz1smnj54tbhi42; path=/; HttpOnly
Date: Sun, 31 Jul 2016 00:21:51 GMT
Connection: close
Content-Length: 1076247

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html id="ctl00_masterHtmlRoot" xmlns="http://www.w3.org/1999/xhtml">

...

<style type="text/css">

   body {
     background-image: url('''... (524288 apostrophes in total!) ...'''http://i261.photobucket.com/albums/ii67/iglup/l10/1280/foto_74967ac2.jpg'''... (515996 apostrophes in total!) ...'''

There are over 500,000 apostrophes both before and after the URL! The document was truncated at the maximum document size configured for the crawler (1 MB), and the truncation also cut off the closing parenthesis. With no closing parenthesis to match, the regular expression used to extract URLs from CSS is forced to backtrack heavily.
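The backtracking described above can be reproduced at small scale. The following is only a sketch: the pattern `UNBOUNDED` is an assumption (a hypothetical expression that accepts any number of quote or backslash characters on both sides of the URL, not the one copied from ExtractingParseObserver), and `CountingSequence`, `truncatedCss` and `readsFor` are illustrative helpers. It counts how many characters the regex engine reads on a truncated rule and shows that the work grows far faster than the input length.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CssBacktrackDemo {
    // Hypothetical pattern (an assumption, not the library's source):
    // any number of quotes/backslashes before and after a lazily matched URL.
    static final Pattern UNBOUNDED =
        Pattern.compile("url\\s*\\(\\s*([\\\\\"']*.+?[\\\\\"']*)\\s*\\)");

    // Wraps a String and counts charAt() calls, a rough proxy for the
    // amount of work the regex engine performs on the input.
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;
        CountingSequence(String text) { this.text = text; }
        @Override public int length() { return text.length(); }
        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) {
            return text.subSequence(start, end);
        }
        @Override public String toString() { return text; }
    }

    // Builds a truncated CSS rule: n apostrophes on each side of the URL
    // and no closing parenthesis, mimicking the clipped record.
    static String truncatedCss(int n) {
        char[] q = new char[n];
        Arrays.fill(q, '\'');
        String quotes = new String(q);
        return "body { background-image: url(" + quotes
            + "http://example.com/a.jpg" + quotes;
    }

    // Runs one find() on the truncated rule and returns the number of
    // characters the engine read before giving up.
    static long readsFor(int n) {
        CountingSequence input = new CountingSequence(truncatedCss(n));
        if (UNBOUNDED.matcher(input).find()) {
            throw new IllegalStateException("unexpected match");
        }
        return input.reads;
    }

    public static void main(String[] args) {
        // Doubling the number of apostrophes multiplies the reads by much
        // more than two; keep n small, the real record would run for hours.
        for (int n = 50; n <= 400; n *= 2) {
            System.out.println(n + " apostrophes per side: "
                + readsFor(n) + " character reads");
        }
    }
}
```

Under this assumed pattern, both quote classes and the lazy `.+?` can each absorb apostrophes, so the engine tries every way of dividing them before concluding that no `)` follows.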

sebastian-nagel commented 7 years ago

Fixed by allowing only one of ", ', \" or \' in front of or after the URL.
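A minimal sketch of what such a bounded pattern could look like follows. The expression `BOUNDED` and the helpers `extractUrls` and `truncatedCss` are assumptions made for illustration, not the code that was committed: the pattern permits at most one quote, optionally preceded by a backslash, on each side of the URL, so there is only a constant number of ways to split the quotes and a truncated rule fails quickly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssUrlBoundedQuotes {
    // Hypothetical pattern (an assumption, not the committed fix):
    // optional single ", ', \" or \' before and after the URL.
    static final Pattern BOUNDED =
        Pattern.compile("url\\s*\\(\\s*(?:\\\\?[\"'])?(.+?)(?:\\\\?[\"'])?\\s*\\)");

    // Returns the content of every url(...) found in the CSS text,
    // without the single surrounding quote (if any).
    static List<String> extractUrls(CharSequence css) {
        List<String> urls = new ArrayList<>();
        Matcher m = BOUNDED.matcher(css);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    // Builds a truncated rule like the one in the record: n apostrophes
    // on each side of the URL and no closing parenthesis.
    static String truncatedCss(int n) {
        char[] q = new char[n];
        Arrays.fill(q, '\'');
        String quotes = new String(q);
        return "body { background-image: url(" + quotes
            + "http://example.com/a.jpg" + quotes;
    }

    public static void main(String[] args) {
        System.out.println(extractUrls("a { background: url('http://example.com/a.jpg'); }"));
        System.out.println(extractUrls("a { background: url(b.png) }"));
        // The truncated rule yields no URL and returns promptly.
        System.out.println(extractUrls(truncatedCss(500000)));
    }
}
```

In this sketch the capture group sits inside the optional quotes, so callers get the bare URL. If a rule does contain several quotes and a closing parenthesis, the extra quotes simply end up inside the captured text.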