commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

StringIndexOutOfBoundsException during WAT/WET generation #1

Closed sebastian-nagel closed 8 years ago

sebastian-nagel commented 8 years ago

The WEATGenerator chokes on some WARC files and fails with a StringIndexOutOfBoundsException thrown by ExtractingParseObserver.

...
16/07/04 08:18:53 INFO jobs.WEATGenerator: Add input path: s3a://commoncrawl/crawl-data/CC-MAIN-2016-26/segments/1466783392527.68/warc/CC-MAIN-20160624154952-00042-ip-10-164-35-72.ec2.internal.warc.gz
...
16/07/04 08:18:58 INFO mapreduce.Job: Running job: job_1466588320333_0319
...
16/07/04 08:30:42 INFO mapreduce.Job: Task Id : attempt_1466588320333_0319_m_000000_0, Status : FAILED
Error: java.io.IOException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:126)
    at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:48)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1911)
    at org.archive.resource.html.ExtractingParseObserver.patternCSSExtract(ExtractingParseObserver.java:447)
    at org.archive.resource.html.ExtractingParseObserver.handleStyleNode(ExtractingParseObserver.java:201)
    at org.archive.format.text.html.LexParser.doParse(LexParser.java:36)
    at org.archive.format.text.html.LexParser.doParse(LexParser.java:18)
    at org.archive.resource.html.HTMLResourceFactory.getResource(HTMLResourceFactory.java:31)
    at org.archive.extract.ExtractingResourceProducer.getNext(ExtractingResourceProducer.java:54)
    at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:108)
    ... 9 more

sebastian-nagel commented 8 years ago

For this WARC file, the problem is caused by the following CSS snippet:

#services .avia-logo-element-container img {
    filter: url(\"");
    filter: none;
    -webkit-filter: none;
}

The length check in the method patternCSSExtract is insufficient: 4 characters are removed (2 from each end), so the URL must be at least 4 characters long, but the check only skips a URL of length exactly 2:

  } else if (url.charAt(0) == '\\') {
     if(url.length() == 2)
       continue;
     url = url.substring(2, origUrlLength - 2);
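A possible guard could look like the sketch below. It is only an illustration, not the fix that was committed: the class CssUrlTrim and the method stripEscapedQuotes are hypothetical names that isolate the branch quoted above, and returning null stands in for the `continue` in the original loop. It also assumes the matched text for the CSS above is the three-character string \"" (backslash, quote, quote), which would make the original code call substring(2, 1) and is consistent with the "-1" in the exception message.

```java
// Hypothetical helper, not part of ia-web-commons: it isolates the
// backslash branch of patternCSSExtract and adds a stricter length guard.
final class CssUrlTrim {
    private CssUrlTrim() {}

    /**
     * Removes two characters from each end of a match such as \"a.png\".
     * Returns null when the match starts with a backslash but is shorter
     * than four characters; the caller would skip such a match.
     * Input that does not start with a backslash is returned unchanged.
     */
    static String stripEscapedQuotes(String url) {
        int origUrlLength = url.length();
        if (origUrlLength == 0 || url.charAt(0) != '\\') {
            return url; // not the escaped-quote case
        }
        if (origUrlLength < 4) {
            // Covers lengths 1, 2 and 3. The original check (length == 2)
            // let length 3 through, where substring(2, 1) throws.
            return null;
        }
        return url.substring(2, origUrlLength - 2);
    }
}
```

With this guard, the input \"" yields null instead of an exception, while \"a.png\" still yields a.png. Whether short matches should be skipped or handled differently is a design decision for the actual patch.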