commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

Java stack overflow while matching cssUrlPattern #12

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

The ExtractingParseObserver causes a stack overflow while matching a URL embedded in CSS as url(...) when the URL is surrounded by 8192 single quotes on each side. See wat_wet_stack_overflow_test.warc.gz for the problematic WARC record.

java.lang.StackOverflowError
        at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3798)
        at java.util.regex.Pattern$Ques.match(Pattern.java:4182)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4785)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4717)
        at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3798)
        at java.util.regex.Pattern$Ques.match(Pattern.java:4182)
... (16000 lines stripped)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
        at java.util.regex.Pattern$Start.match(Pattern.java:3461)
        at java.util.regex.Matcher.search(Matcher.java:1248)
        at java.util.regex.Matcher.find(Matcher.java:637)
        at java.util.regex.Matcher.replaceAll(Matcher.java:951)
        at org.archive.resource.html.ExtractingParseObserver.patternCSSExtract(ExtractingParseObserver.java:485)
        at org.archive.resource.html.ExtractingParseObserver.handleStyleNode(ExtractingParseObserver.java:233)

sebastian-nagel commented 7 years ago

Related to #2, which was also caused by an overlong sequence of quotes.
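
One possible way to make the extraction tolerate overlong quote runs is sketched below. It is only an illustration, not the fix applied in this repository. The class `CssUrlExtractor`, the method `extractUrls`, and the pattern itself are hypothetical; the library's actual cssUrlPattern is not reproduced here. The sketch assumes the recursion in the trace (Ques, GroupHead, Loop, GroupTail) comes from a quantified group with optional content, and replaces that shape with possessive quantifiers on plain character classes, which do not backtrack. It is deliberately simplified: URLs containing whitespace, parentheses, quotes, or backslashes are not matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssUrlExtractor {
    // Hypothetical pattern, NOT the library's cssUrlPattern.
    // [\\'"]*+  skips any run of quotes/backslashes possessively (no backtracking).
    // Group 1   captures the URL: one or more characters that are not a quote,
    //           backslash, parenthesis, or whitespace.
    private static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*+[\\\\'\"]*+([^\\\\'\"()\\s]++)[\\\\'\"]*+\\s*+\\)",
        Pattern.CASE_INSENSITIVE);

    // Returns every URL found in url(...) constructs, with surrounding quotes removed.
    static List<String> extractUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

With this pattern the nesting depth of the match should not depend on the number of quotes, so an input like the one in the test WARC record (8192 quotes on each side) would be expected to yield the bare URL instead of exhausting the stack. Capping the length of the text handed to the matcher would be a complementary safeguard.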