apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

[bug]: StackOverflowError when http.content.limit is high #644

Closed jcruzmartini closed 5 years ago

jcruzmartini commented 5 years ago

When crawling with high values for the http.content.limit configuration property, we are getting this exception:

```
java.lang.StackOverflowError
	at org.jsoup.helper.StringUtil.stringBuilder(StringUtil.java:235)
	at org.jsoup.helper.StringUtil.normaliseWhitespace(StringUtil.java:143)
	at org.jsoup.nodes.TextNode.normaliseWhitespace(TextNode.java:135)
	at org.jsoup.nodes.TextNode.text(TextNode.java:46)
	at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:142)
	at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136)
	... (the createDOM frame at JSoupDOMBuilder.java:136 repeats until the stack is exhausted)
```

We tried setting http.content.limit to -1 and to 10MB and got the same issue; with the default value of 65k the exception disappears.
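For reference, this limit is set in the crawler's YAML configuration. A sketch of the relevant excerpt (values illustrative; 65536 bytes is the default mentioned above):

```yaml
# crawler-conf.yaml (excerpt)
# keep only the first 64KB of each fetched response (the default)
http.content.limit: 65536
# -1 disables the limit entirely, so the full document is parsed
# http.content.limit: -1
```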

The issue is reproducible using these URLs:

jnioche commented 5 years ago

thanks @jcruzmartini, this is not a bug as such: when fetching the whole documents (209KB and 350KB), the conversion from the JSoup document to DocumentFragments involves deep recursion and exceeds the stack limit. I managed to parse the two URLs you gave by setting -Xss10M as a VM argument.
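The failure mode is easy to reproduce in isolation: every recursive call adds a frame to the thread's stack, and the default JVM stack size only holds a limited number of frames before a StackOverflowError is thrown. A minimal, self-contained sketch (not StormCrawler code) that shows the error and why -Xss changes the outcome:

```java
// Demonstrates how deep recursion exhausts the thread stack.
// Run with a larger -Xss (e.g. -Xss10M) and the counted depth grows.
public class StackDepthDemo {
    static int depth = 0;

    // Each call consumes one stack frame until the stack is exhausted.
    static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // The JVM throws StackOverflowError rather than crashing the process.
            System.out.println("Overflowed after " + depth + " frames");
        }
    }
}
```

The same mechanism applies to createDOM: one nesting level of HTML costs (at least) one frame, so sufficiently deep or large documents blow the default stack.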

One alternative would be to avoid the conversion to DocumentFragment altogether and simply send a Document object; both extend org.w3c.dom.Node, so I could change the method signature in ParseFilter to use that instead.
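Independently of the signature change, another way to sidestep the stack limit would be to walk the tree iteratively with an explicit stack instead of recursing, since heap-backed stacks are not bounded by -Xss. A hypothetical sketch using only the JDK's built-in DOM parser (countNodes is an illustrative stand-in for the real conversion logic, not the StormCrawler implementation):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class IterativeTraversal {

    // Visits every node without recursion: depth is limited by heap, not -Xss.
    public static int countNodes(Node root) {
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        int count = 0;
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            count++;
            // Push children in reverse so they are popped in document order.
            for (Node c = n.getLastChild(); c != null; c = c.getPreviousSibling()) {
                stack.push(c);
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body><p>hello</p><p>world</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // html, body, 2 x p, 2 x text node
        System.out.println(countNodes(doc.getDocumentElement()));
    }
}
```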

Not sure what the advantages of using a DocumentFragment are but if the conversion is costly then maybe there isn't much point in doing it.

jcruzmartini commented 5 years ago

@jnioche many thanks for the support, this information is really useful. I will try setting a higher -Xss value on my VM and will let you know how it goes. 10M is probably too much, bearing in mind that we have many threads running at the same time...
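The thread-count concern is real: -Xss applies to every thread, so 10MB stacks across, say, 100 worker threads reserve on the order of 1GB of stack address space. One middle ground, sketched below as an assumption rather than anything StormCrawler does, is the java.lang.Thread constructor that accepts a stackSize hint, so only the stack-hungry work gets the large stack:

```java
// Hypothetical sketch: run stack-hungry parsing on a single dedicated thread
// with a larger stack, instead of raising -Xss for the whole JVM.
public class BigStackWorker {
    public static void main(String[] args) throws InterruptedException {
        long tenMb = 10L * 1024 * 1024;
        // Thread(ThreadGroup, Runnable, name, stackSize): stackSize is a
        // platform-dependent hint, and some JVMs may ignore it.
        Thread parser = new Thread(null, () -> {
            // deep recursion (e.g. DOM conversion) would run here
            System.out.println("parsing on " + Thread.currentThread().getName());
        }, "parser-big-stack", tenMb);
        parser.start();
        parser.join();
    }
}
```

Note the stackSize argument is explicitly documented as a hint that some platforms ignore, so this needs verifying on the target JVM.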

jnioche commented 5 years ago

@jcruzmartini I've opened a PR which should fix this, see #653. Would you mind giving it a try?