Closed jcruzmartini closed 5 years ago
thanks @jcruzmartini, this is not a bug as such, it's just that when fetching the whole documents (209 and 350Kb) the conversion from the JSoup document to DocumentFragments involves a lot of recursion and crashes the stack limit. I managed to parse the 2 urls you gave by setting -Xss10M as VM arguments.
One alternative would be to avoid the conversion to DocumentFragment altogether and simply send a Document object, both extend org.w3c.dom.Document so I could change the method signature in ParseFilter to use that instead.
Not sure what the advantages of using a DocumentFragment are but if the conversion is costly then maybe there isn't much point in doing it.
@jnioche many thanks for the support, this information is really useful , I will try to set a higher Xss value in my VM , will let you know how it goes. Probably 10M is too much, having in mind that we have too many threads running at same time...
@jcruzmartini I've opened a PR which should fix this, see #653 Would you mind giving it a try?
When crawling using high values in http.content.limit configuration property, we are getting this exception:
java.lang.StackOverflowError at org.jsoup.helper.StringUtil.stringBuilder(StringUtil.java:235) at org.jsoup.helper.StringUtil.normaliseWhitespace(StringUtil.java:143) at org.jsoup.nodes.TextNode.normaliseWhitespace(TextNode.java:135) at org.jsoup.nodes.TextNode.text(TextNode.java:46) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:142) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(JSoupDOMBuilder.java:136) at com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder.createDOM(
We tried
http.content.limit
in -1 and 10MB and we are getting same issue, also if we set in 65k (default value) exception disappear.Issue is reproducible using these urls