Letractively / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

HtmlAgilityPack throws StackOverflowException on pages with lots of nested tags #77

Closed (GoogleCodeExporter closed this issue 9 years ago)

GoogleCodeExporter commented 9 years ago
HtmlAgilityPack throws a StackOverflowException on pages with lots of nested 
tags. This occurs during HtmlDocument.LoadHtml(string). Attached are 2 HTML 
files that each throw a StackOverflowException when their content is loaded.

Original issue reported on code.google.com by sjdir...@gmail.com on 8 Mar 2013 at 7:47

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r281.

Original comment by sjdir...@gmail.com on 8 Mar 2013 at 8:00

GoogleCodeExporter commented 9 years ago
Patched HtmlAgilityPack to fix this issue. Added 
HtmlDocument.OptionMaxNestedChildNodes, which can be set to prevent 
StackOverflowExceptions caused by deeply nested tags. When the limit is 
exceeded, it throws an ApplicationException with the message "Document has 
more than X nested tags. This is likely due to the page not closing tags 
properly."

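Usage...
HtmlDocument hapDoc = new HtmlDocument();
// Cap the nesting depth so the parser throws instead of overflowing the stack
hapDoc.OptionMaxNestedChildNodes = 5000;
try
{
  hapDoc.LoadHtml(RawContent);
}
catch (Exception)
{
  // Parsing failed (e.g. nesting limit exceeded); fall back to an empty document
  hapDoc.LoadHtml("");
}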

Attached new HtmlAgilityPack.dll assembly. Will submit this patch to the 
HtmlAgilityPack project site.

Original comment by sjdir...@gmail.com on 8 Mar 2013 at 8:07

GoogleCodeExporter commented 9 years ago
Added all source and binaries to the HAP project site:

http://www.codeplex.com/site/users/view/sjdirect

Original comment by sjdir...@gmail.com on 8 Mar 2013 at 8:52

GoogleCodeExporter commented 9 years ago
Attached the full patch zip submitted to the HAP project.

Original comment by sjdir...@gmail.com on 8 Mar 2013 at 9:29

GoogleCodeExporter commented 9 years ago
Hello, although I am using your HtmlAgilityPack.dll as recommended, I am 
still getting the StackOverflowException. Please check the screenshot.

I hope you can help with this.

Thanks in advance.

Original comment by fastoka...@gmail.com on 3 Sep 2013 at 11:00

GoogleCodeExporter commented 9 years ago
Hi, can you narrow it down to a single page/URL? HAP uses many stacks in its 
implementation. I only fixed the one related to nested HTML tags; it is likely 
that other conditions can still cause stack overflows.
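
One general .NET workaround for recursion-heavy code, not something taken from the patch in this thread, is to run the parse on a thread created with a larger maximum stack size. Since .NET Framework 2.0 a StackOverflowException cannot be caught by a try/catch block and normally terminates the process, so extra stack space is sometimes the only in-process option. The sketch below is only an illustration under those assumptions: LargeStackRunner, Run, and the stack size you pass are hypothetical names and values, not part of HAP or Abot.

```csharp
using System;
using System.Threading;

// Hypothetical helper: runs a delegate on a dedicated thread whose
// maximum stack size is chosen by the caller.
static class LargeStackRunner
{
    public static T Run<T>(Func<T> work, int stackSizeBytes)
    {
        T result = default(T);
        Exception error = null;

        // Thread(ThreadStart, int maxStackSize) sets the stack size for this thread only.
        var thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception e) { error = e; } // ordinary exceptions only; a stack overflow still kills the process
        }, stackSizeBytes);

        thread.Start();
        thread.Join(); // block until the work has finished

        if (error != null)
            throw new InvalidOperationException("Work failed on the large-stack thread.", error);
        return result;
    }
}
```

For example, LargeStackRunner.Run(() => { var d = new HtmlDocument(); d.LoadHtml(html); return d; }, 16 * 1024 * 1024) would parse on a 16 MB stack. This only raises the ceiling; sufficiently deep input can still overflow.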

Original comment by sjdir...@gmail.com on 3 Sep 2013 at 3:48

GoogleCodeExporter commented 9 years ago
Hello,
I was running the crawler against the following site: 
http://www.gesetze-im-internet.de/aktuell.html to fetch the XML files within 
it. It has over 200,000 pages with nested HTML tags.
I suspect it is related to the Visual Studio stack. I will test this today; 
I just wanted to let you know :)

Original comment by fastoka...@gmail.com on 4 Sep 2013 at 6:43

GoogleCodeExporter commented 9 years ago
It turns out that setting HtmlDocument.OptionFixNestedTags = true solves this 
issue without needing the patched version.
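
A minimal sketch of that setting, assuming the stock HtmlAgilityPack package is referenced. The HtmlLoader class and its Load method are hypothetical names used only for illustration, and whether the option avoids the overflow on a particular page should be checked against that page.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of Abot or HtmlAgilityPack.
static class HtmlLoader
{
    public static HtmlDocument Load(string html)
    {
        var doc = new HtmlDocument();
        // Set the option before LoadHtml so it applies while the markup is parsed.
        doc.OptionFixNestedTags = true;
        // LoadHtml rejects null, so substitute an empty document.
        doc.LoadHtml(html ?? string.Empty);
        return doc;
    }
}
```

Calling HtmlLoader.Load(RawContent) would then replace the try/catch fallback used with the patched assembly.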

Original comment by sjdir...@gmail.com on 10 Jul 2015 at 6:13