aschaeffer / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Precursory header tags missing #17

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use the HTMLHighlighter to extract the relevant HTML code from a page:
   // url is a java.net.URL pointing at the page from step 2
   final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
   final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
   System.out.println(hh.process(url, extractor));
2. Try to parse this page: http://www.golem.de/1102/81290.html

What is the expected output? What do you see instead?
This should be the output:
<H2>
Daniel Domscheit-Berg
</H2>
<H1>
Wikileaks-Aussteiger haben Unterlagen mitgenommen
</H1>
...

But I actually get this (the opening <H2> and <H1> tags are missing):
Daniel Domscheit-Berg
</H2>
Wikileaks-Aussteiger haben Unterlagen mitgenommen
</H1>
...
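
The symptom above (closing tags without their opening counterparts) can be detected mechanically. The following is only a rough sketch using the Java standard library; the class and method names (`TagBalanceCheck`, `findUnopenedClosingTags`) are made up for illustration and are not part of boilerpipe, and the regex handles only simple tags, not comments, scripts, or CDATA.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of boilerpipe: reports closing tags
// that appear without a preceding, still-open opening tag.
public class TagBalanceCheck {
    // Matches simple opening and closing tags such as <H2>, <p class="x"> or </H2>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Returns the names of closing tags that have no matching opening tag before them. */
    public static List<String> findUnopenedClosingTags(String html) {
        Deque<String> open = new ArrayDeque<>();
        List<String> orphans = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toUpperCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                open.push(name);              // remember the opening tag
            } else if (!open.remove(name)) {  // drop the most recent matching opener
                orphans.add(name);            // none found: the opener is missing
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        String expected = "<H2>\nDaniel Domscheit-Berg\n</H2>\n"
                + "<H1>\nWikileaks-Aussteiger haben Unterlagen mitgenommen\n</H1>";
        String actual = "Daniel Domscheit-Berg\n</H2>\n"
                + "Wikileaks-Aussteiger haben Unterlagen mitgenommen\n</H1>";
        System.out.println(findUnopenedClosingTags(expected)); // []
        System.out.println(findUnopenedClosingTags(actual));   // [H2, H1]
    }
}
```

Feeding the string returned by `hh.process(url, extractor)` into such a check would flag pages where the extracting highlighter drops opening tags.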

What version of the product are you using? On what operating system?
- boilerpipe 1.1.0 binary
- OS: SUSE

Is it possible to generate exactly the same output that the Web API produces?
Other tags, such as <TABLE> and <TD>, also seem to be missing.

Original issue reported on code.google.com by chuckyth...@googlemail.com on 10 Feb 2011 at 9:07

GoogleCodeExporter commented 9 years ago
Actually, this issue sometimes also affects other elements, such as "<p>".
Have a look at this page:
- http://www.n-tv.de/politik/Bewaehrungsstrafe-fuer-Tims-Vater-article2575771.html

I suspect that the detection of text blocks is buggy.

Original comment by chuckyth...@googlemail.com on 10 Feb 2011 at 10:51

GoogleCodeExporter commented 9 years ago
Thanks for reporting.

This bug has been fixed in boilerpipe 1.2, which will be released in the next 
few days.

Original comment by ckkohl79 on 10 Feb 2011 at 10:53

GoogleCodeExporter commented 9 years ago
Superb! I guess you are already using the fixed version for your Web API,
then. I'm looking forward to trying the new release!

Original comment by chuckyth...@googlemail.com on 10 Feb 2011 at 11:14