DefaultExtractor.INSTANCE.getText(html): Removes leading special character when it is coded in ASCII

adrianhust / boilerpipe

Automatically exported from code.google.com/p/boilerpipe


DefaultExtractor.INSTANCE.getText(html): Removes leading special character when it is coded in ASCII #1

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

DefaultExtractor.INSTANCE.getText(html):

When "html" contains a word whose leading special character is written in
ASCII as a numeric character reference, like "Überprüfung" -> &#220;berpr&#252;fung

getText() returns only berpr&#252;fung 
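
For comparison, the text one would expect for that word can be checked with plain Java, independent of boilerpipe. This is only a minimal sketch: the class name NumericEntityCheck and the decode helper are made up for illustration, it handles decimal references only, and it does not validate the code point range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of boilerpipe.
public class NumericEntityCheck {
    // Matches decimal numeric character references such as &#220;
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    /** Replaces each decimal reference with the character it denotes. */
    public static String decode(String html) {
        Matcher m = DECIMAL_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String ch = new String(Character.toChars(codePoint));
            // quoteReplacement guards against '$' and '\' in the replacement
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The leading reference must survive decoding, not be dropped.
        System.out.println(decode("&#220;berpr&#252;fung"));
    }
}
```

The decoded string starts with the character for code point 220, which is the one missing from the getText() output above.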

What version of the product are you using? On what operating system?
Version 1.0.2 on Linux

Original issue reported on code.google.com by ned...@googlemail.com on 4 Jan 2010 at 5:18

GoogleCodeExporter commented 8 years ago

I've noticed this too. When extracting an article, such as
http://news.bbc.co.uk/1/hi/world/asia-pacific/8475965.stm,
using the ArticleExtractor, sentences with leading double quotation marks
are missing the quotation marks in the output.

Thanks for the great work though - this is an incredibly useful library.

Original comment by tom%tomt...@gtempaccount.com on 24 Jan 2010 at 11:19

GoogleCodeExporter commented 8 years ago

Here is a sample html file, which demonstrates the bug.

Original comment by ned...@googlemail.com on 24 Jan 2010 at 1:14

Attachments:

sample.html

GoogleCodeExporter commented 8 years ago

Hi Tom,
hi nedunk,

thanks for reporting. I think I have found the bug. It was caused by an
optimization in the DefaultHTMLParser.

Please take the most recent version of that class from SVN and see if it works
for you.

nedunk: Your sample HTML file actually triggered another bug, which I have also
fixed. An internal buffer was not flushed when there was not a single
block-level HTML tag in the BODY.
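
To illustrate that second kind of bug, here is a rough sketch of a text collector that flushes its buffer at block boundaries. It is not the actual DefaultHTMLParser code; the class BlockCollector and its methods are hypothetical. The point is that the buffer also has to be flushed at the end of the document, otherwise text is lost when no block-level tag ever occurs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not boilerpipe's implementation.
public class BlockCollector {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> blocks = new ArrayList<>();

    /** Called for every run of character data. */
    public void text(String s) {
        buffer.append(s);
    }

    /** Called when a block-level tag starts or ends. */
    public void blockBoundary() {
        flush();
    }

    /** Called once at the end of the document. */
    public List<String> endDocument() {
        // Without this flush, a BODY with no block-level tag yields no blocks.
        flush();
        return blocks;
    }

    private void flush() {
        String t = buffer.toString().trim();
        if (!t.isEmpty()) {
            blocks.add(t);
        }
        buffer.setLength(0);
    }
}
```

With this shape, calling text(...) and then endDocument() without any blockBoundary() in between still returns the collected text as one block.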

The bugs are actually triggered for any extractor (i.e. even
KeepEverythingExtractor). Please note that DefaultExtractor will not output
any text for nedunk's sample file by definition (the file does not contain
any "long text" to extract).

Again, thanks for reporting!

Cheers,
Christian

Original comment by ckkohl79 on 24 Jan 2010 at 3:02

Added labels: OpSys-All

GoogleCodeExporter commented 8 years ago

Original comment by ckkohl79 on 24 Jan 2010 at 3:43

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

Original comment by ckkohl79 on 24 Jan 2010 at 4:11

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

Thanks, that seems to work for me.

Original comment by tom%tomt...@gtempaccount.com on 24 Jan 2010 at 4:43

GoogleCodeExporter commented 8 years ago

Original comment by ckkohl79 on 30 Jan 2010 at 9:07

Changed state: Verified