Closed: GoogleCodeExporter closed this issue 8 years ago
I've noticed this too. When extracting an article, such as
http://news.bbc.co.uk/1/hi/world/asia-pacific/8475965.stm,
using the ArticleExtractor, sentences with leading double quotation marks are
missing the quotation marks in the output.
Thanks for the great work though - this is an incredibly useful library.
Original comment by tom%tomt...@gtempaccount.com
on 24 Jan 2010 at 11:19
Here is a sample HTML file that demonstrates the bug.
Original comment by ned...@googlemail.com
on 24 Jan 2010 at 1:14
Hi Tom,
hi nedunk,
thanks for reporting. I think I have found the bug. It was caused by an
optimization in the DefaultHTMLParser.
Please take the most recent version of that class from SVN and see if it works
for you.
nedunk: Your sample HTML file actually triggered another bug, which I have also
fixed. An internal buffer was not flushed when the BODY did not contain a
single block-level HTML tag.
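For readers unfamiliar with this class of bug, here is a minimal sketch of how a missing end-of-document flush loses text. It is not the actual DefaultHTMLParser code: the class name BlockSegmenterSketch, the segment method, the flushAtEnd switch and the short list of block-level tags are all hypothetical, chosen only for illustration. Text is collected in a buffer and emitted as a block whenever a block-level tag is seen, so a BODY without any block-level tag never triggers a flush unless one is done explicitly at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockSegmenterSketch {
    // Hypothetical, deliberately short list of block-level tags.
    private static final Set<String> BLOCK_TAGS =
            new HashSet<>(Arrays.asList("p", "div", "h1", "h2", "li"));

    // Matches an opening or closing tag and captures its name.
    private static final Pattern TAG =
            Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    /**
     * Splits HTML into text blocks at block-level tags.
     * With flushAtEnd == false the trailing buffer is dropped,
     * which imitates the kind of bug described above.
     */
    public static List<String> segment(String html, boolean flushAtEnd) {
        List<String> blocks = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            // Collect the text between the previous tag and this one.
            buffer.append(html, pos, m.start());
            pos = m.end();
            if (BLOCK_TAGS.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                flush(buffer, blocks);
            }
        }
        // Text after the last tag.
        buffer.append(html, pos, html.length());
        if (flushAtEnd) {
            flush(buffer, blocks); // the step that must not be forgotten
        }
        return blocks;
    }

    // Emits the buffered text as one block (if non-empty) and clears the buffer.
    private static void flush(StringBuilder buffer, List<String> blocks) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            blocks.add(text);
        }
        buffer.setLength(0);
    }

    public static void main(String[] args) {
        String html = "<html><body>\"Quoted text\" with <b>no</b> block tags</body></html>";
        System.out.println(segment(html, false)); // text is lost
        System.out.println(segment(html, true));  // text is kept
    }
}
```

In this sketch, a document whose BODY holds only inline markup yields no blocks at all when the final flush is skipped, regardless of which filter would run afterwards.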
The bugs are actually triggered for any extractor (i.e. even
KeepEverythingExtractor). Please note that DefaultExtractor will not output
any text for nedunk's sample file by definition (the file does not contain any
"long text" to extract).
Again, thanks for reporting!
Cheers,
Christian
Original comment by ckkohl79
on 24 Jan 2010 at 3:02
Original comment by ckkohl79
on 24 Jan 2010 at 3:43
Original comment by ckkohl79
on 24 Jan 2010 at 4:11
Thanks, that seems to work for me.
Original comment by tom%tomt...@gtempaccount.com
on 24 Jan 2010 at 4:43
Original comment by ckkohl79
on 30 Jan 2010 at 9:07
Original issue reported on code.google.com by
ned...@googlemail.com
on 4 Jan 2010 at 5:18