dominiclovell / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Better support for non-English pages #16

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I'm looking for a way to parse non-English pages, which seem to give varying
results with boilerpipe. Here are a few examples where boilerpipe misses the
main portion of the text (tested with http://boilerpipe-web.appspot.com/ on
2011-01-06):

* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik - picks up some teasers instead
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - all sorts of content from around the article
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning - picks up the comment section

I also see minor artifacts from non-content sections throughout the extracted
text:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut" ("Print") is a link to print the article; "Bildmaterial" ("Image material") is a header from the sidebar
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - "Dela med andra" ("Share with others") is a header from the sidebar with sharing links
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto - misses the main header and teaser

I know it's hard to get all of the above URLs right without site-specific code,
but I also know it's possible. I've run all of them through readability.js, and
it parses every one without any artifacts. Maybe it's Readability's reliance on
class names (which are generally in English even on foreign-language sites)
that makes it cope better. The problem is that readability.js is a mess to run
server-side and has not undergone the rigorous testing boilerpipe has, so I
would much rather see boilerpipe succeed than switch to readability.js.
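
To make the class-name guess above concrete, here is a minimal sketch of that
kind of heuristic. It is not Readability's or boilerpipe's actual code: the
hint lists, the +25/-25 weights, the length bonus, and the names
`class_weight`, `BlockScorer` and `best_block` are all assumptions made up for
illustration. It only shows why English class/id names could help regardless
of the language of the text itself.

```python
import re
from html.parser import HTMLParser

# Hypothetical hint lists, chosen for illustration only.
POSITIVE = re.compile(r"article|body|content|entry|main|post|story|text", re.I)
NEGATIVE = re.compile(r"comment|footer|sidebar|share|teaser|promo|related|print", re.I)


def class_weight(attr_value):
    """Score a class/id string: +25 for a positive hint, -25 for a negative one."""
    weight = 0
    if POSITIVE.search(attr_value):
        weight += 25
    if NEGATIVE.search(attr_value):
        weight -= 25
    return weight


class BlockScorer(HTMLParser):
    """Collect text per <div> and score it by class/id weight plus text length.

    Text inside a nested <div> is credited to the innermost open <div> only.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # open divs, each as [weight, text_parts]
        self.blocks = []  # closed divs, each as (score, text)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            names = " ".join(v for k, v in attrs if k in ("class", "id") and v)
            self.stack.append([class_weight(names), []])

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            weight, parts = self.stack.pop()
            text = " ".join(parts)
            # One extra point per 100 characters, so long blocks still count.
            self.blocks.append((weight + len(text) // 100, text))


def best_block(html):
    """Return the text of the highest-scoring <div>, or "" if there is none."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    return max(scorer.blocks, key=lambda b: b[0])[1] if scorer.blocks else ""
```

With markup like `<div class="comments">`, `<div id="article-body">` and
`<div class="sidebar">`, the article block wins on its name alone, even when
the comment section contains more text. A purely text-based classifier has no
such signal to fall back on.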

Thanks for your hard work.

Original issue reported on code.google.com by EmilStenstrom on 6 Jan 2011 at 2:43

GoogleCodeExporter commented 9 years ago
Here's an update from 2011-12-08 on the above URLs, using the web version of
boilerpipe:

* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik - Misses the header altogether (dn.se has had a new design since then...)
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - picks up some teasers instead of main text.
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning - One teaser, and various text from popups

Minor artifacts:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut" is a link to print the article; "Bildmaterial" is a header from the sidebar; "Dela" ("Share") at the bottom is from the sharing feature
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - This one no longer has any artifacts, well done!
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto - Misses main header and teaser

I don't know what magic Readability uses, but all of the above URLs work
perfectly with Readability.

Original comment by EmilStenstrom on 8 Dec 2011 at 9:08

GoogleCodeExporter commented 9 years ago
http://www.anspress.com/index.php?a=2&cid=48&lng=az&nid=270848

Original comment by eyusi...@gmail.com on 13 May 2014 at 1:44