Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.

Extra score for items with articleBody itemprop (no matter if article or div) #25

Closed midudev closed 2 years ago

midudev commented 8 years ago

Currently, the initNode function in reader.js gives 20 points to the article tag. Some publications also use an itemprop that marks directly where the article body is, regardless of the tag it is used on.

<div itemprop="articleBody" class="article-body-text">
    <p>Article main body text... lorem ipsum...</p>
</div>

According to schema.org (https://schema.org/articleBody):

The actual body of the article. Usage: Over 1,000,000 domains
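
To illustrate the proposal, here is a minimal sketch of how such a bonus could be scored. It is not the actual initNode code from reader.js: the function name initialScore, the plain-object node shape ({ name, attribs }), and the 20-point value for articleBody are all assumptions made for the example. Only the 20 points for the article tag comes from the description above.

```javascript
// Hypothetical sketch, NOT the real initNode from reader.js.
// A node is assumed to be a plain object such as:
//   { name: 'div', attribs: { itemprop: 'articleBody' } }

const ARTICLE_TAG_SCORE = 20;  // from the issue: the article tag gets 20 points
const ARTICLE_BODY_SCORE = 20; // assumed bonus; the real value is up to the maintainer

function initialScore(node) {
  let score = 0;

  // Existing behaviour described in the issue: reward the article tag.
  const name = (node.name || '').toLowerCase();
  if (name === 'article') {
    score += ARTICLE_TAG_SCORE;
  }

  // Proposed behaviour: reward itemprop="articleBody" on any tag.
  // itemprop may list several space-separated property names.
  const itemprop = (node.attribs && node.attribs.itemprop) || '';
  const props = itemprop.split(/\s+/).filter(Boolean);
  if (props.includes('articleBody')) {
    score += ARTICLE_BODY_SCORE;
  }

  return score;
}
```

With this sketch, a div carrying itemprop="articleBody" scores the same as a bare article tag, and an article tag that also carries the itemprop gets both bonuses.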

midudev commented 8 years ago

An example of publication using it: http://www.lequipe.fr/Tennis/Actualites/Open-d-australie-gael-monfils-s-arrete-en-quart-de-finale/628639

And as you can see, it uses the itemprop inside a tag, and the content it wraps is exactly right.

Another one: http://www.elconfidencial.com/espana/comunidad-valenciana/2016-01-27/los-hombres-de-negro-del-caso-rus-destapan-contratos-amanados-por-100-millones_1142249/

<div class="news-body-center cms-format " itemprop="articleBody" id="news-body-center">

And as you can see, it's the perfect fit for the needs of the parser.

Tjatse commented 8 years ago

Pretty nice, thanks.

midudev commented 8 years ago

Thanks to you, keep up the good work!