Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.

Some titles are broken #23

Closed midudev closed 8 years ago

midudev commented 8 years ago

URL: http://www.abc.es/economia/abci-presidente-ceoe-reconoce-preocupacion-dificultad-para-formar-gobierno-201601271013_noticia.html

Title expected: El presidente de la CEOE reconoce su «preocupación» por la dificultad para formar gobierno

Title from read-art: El presidente de la CEOE reconoce su

The problem is with the « and » symbols: they are treated as separators when searching for a better title, so the title gets cut at the first one. It might be a good idea to cut titles only when those symbols separate the first or last word (or sit near the beginning or the end of the string).

Tjatse commented 8 years ago

A new option will be exposed to let users customize title extraction.

midudev commented 8 years ago

That would be nice, thanks!

Tjatse commented 8 years ago

Try the latest version with:

read({
  // Custom function: return the title you want to keep (here, unchanged).
  betterTitle: function(title){
    return title;
  },
  // ...
})

or

read({
  betterTitle: 1000,
  // ...
})
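
For illustration, the heuristic suggested earlier in the thread (cut only when a separator sits near the beginning or the end of the title) might be written as a custom function along these lines. This is a minimal sketch, not read-art's own logic: `trimEdgeSegments`, `countWords`, the separator set, and the `maxEdgeWords` threshold are all hypothetical, and it assumes the `betterTitle` function receives the raw page title and returns the title to use.

```javascript
// Hypothetical helper, not part of read-art.
// Separators: pipe, underscore, guillemets, angle brackets, or a spaced dash.
const SEPARATOR = /[|_«»<>]|\s[-–]\s/g;

// Count whitespace-delimited words; an empty or blank string has zero words.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Remove a leading or trailing segment only if it is short (1..maxEdgeWords
// words), e.g. a site name. Separators in the middle of the title are kept.
function trimEdgeSegments(title, maxEdgeWords = 2) {
  let result = title.trim();

  // Drop a short trailing segment such as " | ABC.es".
  const matches = [...result.matchAll(SEPARATOR)];
  if (matches.length > 0) {
    const last = matches[matches.length - 1];
    const tailWords = countWords(result.slice(last.index + last[0].length));
    if (tailWords >= 1 && tailWords <= maxEdgeWords) {
      result = result.slice(0, last.index).trim();
    }
  }

  // Drop a short leading segment such as "ABC.es - ".
  const first = [...result.matchAll(SEPARATOR)][0];
  if (first) {
    const headWords = countWords(result.slice(0, first.index));
    if (headWords >= 1 && headWords <= maxEdgeWords) {
      result = result.slice(first.index + first[0].length).trim();
    }
  }

  return result;
}

// Possible usage, following the option shape shown above:
// read({
//   betterTitle: trimEdgeSegments,
//   // ...
// })
```

With this sketch the title from this issue is left whole, because both « and » are more than two words away from either end. A title made of a single word followed by a quoted word would still be cut, so the threshold is a trade-off rather than a complete fix.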