Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.
341 stars, 36 forks

Remove inline styles #3

Closed: midudev closed this issue 8 years ago

midudev commented 8 years ago

Inline styles are preserved in the extracted content. It might be better to strip them out.

You can reproduce this with the following URL (a Spanish site): http://www.abc.es/deportes/futbol/20150923/abci-celta-barcelona-liga-201509232234.html

"content": "<p>El <span style=\"font-weight:bold;font-style:normal;\">Celta </span>es un equipo incómodo, ganarle no es fácil y exige un gasto físico muy desagradable. ...",

Example URLs:
http://www.sport.es/es/noticias/real-madrid/higuain-niega-las-criticas-cristiano-ronaldo-4543758
http://www.abc.es/deportes/futbol/20150923/abci-celta-barcelona-liga-201509232234.html

midudev commented 8 years ago

Btw, amazing work. :+1: I hope to be able to collaborate as soon as I get some time!

Tjatse commented 8 years ago

Much appreciated!!

midudev commented 8 years ago

I've added another example to the initial comment if that helps.

Tjatse commented 8 years ago

Gotcha, and thank you very much. I'll implement this ASAP (it's the Mid-Autumn Festival and National Day holiday these days, so I'll be back at my laptop on 10/8, sorry :) ).

midudev commented 8 years ago

No problem! Enjoy the holidays! ;)

Tjatse commented 8 years ago

Sorry for the delay. Try setting tidyAttrs to true:

read('http://www.abc.es/deportes/futbol/20150923/abci-celta-barcelona-liga-201509232234.html', { tidyAttrs: true }, function (err, art, options, resp) {
  // Network or parsing failure
  if (err) {
    console.log('[ERROR]', err.message);
    return;
  }
  // No article could be extracted from the page
  if (!art) {
    console.log('[WARNING] article not exist');
    return;
  }

  console.log('[INFO]', 'title:', art.title);
  console.log('[INFO]', 'content:', art.content);
});
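
For readers who want to see the requested effect in isolation, here is a minimal sketch of stripping inline style attributes from an HTML string. It is not how tidyAttrs is implemented in this library. `stripInlineStyles` is a hypothetical helper written only for illustration, and it uses a simple regular expression rather than a real HTML parser, so it assumes quoted attribute values and no `>` characters inside attributes.

```javascript
// Hypothetical helper (not part of this library): removes style="..." or
// style='...' attributes from tags in an HTML string.
// Assumption: attribute values are quoted and contain no '>' characters.
function stripInlineStyles(html) {
  // $1 keeps everything in the tag before the style attribute;
  // the leading \s+ ensures names like data-style are left alone.
  return html.replace(/(<[^>]*?)\s+style\s*=\s*("[^"]*"|'[^']*')/gi, '$1');
}

// Sample taken from the content reported at the top of this issue.
const sample =
  '<p>El <span style="font-weight:bold;font-style:normal;">Celta </span>es un equipo</p>';

console.log(stripInlineStyles(sample));
```

For anything beyond a quick check, a proper HTML parser would likely be a safer choice than a regular expression.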

Tjatse commented 8 years ago

Feel free to reopen this or file a new issue if the problem still exists.