Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.

Links are removed if <p> is missing #5

Closed by midudev 8 years ago

midudev commented 8 years ago

Having solved the mystery about the charset, I came across another problem with the same URL.

http://www.elimparcial.com/EdicionEnlinea/Notas/Sonora/22092015/1010394-Firma-CPA-convenio-con-Cofemer.html

With the same article, and forcing the charset to 'utf-8', some links are missing. The reason, I guess, is that the article doesn't wrap its content in <p> and uses <div> instead, so the links inside are treated as separate child nodes, and their text isn't long enough for them to be kept.

I tried adding the condition if ((tagName == 'span' || tagName == 'font' || tagName == 'a') && textLen > 0), but then the paragraph is closed before the link is appended.
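
To make the intended behaviour concrete, here is a minimal sketch of grouping the inline children of a <div> into paragraphs. This is not node-readability's actual code: the node shape ({ type, tagName, text, href }), the function name groupInlineChildren, and the example URL are all hypothetical, and only the tag list and the textLen > 0 check come from the condition above. The point it illustrates is that an inline element such as <a> is appended to the open paragraph, and the paragraph is closed only when a non-inline element is reached.

```javascript
// Hypothetical, simplified node shape (an assumption for illustration):
//   { type: 'text' | 'element', tagName?: string, text: string, href?: string }
const INLINE_TAGS = new Set(['span', 'font', 'a']);

// Groups consecutive text and inline nodes into <p>...</p> strings.
// No HTML escaping is done here; a real implementation would need it.
function groupInlineChildren(children) {
  const paragraphs = [];
  let current = [];

  // Close the open paragraph, if it has any content.
  const flush = () => {
    if (current.length > 0) {
      paragraphs.push('<p>' + current.join('') + '</p>');
      current = [];
    }
  };

  for (const node of children) {
    if (node.type === 'text') {
      // Skip whitespace-only text, keep everything else as-is.
      if (node.text.trim().length > 0) current.push(node.text);
    } else if (INLINE_TAGS.has(node.tagName)) {
      const textLen = node.text.trim().length;
      if (textLen === 0) continue;
      // Append to the open paragraph instead of closing it first,
      // so short links are kept no matter how little text they hold.
      current.push(node.tagName === 'a'
        ? '<a href="' + node.href + '">' + node.text + '</a>'
        : node.text);
    } else {
      // Any other element (e.g. <br>, a nested <div>) ends the paragraph.
      flush();
    }
  }
  flush();
  return paragraphs;
}

// Usage: children of a <div> that holds article text without <p> tags.
const children = [
  { type: 'text', text: 'The agreement was signed with ' },
  { type: 'element', tagName: 'a', text: 'Cofemer', href: 'http://example.com/cofemer' },
  { type: 'text', text: ' on Tuesday.' },
  { type: 'element', tagName: 'br', text: '' },
  { type: 'text', text: 'Second paragraph.' },
];
const result = groupInlineChildren(children);
console.log(result);
```

With this input, the link stays inside the first paragraph, between the two pieces of text around it, rather than being dropped or pushed after the paragraph has been closed.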

midudev commented 8 years ago

Another example of this behaviour: http://www.macrumors.com/2015/10/19/mission-motors-ceases-operations-apple/

Tjatse commented 8 years ago

thanks, I just wanna handle this in an elegant way :)

midudev commented 8 years ago

:+1: Perfect! :)

Tjatse commented 8 years ago

This should be fixed in v0.4.4. Feel free to reopen this issue or file a new one if the problem still exists.