Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.

Text display broken for economictimes articles #43

Open RaviBolla opened 7 years ago

RaviBolla commented 7 years ago

There is a problem reading the following URL: http://economictimes.indiatimes.com/news/politics-and-nation/justice-c-s-karnan-attends-to-chamber-related-work-at-calcutta-hc/articleshow/58110699.cms

The extracted HTML content:

<p class="read-art-extra-bonus">KOLKATA: Justice</p>
<a onclick="ga('send', 'event', 'ArticleShow', 'C S Karnan Click', 'In Article');" href="http://economictimes.indiatimes.com/topic/C-S-Karnan" target="_blank">C S Karnan</a>
<p class="read-art-extra-bonus">today went to his chamber at the</p>
<a onclick="ga('send', 'event', 'ArticleShow', 'Calcutta High Court Click', 'In Article');" href="http://economictimes.indiatimes.com/topic/Calcutta-High-Court" target="_blank">Calcutta High Court</a>
<p class="read-art-extra-bonus">for the first time since the</p>
<a onclick="ga('send', 'event', 'ArticleShow', 'Supreme Court Click', 'In Article');" href="http://economictimes.indiatimes.com/topic/Supreme-Court" target="_blank">Supreme Court</a>
<p class="read-art-extra-bonus">withdrew the judicial and administrative works from him on February 8.</p>

The text of the first paragraph is displayed like this:

KOLKATA: Justice

C S Karnan
today went to his chamber at the

Calcutta High Court
for the first time since the

Supreme Court
withdrew the judicial and administrative works from him on February 8.

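The display is broken because plain text nodes within "div" elements are converted to "p" elements. Since "p" is a block-level element, each text fragment ends up on its own line, separated from the inline links around it.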

Generally, plain text within a div should be wrapped in a "span" element instead. I think the following change will fix the issue: child.replaceWith('<span class="' + extBonusKey + '">' + childDom.data + '</span>')
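
To illustrate the difference, here is a minimal standalone sketch. It does not use the library's real code or node types: the plain objects in children, the helper wrapTextNodes, and the value assigned to extBonusKey (guessed from the class name in the HTML above) are all assumptions made for illustration only.

```javascript
// Simplified stand-in for the children of a <div>. These plain objects are
// hypothetical and are not the library's real node types.
const children = [
  { type: 'text', data: 'KOLKATA: Justice' },
  { type: 'tag', html: '<a href="http://economictimes.indiatimes.com/topic/C-S-Karnan">C S Karnan</a>' },
  { type: 'text', data: 'today went to his chamber at the' },
];

// Assumed value, taken from the class name in the extracted HTML.
// The real extBonusKey is defined inside the library.
const extBonusKey = 'read-art-extra-bonus';

// Wrap every bare text node in the given tag and leave element nodes untouched.
// The text is inserted as-is, mirroring the proposed replaceWith() call.
function wrapTextNodes(nodes, tagName) {
  return nodes
    .map((node) =>
      node.type === 'text'
        ? '<' + tagName + ' class="' + extBonusKey + '">' + node.data + '</' + tagName + '>'
        : node.html
    )
    .join(' ');
}

const withP = wrapTextNodes(children, 'p');       // current behaviour: block-level wrappers
const withSpan = wrapTextNodes(children, 'span'); // proposed behaviour: inline wrappers

console.log(withP);
console.log(withSpan);
```

With "p" wrappers a browser puts each fragment in its own block, which matches the broken output above. With "span" wrappers the fragments and the links stay in one inline run, so the sentence should read continuously.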