AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License
483 stars, 49 forks

javascript showing up as extracted article text #231

Closed: AndyTheFactory closed 11 months ago

AndyTheFactory commented 1 year ago

Issue by xoffey, Sat Jul 28 01:09:40 2018. Originally opened as https://github.com/codelucas/newspaper/issues/603


When processing sfgate.com, 15 recent articles came back with large amounts of JavaScript code instead of the real article text. This happens frequently.

One tell-tale sign is that they all contain the string "window._taboola = window._taboola". I inspected 2 of them closely and found that they have lengthy JavaScript segments coded as regular content elements rather than as script elements.
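
As a stopgap, the tell-tale string can be used to flag affected articles after extraction. This is only a rough sketch and not part of the library: `looks_like_javascript`, `JS_MARKERS`, `JS_PATTERN`, and the `min_hits` threshold are all hypothetical names and values chosen for illustration. Only the Taboola marker comes from the report above.

```python
import re

# The Taboola string is the marker reported in this issue; add others as needed.
JS_MARKERS = (
    "window._taboola = window._taboola",
)

# Hypothetical fallback heuristic: a few constructs that are common in
# JavaScript and rare in news prose.
JS_PATTERN = re.compile(r"\bfunction\s*\(|\bvar\s+\w+\s*=|\bwindow\.\w+")


def looks_like_javascript(text: str, min_hits: int = 3) -> bool:
    """Return True if extracted article text appears to contain JavaScript.

    A known marker string is treated as conclusive. Otherwise the text is
    flagged when at least `min_hits` JavaScript-like constructs are found.
    """
    if any(marker in text for marker in JS_MARKERS):
        return True
    return len(JS_PATTERN.findall(text)) >= min_hits
```

The function takes plain text, so it could be applied to an article's `text` attribute after `parse()` to skip or re-process flagged results. It does not fix the underlying extraction problem, and the fallback pattern may need tuning for articles that legitimately quote code.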
