codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.19k stars 2.12k forks

Error converting html to string. #904

Open tspier opened 3 years ago

tspier commented 3 years ago

I'm not sure if it's an issue with the HTML of the website, if there's an issue parsing Tajiki, or something else, but I tried scraping http://www.jumhuriyat.tj/index.php?art_id=44635 on the Heroku demo page and received the following notice: "Error converting html to string."

giggioman00 commented 3 years ago

I'm getting the same errors on multiple sites

itskrsna commented 3 years ago

even i am also facing the same issue, is this repository running?

johnbumgarner commented 3 years ago

The article http://www.jumhuriyat.tj/index.php?art_id=44635 cannot be scraped with Newspaper3k. The reason is related to the structure of the HTML, which doesn't provide a clear block of article text to extract.
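
To illustrate what "a clear block of article text" can mean in practice, here is a minimal sketch using only the Python standard library. It is not Newspaper3k's actual extraction algorithm; the names `ParagraphTextCollector` and `has_clear_article_block`, and the thresholds `min_paragraphs` and `min_chars`, are hypothetical and chosen only for this example. The assumption is that a page with several `<p>` elements carrying a reasonable amount of text is easier for an extractor to work with than a page with none.

```python
from html.parser import HTMLParser


class ParagraphTextCollector(HTMLParser):
    """Collect the whitespace-normalized text of each <p> element."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buffer = []
        self.paragraphs = []

    def flush_paragraph(self):
        # Store the buffered text as one paragraph, skipping empty ones.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed <p> is treated as ending where the next one begins.
            self.flush_paragraph()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def has_clear_article_block(html, min_paragraphs=3, min_chars=200):
    """Rough heuristic (an assumption, not Newspaper3k's logic):
    enough <p> elements with enough total text."""
    collector = ParagraphTextCollector()
    collector.feed(html)
    collector.close()
    collector.flush_paragraph()  # pick up a trailing unclosed <p>
    total_chars = sum(len(p) for p in collector.paragraphs)
    return len(collector.paragraphs) >= min_paragraphs and total_chars >= min_chars
```

A page whose body text sits outside paragraph markup would fail this check even when it contains plenty of words, which is roughly the kind of structure that makes automatic extraction unreliable.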

johnbumgarner commented 3 years ago

> I'm getting the same errors on multiple sites

@giggioman00 What sites are giving you issues?

johnbumgarner commented 3 years ago

> even i am also facing the same issue, is this repository running?

@blueshirtdeveloper What sites are giving you issues?

joshhoegen commented 2 years ago

Not sure how dead this thread is, but I'm getting the error on all articles from the following domains (whole article path for easy access):