AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.

MIT License


[SITES] https://www.bbc.com #625

Open Sriram629009746 opened 6 months ago

Sriram629009746 commented 6 months ago

First please check that it is really an issue with the library, and not some special case of website:

[ ] There is no paywall
[ ] You do not have to be logged in to see the articles
[ ] You tried using a common browser user agent in your configuration / call
[ ] The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://www.bbc.com

Some sample urls that I have tried

https://www.bbc.com/news/world-australia-67832905
https://www.bbc.com/news/business-67470876

The exact code I used to test these articles/website


import newspaper

url = "https://www.bbc.com/news/business-67470876"
result = newspaper.article(url)
print(result.text)
# the text is empty

What parts of the article are missing / not parsed correctly

[ ] Text Content

Other information, remarks, messages, etc: It was working until a few days ago. I am using the package with version 0.9.2

AndyTheFactory commented 6 months ago

Hi there,

make sure that you are not blocked by bbc - try:

import requests
response = requests.get("https://www.bbc.com/news/business-67470876")
print(response.status_code)
print(response.text)

also check if the article object has some content in the html property

can you also try the same code with v 0.9.3?
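
The html check suggested above could be done along these lines. This is only a sketch: `field_sizes` is a hypothetical helper, not part of newspaper4k, and it assumes the parsed article exposes `html`, `text`, and `title` attributes. A large `html` count next to a zero `text` count would suggest that the download worked and the extraction step is what fails.

```python
from types import SimpleNamespace


def field_sizes(article) -> dict:
    """Report how many characters each field of a parsed article holds."""
    return {
        "html": len(article.html or ""),
        "text": len(article.text or ""),
        "title": len(article.title or ""),
    }


# Stand-in object so the sketch runs offline; with the library you would
# pass the result of newspaper.article(url) instead.
sample = SimpleNamespace(html="<html><body>...</body></html>", text="", title="Example")
print(field_sizes(sample))
```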

Sriram629009746 commented 6 months ago

Thank you for responding.

I am not blocked by the BBC. Most other fields are non empty like title, authors and even html which seems to have the text of the article. Just the 'text' field in the output which used to be populated earlier is now empty.

I noticed that the website UI of the BBC has changed. I think that could be the reason for this issue.

I am unable to import newspaper after installing the v 0.9.3. There seems to be some dependency issue which I am trying to figure out. But I don't think this could be the problem.

AndyTheFactory commented 6 months ago

> I am unable to import newspaper after installing the v 0.9.3. There seems to be some dependency issue which I am trying to figure out. But I don't think this could be the problem.

What's the error? I did in fact change how dependencies are installed.

> I am not blocked by the BBC. Most other fields are non empty like title, authors and even html which seems to have the text of the article. Just the 'text' field in the output which used to be populated earlier is now empty.

You are right, there seems to be an issue with bbc. I will investigate it.

AndyTheFactory commented 6 months ago

Yeah, there is a problem. It seems that bbc.com is now rendered entirely dynamically: the page is constructed with javascript after it loads. Here, you can see that there are no text elements to render without javascript: https://www.textise.net/showText.aspx?strURL=https%253A//www.bbc.com/news/business-67470876

Quick Fixes:

1. Easiest: replace bbc.com with bbc.co.uk. https://www.bbc.co.uk/news/business-67470876 works just fine.
2. A little more cumbersome: use playwright to render the webpage before parsing it. Here is an example
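
Both quick fixes can be sketched roughly as follows. This is not the example referred to above, just a minimal illustration under some assumptions: `to_co_uk` and `fetch_rendered_article` are hypothetical helper names, playwright and its Chromium browser are assumed to be installed, and waiting for `networkidle` is assumed to be enough for the article body to appear.

```python
from urllib.parse import urlsplit, urlunsplit


def to_co_uk(url: str) -> str:
    """Swap a bbc.com host for bbc.co.uk, leaving path and query unchanged."""
    parts = urlsplit(url)
    host = parts.netloc
    if host == "bbc.com" or host.endswith(".bbc.com"):
        host = host[: -len("bbc.com")] + "bbc.co.uk"
    return urlunsplit(parts._replace(netloc=host))


def fetch_rendered_article(url: str):
    """Render the page in headless Chromium, then let newspaper parse it."""
    # Imported here so the URL helper above works without these packages.
    from newspaper import Article
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Assumption: the page has finished building once the network is idle.
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()

    article = Article(url)
    article.download(input_html=html)  # use the rendered HTML, skip the HTTP fetch
    article.parse()
    return article


print(to_co_uk("https://www.bbc.com/news/business-67470876"))
```

Calling `fetch_rendered_article(url).text` needs network access and a browser, so only the URL rewrite is exercised here.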

Anyway, I will think about alternative solutions.

Sriram629009746 commented 6 months ago

The html component of the response seems to have the text content although not in a contiguous paragraph form. Maybe that is something to look at.

I tried v0.9.3 on google colab. Regarding the error while importing newspaper, this is what I got when I did pip install:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.24.0 requires pandas<2.1.4,>=1.5.0, but you have pandas 2.2.1 which is incompatible.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.2.1 which is incompatible.
Successfully installed feedparser-6.0.11 newspaper4k-0.9.3 numpy-1.26.4 pandas-2.2.1 requests-file-2.0.0 sgmllib3k-1.0.0 tldextract-5.1.1 tzdata-2024.1

When I try to import after installation, I get this error:

2dareis2do commented 5 months ago

On the subject of bbc and dates: why does the bbc article now prepend the date to the _text string?

e.g.

Published\n\n8 March\n\n

Source: https://www.bbc.co.uk/news/uk-england-london-68511760

2dareis2do commented 5 months ago

interesting about playwright and textise.

Content for bbc seems to be in the main page render as well as attached via some nonce window object.

e.g.

<script nonce>
window.__INITIAL_DATA__={}
</script>
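
If the content really does live in that window object, it could in principle be read without a browser. The following is a rough sketch only: `extract_initial_data` is a hypothetical helper, and it assumes the right-hand side of the assignment is a JSON literal (either an object, or a string that itself contains JSON). The actual structure of the data on bbc.com is not shown in this thread and is not assumed here.

```python
import json
import re

# Matches the assignment up to (and including) any whitespace after "=".
_ASSIGN = re.compile(r"window\.__INITIAL_DATA__\s*=\s*")


def extract_initial_data(html: str):
    """Return the decoded value assigned to window.__INITIAL_DATA__, or None."""
    match = _ASSIGN.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (e.g. ";" or the closing script tag).
    value, _ = json.JSONDecoder().raw_decode(html, match.end())
    if isinstance(value, str):
        # Assumption: a string value holds JSON that was encoded a second time.
        value = json.loads(value)
    return value


page = '<script nonce>\nwindow.__INITIAL_DATA__={"a": 1}\n</script>'
print(extract_initial_data(page))
```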