Open Sriram629009746 opened 6 months ago
Hi there,
Make sure that you are not blocked by the BBC. Try:
import requests
response = requests.get("https://www.bbc.com/news/business-67470876")
print(response.status_code)
print(response.text)
Also check whether the article object has some content in its html property.
Can you also try the same code with v0.9.3?
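The check on the article object can be sketched as a small helper that lists which fields came back empty. `empty_fields` is a hypothetical name, not part of the library; the sketch only assumes that the parsed article exposes `title`, `authors`, `text` and `html` as attributes.

```python
def empty_fields(article, names=("title", "authors", "text", "html")):
    """Return the names of the attributes that are missing or empty.

    `article` can be any object; a parsed newspaper Article is the
    intended input, but nothing here depends on the library itself.
    """
    return [name for name in names if not getattr(article, name, None)]
```

For a parsed article where only the body is missing, this would return `["text"]`, which separates "blocked or empty download" (html is empty too) from "downloaded but not extracted".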
Thank you for responding.
I am not blocked by the BBC. Most other fields are non-empty, like title, authors and even html, which seems to have the text of the article. Only the 'text' field in the output, which used to be populated, is now empty.
I noticed that the website UI of the BBC has changed. I think that could be the reason for this issue.
I am unable to import newspaper after installing v0.9.3. There seems to be some dependency issue which I am trying to figure out. But I don't think this could be the problem.
> I am unable to import newspaper after installing v0.9.3. There seems to be some dependency issue which I am trying to figure out. But I don't think this could be the problem.
What's the error? I did in fact change how dependencies are installed.
> I am not blocked by the BBC. Most other fields are non-empty, like title, authors and even html, which seems to have the text of the article. Only the 'text' field in the output, which used to be populated, is now empty.
You are right, there seems to be an issue with the BBC. I will investigate it.
Yeah, there is a problem. It seems that bbc.com is now rendered dynamically: the page is constructed with JavaScript after it loads. Here you can see that there are no text elements to render without JavaScript: https://www.textise.net/showText.aspx?strURL=https%253A//www.bbc.com/news/business-67470876
Quick Fixes:
Anyway, I will think about alternative solutions.
The html component of the response seems to have the text content although not in a contiguous paragraph form. Maybe that is something to look at.
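One way to look at this is to pull the paragraph text straight out of the downloaded HTML. The following is a minimal sketch using only the standard library; `ParagraphCollector` and `paragraphs_from_html` are hypothetical names, and it assumes the article body sits in `<p>` elements, which has not been verified for the BBC pages here.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> elements, in document order."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []      # text pieces of the paragraph being read
        self.paragraphs = []   # finished, whitespace-normalised paragraphs

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # also closes an unterminated previous <p>
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> (scripts, navigation, etc.) is ignored
        if self._in_paragraph:
            self._buffer.append(data)


def paragraphs_from_html(html):
    """Return the list of paragraph strings found in an HTML document."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Joining the result with blank lines, for example `"\n\n".join(paragraphs_from_html(article.html))`, gives a rough contiguous text. It will also pick up captions and footer paragraphs, so it is a diagnostic rather than a replacement for the extractor.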
I tried v0.9.3 on Google Colab. Regarding the error while importing newspaper, this is what I got when I did pip install:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.24.0 requires pandas<2.1.4,>=1.5.0, but you have pandas 2.2.1 which is incompatible.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.2.1 which is incompatible.
Successfully installed feedparser-6.0.11 newspaper4k-0.9.3 numpy-1.26.4 pandas-2.2.1 requests-file-2.0.0 sgmllib3k-1.0.0 tldextract-5.1.1 tzdata-2024.1
When I try to import after installation, I get this error:
On the subject of the BBC and dates, why does a BBC article now prepend the date to the _text string?
e.g.
Published\n\n8 March\n\n
source https://www.bbc.co.uk/news/uk-england-london-68511760
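Until the cause is known, the prefix can be split off after extraction. This is only a workaround sketch: `split_published_prefix` is a hypothetical helper, and the pattern is guessed from the single sample above (the word "Published", blank lines, one short date line, blank lines), so other articles may need a different pattern.

```python
import re

# Assumed shape, taken from the one sample above: "Published", blank
# line(s), a short date line (at most 40 characters), blank line(s).
_PUBLISHED_PREFIX = re.compile(r"\A\s*Published\s*\n+([^\n]{1,40})\n+")


def split_published_prefix(text):
    """Return (date_line, body); date_line is None when no prefix is found."""
    match = _PUBLISHED_PREFIX.match(text)
    if match is None:
        return None, text
    return match.group(1).strip(), text[match.end():]
```

Text that does not start with the prefix is returned unchanged, so the helper can be applied to every article without checking the source first.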
Interesting about playwright and textise.
Content for the BBC seems to be in the main page render as well as attached to the window object via a script tag with a nonce.
e.g.
<script nonce>
window.__INITIAL_DATA__={}
</script>
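If the content really is attached this way, it could be read from the raw HTML without running JavaScript. The sketch below assumes the value assigned to `window.__INITIAL_DATA__` is a JSON literal, or a JSON string that itself contains JSON; neither the encoding nor the structure of the payload has been confirmed here, and `extract_initial_data` is a hypothetical helper.

```python
import json
import re

# Assumed pattern: a plain assignment of a JSON literal to the window object
_ASSIGNMENT = re.compile(r"window\.__INITIAL_DATA__\s*=\s*")


def extract_initial_data(html):
    """Return the decoded value assigned to window.__INITIAL_DATA__, or None."""
    match = _ASSIGNMENT.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (";", "</script>", ...)
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    # Handle the case where the payload is a JSON string holding JSON
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            pass  # keep the plain string as it is
    return value
```

Where the article text lives inside the returned object is not covered by this sketch; that would need to be worked out by inspecting the payload of a real page.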
First please check that it is really an issue with the library, and not some special case of website:
[ ] There is no paywall
[ ] You do not have to be logged in to see the articles
[ ] You tried using a common browser user agent in your configuration / call
[ ] The website is not in the list of well known problematic sites
Your report as follows:
Website that does not parse correctly:
Some sample urls that I have tried
The exact code I used to test these articles / this website
What parts of the article are missing / not parsed correctly:
[ ] Text Content
Other information, remarks, messages, etc.: It was working until a few days ago. I am using version 0.9.2 of the package.