Open phongtnit opened 5 years ago
Hi, assuming that you have a scrapy Response object the following should work:
from newspaper import Article
article = Article(response.url)
article.download(input_html=response.body)
article.parse()  # parse() fills in the article in place, so do not reassign its result
When you pass input_html to article.download, no request is sent; the HTML you pass is used instead, as you can see here.
Thanks @TeodorIvanov for your awesome help, your code works well.
Hello, I tried your code and found that it does not work well when I set a particular language, for example:
'''
article = Article(response.url, language='zh')
article.download(input_html=response.body)
'''
Thanks for your support.
Hi @Jeafi, what is the error that you are getting?
My code looks like this:
'''
import requests
from newspaper import Article
response = requests.get('http://district.ce.cn/newarea/sddy/201812/26/t20181226_31117372.shtml')
article = Article(response.url, language='zh')
article.download(input_html=response.text)
article.parse()
'''
and I find that article.text and article.title are '[]'.
And if I use:
'''
url = "http://district.ce.cn/newarea/sddy/201812/26/t20181226_31117372.shtml"
article = Article(url, language='zh')
article.download()
article.html
'''
the result is perfect. The situation is the same for other Chinese-language websites. I want to use newspaper in a Scrapy project, but it doesn't work well. Thank you for your reply; I would be very glad if you could help.
I think you are doing something wrong, because the Newspaper package works well with Scrapy. The following code gets the title and content of the URL you want.
Thank you for your help. It works well with Scrapy. I still do not know why it does not work with plain requests in Python, but that does not matter now.
Hi @Jeafi, we can use any Python package to download a page, such as requests, Scrapy, or the built-in method in the Newspaper package. So why do you think Newspaper doesn't work with the requests package?
I may be doing something wrong, but you can try this code:
'''
import requests
from newspaper import Article
response = requests.get('http://district.ce.cn/newarea/sddy/201812/26/t20181226_31117372.shtml')
article = Article(response.url, language='zh')
article.download(input_html=response.text)
article.parse()
'''
(:
Hi @Jeafi, your code doesn't work because of the text encoding. When you make a request, Requests makes an educated guess about the encoding of the response based on the HTTP headers. That guessed encoding is used when you access r.text. You can find out which encoding Requests is using, and change it, through the r.encoding property.
So, to fix it, set the encoding before you read response.text: response.encoding = response.apparent_encoding
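To illustrate why a wrong encoding guess leaves the parser with unusable text, here is a minimal sketch using only the standard library. The sample string and the GB2312 charset are assumptions chosen for illustration; they are not taken from the page discussed above. Requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, which is the case simulated here.

```python
# Illustration only: the sample text and the GB2312 charset are assumptions,
# not taken from the page discussed in this thread.
original = "新闻标题"

# Bytes as a server might send them for a GB2312-encoded Chinese page.
raw = original.encode("gb2312")

# With no charset in the Content-Type header, Requests decodes text/*
# responses as ISO-8859-1, so response.text would look like this:
garbled = raw.decode("iso-8859-1")

# Decoding with the charset detected from the body (the value that
# response.apparent_encoding is meant to supply) recovers the text.
recovered = raw.decode("gb2312")

print(repr(garbled))
print(recovered)
```

Assigning response.encoding = response.apparent_encoding before reading response.text makes Requests perform the second kind of decode rather than the first.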
That's it. Thank you. 👍 👍 👍 👍 👍
For some of the sites, I was able to use the public WordPress API to get the content, title, modified date, etc. My other use case is caching the data in a database, or running the crawl as a separate process.
How can I pass this information into Article?
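One possible approach for the cached-HTML case, sketched under assumptions: store the raw HTML when the crawl runs, then hand it to Article later through the same input_html argument shown earlier in this thread. The SQLite schema, the helper names (save_html, load_html, parse_cached), and the example URL are all hypothetical.

```python
import sqlite3


def save_html(conn, url, html):
    """Store (or overwrite) the raw HTML for a URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
    )
    conn.commit()


def load_html(conn, url):
    """Return the cached HTML for a URL, or None if it was never stored."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None


def parse_cached(conn, url):
    """Build a parsed Article from cached HTML without any network request."""
    # Imported here so the cache helpers work even without newspaper installed.
    from newspaper import Article

    html = load_html(conn, url)
    if html is None:
        raise KeyError(f"no cached HTML for {url}")
    article = Article(url)
    article.download(input_html=html)  # uses the given HTML, sends no request
    article.parse()
    return article


# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT)")
save_html(conn, "https://example.com/post", "<html><body><p>Hello</p></body></html>")
```

Calling parse_cached(conn, "https://example.com/post") in a separate process would then parse the stored page. This sketch covers only cached HTML; fields already returned by an API, such as the title or modified date, would not need extraction at all.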
Hello,
Is it possible to use the newspaper package to extract content while using another Python package, such as Scrapy, for the scraping? I mean that I want to use Scrapy to download a URL and pass the content to Newspaper to extract the title, content, and other metadata. If possible, please guide me on how to do that.
I see the code:
The above code will get the clean content of the article. However, I want to get more attributes of the Article object, such as article_html, top_image, and keywords, in the following code. Please help me get these attributes.
Thanks for your support,
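A minimal sketch of how those extra attributes might be collected from HTML that was already downloaded elsewhere, for example by Scrapy. The helper name extract_fields and the returned dictionary layout are hypothetical. It assumes the keep_article_html option is passed to Article so that article_html is populated, and that nlp() is called so that keywords is filled in; nlp() needs the NLTK 'punkt' data to be installed.

```python
def extract_fields(url, html):
    """Parse already-downloaded HTML and return selected Article attributes."""
    # Imported lazily so this module loads even where newspaper is not installed.
    from newspaper import Article

    # keep_article_html=True asks newspaper to retain the cleaned article markup.
    article = Article(url, keep_article_html=True)
    article.download(input_html=html)  # the given HTML is used, no request is sent
    article.parse()
    # nlp() fills in keywords (and a summary); it requires the NLTK 'punkt' data.
    article.nlp()
    return {
        "title": article.title,
        "text": article.text,
        "article_html": article.article_html,
        "top_image": article.top_image,
        "keywords": article.keywords,
    }
```

Inside a Scrapy callback this could be called as extract_fields(response.url, response.body), mirroring the snippet earlier in the thread.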