How can I use the newspaper package to extract content only?

codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:

https://goo.gl/VX41yK

MIT License

How can I use the newspaper package to extract content only? #668

Open phongtnit opened 5 years ago

phongtnit commented 5 years ago

Hello,

Is it possible to use the newspaper package only for content extraction and use another Python package, such as Scrapy, for the scraping? I mean that I want to use Scrapy to download a URL and pass the content to Newspaper to extract the title, content, and other metadata. If this is possible, please show me how to do it.

I have seen this code:

import requests
import newspaper

resp = requests.get("https://medium.com/s/story/white-power-kids-850fd7add39f")
print(newspaper.fulltext(resp.text))

The above code returns the clean text of the article. However, I also want the other attributes of the Article object, such as article_html, top_image, and keywords, as in the following code. Please help me get these attributes.

from newspaper import Article

url = "https://medium.com/s/story/white-power-kids-850fd7add39f"
article = Article(url, keep_article_html=True)
article.download()
article.parse()
print(article.title)
print(article.article_html)
print(article.top_image)
print(article.keywords)  # stays empty unless article.nlp() is called after parse()

Thanks for your support,

0xteo commented 5 years ago

Hi, assuming that you have a Scrapy Response object, the following should work:

from newspaper import Article

article = Article(response.url)
article.download(input_html=response.body)
article.parse()

When you pass input_html to article.download, no request is sent; the HTML you pass is used instead, as you can see in the source of the download method.
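To get the other attributes asked about above (article_html, top_image, keywords) from HTML that was downloaded elsewhere, the same input_html approach can be wrapped in a small helper. This is only a minimal sketch, not part of newspaper itself: the function name extract_from_html and its run_nlp flag are made up for illustration, and it assumes newspaper3k is installed. Since keywords is filled in by article.nlp(), which needs NLTK's punkt tokenizer data, that step is optional here.

```python
def extract_from_html(url, html, language="en", run_nlp=False):
    """Parse HTML that was downloaded elsewhere; newspaper sends no request.

    Hypothetical helper for illustration, not a newspaper API.
    """
    # Imported inside the function so this module loads even where
    # newspaper3k is not installed (pip install newspaper3k).
    from newspaper import Article

    article = Article(url, language=language, keep_article_html=True)
    article.download(input_html=html)  # uses the given HTML, no HTTP request
    article.parse()                    # returns None; results live on `article`

    result = {
        "title": article.title,
        "text": article.text,
        "article_html": article.article_html,    # needs keep_article_html=True
        "top_image": article.top_image,
        "meta_keywords": article.meta_keywords,  # taken from the page's <meta> tags
    }
    if run_nlp:
        # keywords and summary are computed by nlp(), which requires
        # NLTK's punkt tokenizer data to be available.
        article.nlp()
        result["keywords"] = article.keywords
        result["summary"] = article.summary
    return result
```

Inside a Scrapy callback this could be called as extract_from_html(response.url, response.text).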

phongtnit commented 5 years ago

Thanks @TeodorIvanov for your awesome help, your code works well.

Jeafi commented 5 years ago

Hello, I tried your code and found that it does not work well when I set a particular language, like:

article = Article(response.url, language='zh')
article.download(input_html=response.body)

Thanks for your support.

0xteo commented 5 years ago

Hi @Jeafi, what is the error that you are getting?

Jeafi commented 5 years ago

My code looks like this:

import requests
from newspaper import Article

response = requests.get('http://district.ce.cn/newarea/sddy/201812/26/t20181226_31117372.shtml')
article = Article(response.url, language='zh')
article.download(input_html=response.text)
article.parse()

and I find that article.text and article.title are '[]'.

And if I use:

url = "http://district.ce.cn/newarea/sddy/201812/26/t20181226_31117372.shtml"
article = Article(url, language='zh')
article.download()
article.html

the result is perfect. The same happens with other Chinese-language websites. I want to use newspaper in a Scrapy project, but it doesn't work well. Thank you for your reply; I would be very glad if you could help.

phongtnit commented 5 years ago

I think you did something wrong; the Newspaper package works well with Scrapy. The following code gets the title and content of the URL you want.

Scrapy spider

Jeafi commented 5 years ago

Thank you for your help. It works well with Scrapy. I still do not know why it does not work with plain requests in Python, but it does not matter now.

phongtnit commented 5 years ago

Hi @Jeafi, we can use any Python package to download a website, such as requests, Scrapy, or the built-in method in the Newspaper package. So why do you think Newspaper doesn't work with the requests package?

Jeafi commented 5 years ago

I may be doing something wrong, but you can try this code:

import requests
from newspaper import Article

response = requests.get('http://district.ce.cn/newarea/sddy/201812/26/t20181226_31117372.shtml')
article = Article(response.url, language='zh')
article.download(input_html=response.text)
article.parse()

(:

phongtnit commented 5 years ago

Hi @Jeafi, your code doesn't work because of the text encoding. When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

So, to fix it, set the encoding before you read response.text:

response.encoding = response.apparent_encoding

Full fixed code
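The code originally linked above is not reproduced here. What follows is only a minimal sketch of the fix under some assumptions: the page is taken to be GB2312-encoded and served without a charset in its Content-Type header (not verified for this URL), in which case Requests falls back to ISO-8859-1 for text responses. The helper names decode_like_requests, mojibake_demo, and parse_with_requests are made up for illustration.

```python
def decode_like_requests(raw, encoding):
    """Decode bytes the way response.text does: undecodable bytes are replaced."""
    return raw.decode(encoding, errors="replace")


def mojibake_demo():
    """Show how a wrong charset guess garbles Chinese text (standard library only)."""
    original = "新闻标题"
    raw = original.encode("gb2312")                  # assumed page encoding
    wrong = decode_like_requests(raw, "iso-8859-1")  # header-based fallback guess
    right = decode_like_requests(raw, "gb2312")      # encoding detected from the body
    return original, wrong, right


def parse_with_requests(url, language="zh", timeout=10):
    """Download with requests, fix the encoding, then parse with newspaper."""
    # Third-party imports are kept local so mojibake_demo() runs without them.
    import requests
    from newspaper import Article

    response = requests.get(url, timeout=timeout)
    # Replace the header-based guess with the encoding detected from the
    # body, before response.text is read.
    response.encoding = response.apparent_encoding

    article = Article(response.url, language=language)
    article.download(input_html=response.text)
    article.parse()
    return article
```

mojibake_demo() returns the original string, an unreadable ISO-8859-1 decoding of the same bytes, and the correct GB2312 decoding. If newspaper is handed text like the unreadable version, it has no Chinese content to work with, which would explain the empty title and text.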

Jeafi commented 5 years ago

That's it. Thank you. 👍 👍 👍 👍 👍

senthilnayagam commented 5 years ago

For some of the sites, I was able to use the public WordPress API to get the content, title, modified date, etc. My other use case is that I want to cache the data in a database, or run the crawling as a separate process.

How can I pass this info into Article?
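One possible way to do this, following the input_html approach from earlier in this thread, is to store the raw HTML and hand it to Article later. The sketch below assumes a SQLite cache; the table layout and the helper names (open_cache, save_page, load_page, article_from_cache) are made up for illustration, and the crawler that fills the cache can be a separate process.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a small HTML cache; hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, html):
    """Store the HTML fetched by the crawler, overwriting any older copy."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
    )
    conn.commit()


def load_page(conn, url):
    """Return the cached HTML for url, or None if it was never stored."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None


def article_from_cache(conn, url, language="en"):
    """Build a parsed Article from cached HTML without any HTTP request."""
    # Imported here so the cache helpers work without newspaper3k installed.
    from newspaper import Article

    html = load_page(conn, url)
    if html is None:
        raise KeyError(f"no cached page for {url}")
    article = Article(url, language=language)
    article.download(input_html=html)  # uses the cached HTML as-is
    article.parse()
    return article
```

The crawling process would call save_page for each response, and the extraction process would later call article_from_cache(conn, url) and read article.title, article.text, and so on.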