codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Website login functionality #587

Open nup002 opened 6 years ago

nup002 commented 6 years ago

This is not so much an issue as a suggestion/request. It came up when I needed to pull articles from a website with a paywall. I have a user account, but could not for the life of me figure out how to get a logged-in session going in Newspaper.

If the solution is simple and I am just too daft to understand it, feel free to come with suggestions.

njwfish commented 6 years ago

bumping

nup002 commented 6 years ago

I think I have found a workaround. It simply requires all requests to be made with a Session object. You would first create a Session object, log in to the website with the Session using a tool such as Selenium, and then pass the Session object to Newspaper to be used in all future requests.
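One way to get a browser login into a Session is to copy the browser's cookies across. The following is a minimal sketch, assuming the cookies arrive as a list of dicts shaped like the output of Selenium's `driver.get_cookies()`; the helper name `session_from_cookies` is hypothetical and not part of Newspaper or requests.

```python
import requests


def session_from_cookies(cookies, user_agent=None):
    """Build a requests.Session from browser-style cookie dicts.

    Assumes each dict has at least "name" and "value", and optionally
    "domain" and "path", as Selenium's driver.get_cookies() returns.
    """
    session = requests.Session()
    if user_agent:
        # Some sites tie a login to the browser's User-Agent string.
        session.headers["User-Agent"] = user_agent
    for cookie in cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session
```

After logging in through the browser, something like `session_from_cookies(driver.get_cookies())` would give a Session carrying the same cookies. Whether that is enough to stay logged in depends on the site.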

Changes needed:

In configuration.py:

  1. Add "session = requests.Session()" as a new parameter.

In network.py:

  1. Add "session = config.session" to the function "get_html_2XX_only()".
  2. Replace "response = requests.get" with "response = session.get" in the function "get_html_2XX_only()".
  3. Add "self.session = config.session" to the "__init__()" method of the class "MRequest".
  4. Replace "self.resp = requests.get" with "self.resp = self.session.get" in the "send()" method of the class "MRequest".

Once you have logged in to your Session object with Selenium, set the session parameter in configuration.py to this object.
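The shape of the change described above can be sketched outside the library. This is not Newspaper's actual code: `SketchConfig` and `get_html_2xx_only` are hypothetical stand-ins, and the timeout and header values are assumptions. It only illustrates routing every request through one shared Session held by the configuration object.

```python
import requests


class SketchConfig:
    """Hypothetical stand-in for a configuration object holding a session."""

    def __init__(self, session=None):
        # One shared session keeps login cookies across every request.
        self.session = session if session is not None else requests.Session()
        self.request_timeout = 7  # seconds; assumed value
        self.headers = {"User-Agent": "newspaper-sketch/0.1"}  # assumed value


def get_html_2xx_only(url, config):
    """Fetch url through config.session instead of a bare requests.get."""
    response = config.session.get(
        url, headers=config.headers, timeout=config.request_timeout
    )
    response.raise_for_status()  # raises on 4XX/5XX responses
    return response.text
```

With this shape, a caller who already holds a logged-in Session would pass it in once, for example `SketchConfig(session=my_session)`, and every later fetch would reuse its cookies.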

chsuong commented 6 years ago

@nup002, what about a website that uses logins and cookies?

timzhangau commented 6 years ago

I would also like to know of any good practices for using newspaper with websites that require a login, especially OAuth2 authentication.

BastianZim commented 5 years ago

In case anyone finds this via Google, check out #668 as well; it has some helpful suggestions.

karam93 commented 5 years ago

@nup002, could you post an example of how you stay logged in and bypass cookies?