airbnb / knowledge-repo

A next-generation curated knowledge sharing platform for data scientists and other technical professions.
Apache License 2.0
5.48k stars 689 forks source link

Can't login using GitHub OAuth #354

Closed harrdb12 closed 6 years ago

harrdb12 commented 6 years ago

Auto-reviewers: @NiharikaRay @matthewwardrop @earthmancash @danfrankj

Hi there. I am unable to login to the web interface using GitHub's Oauth system, despite the fact that I believed I've configured everything correctly. To wit, on GitHub's end I've put both the Homepage and Callback URL for OAuth applications to added the Client ID and Client Secret to my config_defaults.py file, and added to my SERVER_NAME variable. Everything seems to checkout, but I get returned to the web server with a 404 error, at the URL http:///auth/login/github/authorize?code=< code string >&state=< state > .

Even stranger, when I try to sign in using Debug, it sends me straight back to the web interface Feed without signing me in. In fact, the only way I'm able to sign in is to set 'SERVER_NAME = None'.

The only thing I can think of I am hosting the web server on a company machine, which I believe is only accessible via the company's network. But this doesn't completely explain to me the strange behavior of the Debug sign-on, since the server name is the same whether or not I hardcode it into the SERVER_NAME variable.

I was wondering if someone is able to offer some insight or suggest a workaround for my problem. Thank you.

matthewwardrop commented 6 years ago

Hi @harrdb12 ,

It sounds like you have things set up correctly, so the only thing that I can see possibly going wrong is the value of the SERVER_NAME config variable.

If SERVER_NAME is specified, flask will not serve any content via any other server name. So if you specify it to be '127.0.0.1', then accessing the server via 'localhost' will not work (nor would '127.0.0.2', etc), even if technically they resolve to the same socket.

Does this help? If not, can you provide a little more detail about your setup [feel free to use a substitution cipher on sensitive internal IPs or hostnames]?

harrdb12 commented 6 years ago

Hi @matthewwardrop , I will try to include more details.

For the SERVER_NAME I have tried using both the hostname and the IP address (and of course leaving the variable to None). In all cases the port I'm using is 8080, so that explicitly e.g. SERVER_NAME = T1H6AY5Y2OV:8080 or SERVER_NAME = 24.27.290.68:8080. In both cases, I am able to access the web server, so I don't believe SERVER_NAME is specified incorrectly.

I am also using the Enterprise version of GitHub, but I believe I've already configured everything correctly in the oauth2.py file, so I doubt that's relevant. Regardless, my feeling is the problem has something to do with the fact that I can't login successfully under Debug when SERVER_NAME is specified; as I mentioned, I am simply redirected back to the knowledge repo feed (http:/ /T1H6AY5Y2OV:8080/feed) without being logged in.

For the record, I'm using the latest version of the Knowledge Repo which I cloned directly from the GitHub and installed using pip install --upgrade . in the git directory using python 3.5. To deploy the web server, I'm using knowledge_repo deploy --port 8080 --engine flask. For some reason, I get an error about unsafe threads if I leave out the --engine flask line, which is something I didn't have when using an older installation (which I installed using pip install --upgrade knowledge-repo).

At the moment, I can simply use the Debug without setting SERVER_NAME and ask that users use a consistent username; if I'm the only one who has this problem, then I'll assume it has something unique to do with my situation, but I wanted to run this by you all to make sure I didn't miss something.

harrdb12 commented 6 years ago

Came across something surprising to me as I was testing deploying the server on my local machine... apparently the only way I can log in with SERVER_NAME set is if it is set to localhost and any port number. Some results form my tests:

0.0.0.0 : No connection can be made here.

127.0.0.1: Connections can be made here, but it is not possible to log in with SERVER_NAME set.

TEDRYZOMZWQZ1: Connections can be made to the computer name, but again not possible to log in.

localhost: Strangely enough, it is possible to both make a connection and log in using localhost as the server name. I don't know why, but I'm guessing because it isn't a valid web address.

Edit: I've further tested trying to authorize using Google, adding:

OAUTH_GOOGLE_CLIENT_ID = 'BOKBOR9TOIMK-THKAIX8YTKLMII3B8ME9VTMYACSE7GSV.apps.googleusercontent.com'
OAUTH_GOOGLE_CLIENT_SECRET = 'A8DDEE776WHI5OG47DYUKIOD'

to my config_defaults.py file and adding http://localhost/auth/login/google/authorize to my OAuth2 google authentication URI. I see a very similar error that that of GitHub: I get a Knowledge Repo 404 error with 'Uh Oh! Looks like something broke...' at a URL of http://localhost/auth/login/google/authorize?state=84YANCW08AZHR1W50QLWEII2VR20O2&code=K/6AC2X7VJQKLX8OH02OM8L-O_FMENXL9QC2FU8TD9ARL#

harrdb12 commented 6 years ago

Hi @matthewwardrop . I've figured out the issue and am now able to login via GitHub successfully, so you can close this issue. Here are some of the things I've found; I mostly modified the oauth2.py file:

There were a couple other changed I had to make, but I'd rather not share them at the moment and don't think they'd be helpful in general anyway. I will recommend to others to take advantage of the werkzeug debugger, using export WERKZEUG_DEBUG_PIN=off in your shell file if you are only testing locally.

matthewwardrop commented 6 years ago

@harrdb12 Great news! Sorry for the delay... I'm on paternity leave, and the kids don't leave much time to dig into this (and when I have time, I'm usually knackered!). Glad you got things sorted out, and thanks for the detailed write-up!

shashj199 commented 6 years ago

@harrdb12 @matthewwardrop I am facing similar issues as you. I am hosting the knowledge repo on my local machine. I am trying to use Google authentication. Apart from the config_defaults.py, do I need to change anything else in the code to make the authentication work? I am directly getting the feed without login.

Thanks

harrdb12 commented 6 years ago

@shashj199 Perhaps try editing your hosts file; IIRC google authentication requires that you use a valid website URL,so you may have to e.g. rename localhost so that it is a valid website; there is an example on StackedOverflow of how to do this. Apart from that, I'm not really sure; you really shouldn't need to edit any more on KnowledgeRepo's end; are you sure you have things configured correctly on google's end?

shashj199 commented 6 years ago

@harrdb12 This is the config file I am using.

import datetime
# ---------------------------------------------------
# Host configuration
# ---------------------------------------------------

# The server name is used by Flask to limit access to the
# served content to request to a particular domain. It
# is also used by some authentication providers (in particular
# OAuth providers) to advertise callback providers. If
# not provided, it is assumed in these contexts to be
# 'localhost:7000'. Be sure to specify this before deploying
# into production.
SERVER_NAME = '0.0.0.0:7000'

# The knowledge repository uses the secret key to sign user
# sessions. If not specified, a unique secret key will be
# generated every time the server starts up. If hosting
# in a multi-server environment, or you want sessions
# to persist accross server restarts, set this to something
# static.
SECRET_KEY = 'shashank'

# ---------------------------------------------------
# Debug configuration
# ---------------------------------------------------
DEBUG = False

# ---------------------------------------------------
# Database configuration
# ---------------------------------------------------
SQLALCHEMY_DATABASE_URI = 'mysql://user:password@localhost:3306/knowledge_repo_dec'
SQLALCHEMY_ECHO = False
SQLALCHEMY_TRACK_MODIFICATIONS = False

# Should the database tables be automatically created
DB_AUTO_CREATE = True

# Should the database be automatically migrated when updates exist
# Note: This is True by default if this configuration is not applied,
# under the assumption that if you are concerned about this file
# you are probably interested in minimising risk to stability and handling
# database upgrades manually. Manual database migrations can be
# performed using `knowledge_repo --repo <> db_upgrade ...`.
DB_AUTO_UPGRADE = False

# ---------------------------------------------------
# Authentication configuration
# ---------------------------------------------------
# Authentication providers allow users to sign into the Knowledge Repo
# in a variety of different ways. You can create your own subclass of
# `KnowledgeAuthProvider` and add either the instance or identifier
# used for that class below.
# By default, the knowledge repo offers: ['debug', 'bitbucket', 'github', 'google']
AUTH_PROVIDERS = ['google']

# If you are going to use a OAuth provider, you will need to specify client ids
# and private tokens. This can be done by instantiating instances of
# `OAuth2Provider` and adding them to the above list, or by specifying OAuth
# connection properties as demonstrated below for the GitHub authenticator.
# OAUTH_GITHUB_CLIENT_ID = '<client id>'
# OAUTH_GITHUB_CLIENT_SECRET = '<client_secret>'

OAUTH_GOOGLE_CLIENT_ID = 'someapi'
OAUTH_GOOGLE_CLIENT_SECRET = 'secret'

# You can also forgo a fully-fledged sign in process for users
# by hosting the knowledge repository behind a proxy server that
# pre-authenticates users, and adds the appropriate user identifier
# to the http headers of the request. If the headers are
# specified below, then they take precedence over any other forms
# of authentication. If they are specified but not populated, then
# the authentication flow will fall back to use any of the providers
# specified above.
AUTH_USER_IDENTIFIER_REQUEST_HEADER = None

# If the identifier used above needs some transformation to match the canonical
# identifier format used in this repository, you can specify a mapping using
# the below config option.
def AUTH_USER_IDENTIFIER_REQUEST_HEADER_MAPPING(identifier):
    return identifier

# If the server desires to modify the attributes of the `User` object associated with
# users logged in via any of the above authentication providers, it can do so via
# this configuration key. This function will be run once at user login (if using
# an `AuthenticationProvider`, and then at most once during any caching lifetime
# period (as specified below). Note that attributes collected via
# `AuthenticationProvider`s will not be updated after initial login (user must
# relogin in order to reset those attributes).
def AUTH_USER_ATTRIBUTE_SETTER(user):
    return user

# The time to wait before re-checking user attributes with the above function
# for users logged in via request headers.
AUTH_USER_ATTRIBUTE_CACHE_LIFETIME = 24 * 60 * 60  # 1 day

# Once a user is logged in via an authentication provider, they will remain
# logged in via the use of cookies. By default, this cookie will last one year.
# This is managed by `flask_login`, but is copied here for convenience.
# For other options regarding sessions, please refer to:
# https://flask-login.readthedocs.io/en/latest/#cookie-settings
REMEMBER_COOKIE_DURATION = datetime.timedelta(days=365)

# ---------------------------------------------------
# Policy configuration
# ---------------------------------------------------
# This section configures various policy related to access control.

# Should anonymous users be able to view the post indices
POLICY_ANONYMOUS_VIEW_INDEX = True

# Should anonymous users be able to view the content of posts
POLICY_ANONYMOUS_VIEW_POST = True

# Should anonymous users be able to view overall statistics
POLICY_ANONYMOUS_VIEW_STATS = True

# ---------------------------------------------------
# Repository configuration
# ---------------------------------------------------
# You may specify a function `prepare_repo` which configures
# the repository upon which this server is running. This
# takes place after the repository has been instantiated
# and before the server is able to serve requests. It is
# possible to do anything to the repository, including
# substituting the repository for another one.
# By default, repositories manage their own configurations,
# but this can be risky as they may run arbitrary python code,
# which opens a vector for malicious users to compromise
# the server. If you want to avoid this risk, pass
# the '--safe' (TODO!) option to `knowledge_repo` config and
# manually configure the repository here.
# For example, if your server instance is sitting atop
# a meta-repository, it may make sense to update the meta-repository
# configuration with that of one of its children.
def prepare_repo(repo):
    return repo

# ---------------------------------------------------
# Repository Indexing configuration
# ---------------------------------------------------
# The Knowedge Repo updates the index of available posts on a regular basis.
# If the database is not thread-safe (i.e. in the case of SQLite), then the
# index will be updated on the main thread before every request that is more
# than `INDEX_INTERVAL` seconds after the last sync completed. Otherwise,
# indexing will occur every `INDEX_INTERVAL` seconds after the previous sync.
# Syncing is designed to be compatible with multiple instances of the Knowledge
# Repo connected to the same database, accross multiple machines and/or
# processes; and so a global indexing lock is employed. When a sync begins,
# a sync lock is put in place and the responsible process is considered to be
# the primary agent responsible for syncing until its last update is longer than
# `INDEXING_TIMEOUT` seconds, whereby the lock is ceded to the next requesting
# process. Note that `INDEXING_INTERVAL` must be larger than `INDEXING_TIMEOUT`
# or strange things might begin to happen.
INDEXING_INTERVAL = 5 * 60  # 5 minutes
INDEXING_TIMEOUT = 10 * 60  # 10 minutes

# Whether an index operation should update repositories
INDEXING_UPDATES_REPOSITORIES = True

# Whether repositories should be updated even without a sync lock (in which case
# the repositories will be updated on the sync timers, even if the relevant
# process/thread does not have a lock on updating the index). This is useful in
# context of multiple Knowledge Repo servers working together to serve the
# repositories across multiple machines, which each require repository syncing.
# Disable this if (for some reason) you have multiple Knowledge Repo servers
# running on the same machine, and you want to avoid potential clashes. This
# key is ignored if `INDEXING_UPDATES_REPOSITORIES` is False
INDEXING_UPDATES_REPOSITORIES_WITHOUT_LOCK = True

# In some cases you may want to disable indexing entirely, which is currently
# only ever used by the Knowledge Post previewer. Disabling the index means that
# posts will not be discoverable, but if know the path in the repository you can
# view the post with a direct link.
INDEXING_ENABLED = True

# ---------------------------------------------------
# Flask Mail Configuration
# Refer to https://pythonhosted.org/flask-mail/
# Unless specified, upstream defaults are used as indicated
# provided that MAIL_SERVER is defined.
# ---------------------------------------------------
# MAIL_SERVER = 'localhost'  # default = 'localhost'
# MAIL_PORT = 25  # default = 25
# MAIL_USE_TLS = False  # default = False
# MAIL_USE_SSL = False  # default = False
# MAIL_DEBUG = False  # default = app.debug
# MAIL_USERNAME = None  # default = None
# MAIL_PASSWORD = None  # default = None
# MAIL_DEFAULT_SENDER = None  # default = None
# MAIL_MAX_EMAILS = None  # default = None
# MAIL_SUPPRESS_SEND = False  # default = app.testing
# MAIL_ASCII_ATTACHMENTS = False  # default = False

# --------------------------------------------------
# Web Editor Configuration
# --------------------------------------------------
# The web editor can be limited to editing posts under
# a limited set of parent directories by setting
# WEB_EDITOR_PREFIXES to a list of supported path prefixes.
# e.g. ['webposts', 'projects']
WEB_EDITOR_PREFIXES = ['webposts']

# ---------------------------------------------------
# Tag configuration
# ---------------------------------------------------
# Posts with certain tags can be excluded from showing up
# in the app. This can be useful for security purposes
EXCLUDED_TAGS = ['private']

I am passing this file while deploying knowledge repo as

knowledge --repo repo/ deploy --config config_defaults.py --config server_config.py

My doubt is that knowledge repo is not even executing login code since I am able to access the feed anyway. Let me know what I am doing wrong.

Thanks

harrdb12 commented 6 years ago

@shashj199

You can access the feed regardless of whether or not you're logged in; logging in simply allows you to write webposts, subscribe to tags, etc. So that's not a good metric to figuring out what's wrong.

First, I would try to get the 'Debug' authorization working. If you're trying to use 'Google' authorization, AFAIK, you need to also set that up on Google's end, which it doesn't appear like you've done so; I can tell because your config file has SERVER_NAME = '0.0.0.0:7000' which Google doesn't like.

Also, are you sure you have the most updated version of knowledge repo? There ought to be a 'login' button where it prompts you to enter your credentials.

matthewwardrop commented 6 years ago

@harrdb12 Thanks so much for taking lead on this. I've been out of action on paternity leave, and will be for a bit yet. I really appreciate your heling @shashj199 out. In case it's not obvious, everything you have shared is right on the mark.

I'll just point out that you can secure the feed page also, by modifying the policy config options:

# ---------------------------------------------------
# Policy configuration
# ---------------------------------------------------
# This section configures various policy related to access control.

# Should anonymous users be able to view the post indices
POLICY_ANONYMOUS_VIEW_INDEX = True

# Should anonymous users be able to view the content of posts
POLICY_ANONYMOUS_VIEW_POST = True

# Should anonymous users be able to view overall statistics
POLICY_ANONYMOUS_VIEW_STATS = True
mkolarek commented 6 years ago

@harrdb12 thank you so much for the detailed write up, I'm also trying to deploy the knowledge-repo (latest clone from master) with github enterprise authentication. I'm not sure if I have everything set up properly though, could you maybe share the details of your configuration? And what did you set up as the 'Authorization callback URL' in the developer applications setup in github?

harrdb12 commented 6 years ago

If you are using GitHub enterprise, in addition to the normal setup required in the app/config_defaults.py file, you will also need to modify the app/auth_providers/oauth2.py file: it's mostly just replacing the github url with your Enterprise github url for base, token, and auto_refresh url; for the base_url, you will probably need to modify it so that it looks like the example https://developer.github.com/v3/enterprise/ .

The authorization callback URL can simply be the URL of wherever you're hosting the knowledge repo; in other words, the same as the homepage URL.

shashj199 commented 6 years ago

@matthewwardrop @harrdb12 Thanks for all the help so far. Sorry I had been out of touch with knowledge repo for a while. I followed steps suggested by @harrdb12 and was able to at least get github authentication to work (but I got an error). It was similar to what @harrdb12 had observed 'Uh Oh! Looks like something broke...'. I am posting the error here. I also tried @harrdb12 's solution by changing extract from dict function but got the same error.

[2018-01-19 20:21:46,452] ERROR in app: Exception on /auth/login/github/authorize [GET]
Traceback (most recent call last):
  File "/Users/shashank/Envs/knowledgerepo_jan_suggest/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/shashank/Envs/knowledgerepo_jan_suggest/lib/python2.7/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/shashank/Envs/knowledgerepo_jan_suggest/lib/python2.7/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/Users/shashank/Envs/knowledgerepo_jan_suggest/lib/python2.7/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/shashank/Envs/knowledgerepo_jan_suggest/lib/python2.7/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/shashank/Envs/knowledgerepo_jan_suggest/lib/python2.7/site-packages/knowledge_repo/app/auth_provider.py", line 49, in authorize
    user = self.get_user()
  File "/Users/shashank/Envs/knowledgerepo_jan_suggest/lib/python2.7/site-packages/knowledge_repo/app/auth_providers/oauth2.py", line 132, in get_user
    return self.extract_user_from_api()
  File "/Users/shashank/Envs/knowledgerepo_jan_suggest/lib/python2.7/site-packages/knowledge_repo/app/auth_providers/oauth2.py", line 159, in extract_user_from_api
    raise RuntimeError("Failure to extract user information from:\n\n {}".format(response.content))
RuntimeError: Failure to extract user information from:

 {"login":"test","id":12345,"avatar_url":"https://avatars1.githubusercontent.com/u/12345?v=4","gravatar_id":"","url":"https://api.github.com/users/test","html_url":"https://github.com/test","followers_url":"https://api.github.com/users/test/followers","following_url":"https://api.github.com/users/test/following{/other_user}","gists_url":"https://api.github.com/users/test/gists{/gist_id}","starred_url":"https://api.github.com/users/test/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/test/subscriptions","organizations_url":"https://api.github.com/users/test/orgs","repos_url":"https://api.github.com/users/test/repos","events_url":"https://api.github.com/users/test/events{/privacy}","received_events_url":"https://api.github.com/users/test/received_events","type":"User","site_admin":false,"name":null,"company":null,"blog":"","location":null,"email":null,"hireable":null,"bio":null,"public_repos":1,"public_gists":0,"followers":0,"following":1,"created_at":"2016-11-24T11:51:25Z","updated_at":"2018-01-19T06:59:10Z"}
harrdb12 commented 6 years ago

Hi @shashj199, it's hard to say what the error could be from the traceback alone. I should also mention I am using Python 3 and am not a Python guru, but if you post your Oauth2.py file I may be able to help.

shashj199 commented 6 years ago

@harrdb12 Nevermind, I got the next best thing working, Google authentication with your suggestions.

harrdb12 commented 6 years ago

@shashj199 Great to hear it. From my experience, debugging the GitHub authentication seems necessary if your environment isn't exactly like the one it was tested successfully for.

One thing I noticed that may have been the problem: the email in your dictionary is null (i.e. "email":null), but neither my code nor the original code is prepared to deal with this. The original code seems to be buggy in cases where the user doesn't have a public email configured, whereas my code only works if the email is set to the special value None (i.e. "email":None). Not sure why you're getting different results, other than possibly Python 2. AFAIK, there is no special null keyword in python.

pveragen commented 6 years ago

For Github Oauth there is still a problem on _"extract_fromdict"

def extract_from_dict(d, key):
            if isinstance(key, (list, tuple)):
                if len(key) == 1:
                    key = key[0]
                else:
                    return extract_from_dict(d[key[0]], key[1:])
            if isinstance(key, str):
                return d[key]
            raise RuntimeError("Invalid key type: {}.".format(key))

the Github "Identifier" list of keys is of length 2 (['email', 'login']), so the else return a call to the same function with a d['email'] as first parameter instead of d. So the fix proposed by @harrdb12 works for github specific case.

ghost commented 4 years ago

even though this is marked closed; I hope this helps someone in that I'm pretty positive line 160 in oauth2.py needs to be changed from: return extract_from_dict(d[key[0]], key[1:]) (which makes no sense) to return extract_from_dict(d, key[1:]) Which I guess is some kind of logic to prefer the second option in line 35 as a authentication login: i.e. then github oauth defaults to login

Or change line 35 to only one of the options (email or login) instead of: 'identifier': ['email', 'login'], do: 'identifier': ['login'],

Either way line 160 is most definitely wrong and only effects Github Oauth.