OHDSI / ROhdsiWebApi

An R package for interfacing with a WebAPI instance
https://ohdsi.github.io/ROhdsiWebApi
10 stars 17 forks source link

Idea for adding authentication to ROhdsiWebApi #141

Closed ablack3 closed 4 years ago

ablack3 commented 4 years ago

Overview

We would like to add the ability to authenticate to an instance of WebAPI so that this package and others that rely on in can be used when WebAPI security is enabled. However this must be done with careful thought since it will require lots of changes and must accommodate all possible modes of authentication and authorization to WebAPI.

Modes of Authentication

I'm still not clear on all the modes but here are some to consider.

In every case some kind of header needs to be added to POST, GET, and DELETE requests to let WebAPI know that the user is authenticated. This header token needs to be regenerated frequently as it will typically expire after 15 minutes (?).

Proposed solution

I'm proposing a solution that requires significant changes in this package. However I think this approach will accommodate all auth methods and allow for more features to be added in the future.

I'm effectively replacing baseUrl with a webApiConnnection object that contains the baseUrl along with other information including the header that needs to be added if authentication is in use. From the user's perspective this is conceptually similar to connecting to a database with DatabaseConnector::connect() . In the case of WebAPI the new connection object is similar to a connectionDetails object as well. However since the new object will contain a token that has a expiration I'm thinking of it more like a actual connection. Either way works though. A new function connectWebApi() has been added that will create the WebApiConnection object. connectWebApi() will perform the authentication and check that WebAPI is working before returning the connection object. The connection object would then be passed to the other functions in this package in place of baseUrl. connectWebApi() will support all authentication methods through the use of specific internal functions that perform each type of authentication. It will contain the baseUrl, the header to be added to httr calls as well as any other information we want to add in the future. It also supports to use of keyring as an easy way to securely pass in a password.

Future changes

Let's discuss this direction here and on issue #35 page. If we want to proceed with this approach I will need to essentially replace baseUrl with a WebApiConnection object, which I've called wc for short throghout the entire package. I also think it would be good design if this connection object was always the first argument to all the functions that use it. Argument order is important to function usability. Also any GET, POST, or DELETE requests will need to add the authHeader if one exists.

Minimal example

# auth example
# keyring::key_set("ATLAS") has been run at some future point to save a password
library(ROhdsiWebApi)
wc <- connectWebApi(baseUrl = "http://localhost/WebAPI",
                    authMethod = "db",
                    atlasUsername = "user@domain.com",
                    keyringServiceName = "ATLAS")

str(wc)
#> List of 3
#>  $ baseUrl   : chr "http://localhost/WebAPI"
#>  $ authMethod: chr "db"
#>  $ authHeader: chr "Bearer eyJhbGciOiJIUzUxMiJ9.eyJ9dWIiOiJhZG1pbkBkb219aW4uY29tIi9iZXhwIjoxNTkyNDgy8DExfQ.8rcpQcRr9Tpj6NVYqQpzz2ie"| __truncated__
#>  - attr(*, "class")= chr "WebApiConnection"

cohort5 <- getCohortDefinition(wc, 5)
str(cohort5, max.level = 1)
#> List of 9
#>  $ id            : int 5
#>  $ name          : chr "[LEGEND HTN] Coronary Heart Disease"
#>  $ description   : NULL
#>  $ createdBy     : chr "user@domain.com"
#>  $ createdDate   : chr "2020-06-05 14:39"
#>  $ modifiedBy    : chr ""
#>  $ modifiedDate  : NULL
#>  $ expression    :List of 11
#>  $ expressionType: chr "SIMPLE_EXPRESSION"

Created on 2020-06-18 by the reprex package (v0.3.0)

Let me know what you think about this plan! :)

ablack3 commented 4 years ago

The latest commits give a better idea of the breadth of changes that will happen if we take this route for adding authentication. It will definitely break existing code. However I think that any approach to adding authentication will require lots of changes. Open to all your thoughts and ideas. :)

ablack3 commented 4 years ago

An alternative approach could be to store the authHeader in an environment variable. Personally I like the idea of using a connection object and modeling then interface off of the way we interact with databases in R. I think it is generally better to pass data into functions explicitly. However using an environment variable would probably not require any changes to function signatures. Calls to PUT, GET, and DELETE would just pull the authHeader from the environment variable if it existed. The user would set the environment variable by calling a webApiAuthenticate function. Just throwing another option out there.

chrisknoll commented 4 years ago

I also like the idea of a connection object. I think env variables are a little harder to trace because they are valeus 'in the session' vs values that you provide explicitly. And, wouldn't an environment variable mean you can't handle 2 separate connections at the same time?

I was wondering if the API for this could be more 'connection oriented' but in a way that instead of passing the connection to each of the methods, there's a connection 'object' that is instantiated with the appropriate credentials, and the interace on the connection object talks through to the webapi:

wc <- createConnection(credentials, baseURI, etc)
wc.connect();
wc.getCohortDefinitions()
def <- wc.getCohortDefinition(id)
wc.saveCohortDefinition(def)
# make another connection
wc2 <- crreateConnection(otherCreds, otherBaseUri)
...
...

I wonder if in the context of the connection object's functions, it can store it's own sort of session-level variables that is available to all the underlying implementations so that the underlying functions don't need the connection sent to it, but on the other hand, it's under-the-covers of the API, so that level of the function definitions can be as ugly as convenient.

ablack3 commented 4 years ago

@chrisknoll Yes the global variable approach would only allow for a single connection per session. Do environment variables set in an R session linger after the session or are they deleted at the end of the session? I have not used them much.

I like what you're thinking! R has multiple OOP systems. In S3, the simplest one, methods belong to generic functions and not to objects. In R6 methods do belong to objects. Using R6 I think your example would look like

wc <- webApiConnection$new(credentials, baseURI, etc)
wc$connect();
wc$getCohortDefinitions()
def <- wc$getCohortDefinition(id)
wc$saveCohortDefinition(def)
# make another connection
wc2 <- webApiConnection$new(otherCreds, otherBaseUri)

However Advanced R recommends against using R6.

If you’ve learned OOP in another programming language, it’s likely that R6 will feel very natural, and you’ll be inclined to prefer it over S3. Resist the temptation to follow the path of least resistance: in most cases R6 will lead you to non-idiomatic R code. https://adv-r.hadley.nz/r6.html#introduction-13 https://adv-r.hadley.nz/oo-tradeoffs.html#s3-r6

My current direction in this pull request isn't so different and does not require R6.

wc <- createConnection(credentials, baseURI, etc)
getCohortDefinitions(wc)
def <- getCohortDefinition(wc, id)
# I think an action like saving a cohort def shouldn't require contacting webapi
saveCohortDefinition(def) 
# make another connection
wc2 <- createConnection(otherCreds, otherBaseUri)

Another benefit to using a connection object is that it could be attached to other objects. I'm imagining a cohort class that would answer the question "How should we store a cohort in R?" This cohort object might include a bunch of things like a webApiConnection, JSON, SQL, a cohort table resulting from generation/instantiation on a particular dataset, etc. Then we could do lots of stuff with that cohort object in R (eg generate, write to disk, compare, etc). We could have a similar class for a concept set. I'm probably getting ahead of myself but that's what I'm thinking about.

gowthamrao commented 4 years ago

Do environment variables set in an R session linger after the session or are they deleted at the end of the session?

I dont know the answer, but from what i read here

The objects don’t live in the environment so multiple names can point to the same object.

So the question I think is - on terminating R-session does the objects in memory also get purged.

The job of an environment is to associate, or bind, a set of names to a set of values. You can think of an environment as a bag of names:

If an object has no names pointing to it, it gets automatically deleted by the garbage collector. This process is described in more detail in gc

[Garbage Collector] (http://adv-r.had.co.nz/memory.html#gc)

Both R and the operating system are lazy: they won’t reclaim memory until it’s actually needed. R might be holding on to memory because the OS hasn’t yet asked for it back.

In some languages, you have to explicitly delete unused objects for their memory to be returned. R uses an alternative approach: garbage collection (or GC for short). GC automatically releases memory when an object is no longer used. It does this by tracking how many names point to each object, and when there are no names pointing to an object, it deletes that object.

So the question arises: do we keep track of the 'physical' memory location and check/confirm if the garbage collector has purged it.

Despite what you might have read elsewhere, there’s never any need to call gc() yourself. R will automatically run garbage collection whenever it needs more space; if you want to see when that is, call gcinfo(TRUE). The only reason you might want to call gc() is to ask R to return memory to the operating system. However, even that might not have any effect: older versions of Windows had no way for a program to return memory to the OS.

ablack3 commented 4 years ago

After some thought, conversations, and reading I think I'm going to opt for option 2 - use an environment variable for storing the authToken.

chrisknoll commented 4 years ago

But, won't that prevent you from accessing 2 webapi's on different connections/different auths?